Welcome to the "How I Use GreyNoise" video series. Our goal is to highlight how analysts, researchers, and others use GreyNoise products to solve their security problems. This session features:
Welcome, everyone. This is another session of "How I Use GreyNoise". And I'm here with the extraordinaire Aaron DeVera. Aaron, can you introduce yourself?
Yeah, absolutely. Hey folks, I'm Aaron, my pronouns are they/them. You know, I have a slide up that kind of just says my whole deal. I've worked in security research, security consulting, usually aligned to some sort of threat intelligence or security research role. And what I've been doing lately, I'm an independent security researcher. All my work goes under backchannel. So if you see that throughout the slides, that's literally just me. My whole thing is, I really like technology, I really don't like how technology almost always gets used to sell people mattresses in a more optimal way, rather than, you know, uplifting people, uplifting underserved communities.
So a lot of what I'm going to be talking about today is going to be the product of some sort of work I've done in the counter-abuse space, I'll get very detailed into how that all works. But I'm a member of what's called the Cyber Abuse Task Force here in New York, it used to be known as the Cyber Sexual Abuse Task Force. And you know, I'm just going to be browsing the web through this whole thing. If you want to learn more about what we do, you can just go to cyberabuse.nyc. I'm kind of the technical subject matter expert attached to this task force. So a lot of folks will come to me for a security audit, or to understand kind of what the latest and greatest in the gutter of the internet is.
So some of the experiential stuff that I'm going to be talking about when it comes to using GreyNoise for fingerprinting is going to be coming from that context, right? Someone has come to me, and we're trying to figure out who is visiting their Instagram profile ~20 times/day, or visiting their blog, or DDoS-ing their blog. So I'm gonna get pretty detailed into how that all works, what the engineering stack is behind it, and where GreyNoise fits in. So really, I'm going to be talking about data enrichment, and the way that you might ingest GreyNoise in a manner that is outside of the front end that they've built, and actually in your own analysis environment.
Before we dive into that, can you tell us a little bit about how you heard about GreyNoise and how long you've been using it?
I have no idea how I heard about GreyNoise. I'm always looking for fun hackery things. And GreyNoise definitely fits that bill. I think all of us have heard of the type of tools that GreyNoise constantly gets compared to, maybe casually, a Shodan or a Censys. I think what's really the simplicity of GreyNoise is that classification engine and the tag system beneath it. That's something that maybe you've seen internally, if you have a really awesome network classification sort of program happening in your internal security service. But I feel there aren't too many services that just provide that sort of annotation and classification for the web, and provide that as a service. So I just thought that was fantastic. I have been using it since, so, yeah.
Some of my previous work has been at White Ops. They're now known as HUMAN. But there's a lot of companies that I think have done a really fantastic job annotating the web, it's just not necessarily as freely available as GreyNoise. So that's why it kind of fits into this tech stack here. So I'll move on to that part. This is what the architecture looks like, very loosely. There's a freebie I'm gonna give away at the end of this talk. If you want to just ignore everything I'm saying and just go straight to it. A lot of what I've built is based around the idea that you can just create your own analytics engine. So I went ahead and open sourced kind of the guts of that. So if you want to go to GitHub, I'll put the link in the chat.
Back to the slides. So when I say "edge service", that's what I'm talking about. I've deployed something to something like Cloudflare, where it's going to handle the edge request from a client who's requesting it. It's then going to go to this detection and decisioning. What happens here is, it's the special sauce, right? You can have your own sort of made-up risk analysis engine and input that there. And in today's context, we're going to be talking about using GreyNoise here. And then you get some sort of decision out of that. In the GreyNoise context, it's going to be the classification of benign, malicious, unknown. But I'm doing much more on this side internally. There was a really great question on Twitter before this talk about, “Should we be using GreyNoise as the oracle for all of our decisioning and detection?” I mean, probably not. You should probably be using a whole host of sources. You know, there's a gajillion different free threat intel feeds that you can just get off of GitHub, you don't necessarily need to be paying for any of them.
Actually, yeah, that's probably another topic, I'm just gonna keep browsing the web here. But if you haven't seen awesome threat intelligence, it's just fantastic. Go ahead and go to this repo and star it immediately. But it's just all the free OSINT and feeds you could want, and none of them cost 500 million or 500 thousand a year or something. So yeah, you could get really built out in this respect. So it doesn't really matter, there's some sort of detection and decisioning happening in the middle. And then because of that, you're gonna serve a response back to the client device. So all of this ideally happens in under a second, we're going to be talking a lot about milliseconds here. And in fact, for the GitHub project that I pointed to that I'm open sourcing today, I did a couple of benchmarks. You could see, since I'm essentially man in the middling, I'm handling a request from a user, going to GreyNoise to get a response, and then returning a response back to the user, depending on the GreyNoise classification. All that needs to happen in a somewhat usable amount of time. If it was 10x slower than without the API lookup, I would probably just ditch this altogether. But as you can see, it's the difference between 57 milliseconds without the lookup versus sub-400 with it. And that's kind of the best speed I could get.
But I don't know, open sourcing it, maybe somebody will look at my code and be like, “Oh, you're an awful developer, you could optimize it in some way.” So hopefully somebody wants to do that. But I mean look, if this sounds complex to you, this all fits in under 100 lines of code. It's not that crazy. A lot of it is just comments, too. So yeah, it's chill. And I guess, if you want to see it for yourself, you could go to rangelife.backchannel.re. And if you want to look at the GreyNoise info that's going into that, you could just go to the info endpoint. So theoretically, if I were some sort of crawler, I would just get redirected to an ASCII rickroll thing that would happen. Obviously, I'm a nice human so GreyNoise didn't flag me.
But you could look at the info here that we're getting from Cloudflare. First off, it's a v6, so the GreyNoise response is going to just be, you know, that's not a valid IPv4. No matter. Here's some of this good Cloudflare data. Of course, you get the request headers. What I really like about Cloudflare is how they alter the headers to add all these great informational things. So sometimes you get the country, sometimes you also get the platform. So you can instantly, from the edge, do a lot of decisioning there, right? If you see that it's mobile, you can serve up some sort of page that is very mobile. So a lot of that just happens at the edge. It's fun. And you get some more juicy Cloudflare data. I mean, they're doing GeoIP enrichment for me, so I don't necessarily need to go do that. So yeah, super useful. That's my IP. Oh, scary.
The final part about this architecture is everything gets sunk to a data warehouse. I'm going to be talking a lot about Snowflake and using Snowflake and saying various Snowflake-y things. But that could just be Splunk. That could literally just be syncing to AWS. It could be anything, it doesn't matter. The point being is that you are putting it somewhere that you're storing. I mean, I think the OG Snowflake was on Postgres, right? Like, it could literally just be a Postgres in your room, right? Doesn't matter. The idea is that you're storing it somewhere, and storing it in a place where you can later do some sort of analysis on it.
So yeah, this architecture, you could spin it any which way. Let me talk about just a very real example, right? I have a client who deals a lot with abuse. They are just this awesome group that provides therapy for victims of cyber stalking, and all sorts of things that are just kind of icky, because people on the internet can be icky. So how do I protect the site? How do I protect this place?
All this ends up in my data warehouse, and the way that it gets there is first it goes to Confluent, which is Kafka. There's a streaming bus sort of architecture here that I just didn't put on the slide, to keep it simple, but the way that these arrows actually get to the data warehouse is that there's some sort of streaming system. In this case, it's Kafka. But in a lot of different scenarios, you're probably going to be using something similar to it, if not Kafka itself. I mean, Splunk is a streaming analytics engine, it's just proprietary. Everything in my stack so far, besides Snowflake, is fairly open source. So Kafka is divided into topics, pub/sub sort of model. And you can see the topics I have here, right? I have some enrichments here, all sorts of stuff. One thing I have is GreyNoise, right?
But say it was a very specific person in Louisiana with a very specific device that we know from an order of protection or a restraining order. If somebody fits the bill of something that we already know about, then yeah, I'm gonna set an alert or something. So yeah, let's actually look at this data as it's coming in. I'm going to spend a lot of time here in Snowflake. If you aren't used to Snowflake, I'm sorry. I love it, but I definitely have the blinders on in terms of trying to make it accessible to everybody. So if there's something you want to see you just let me know, I'll zoom in. But all of this is to say it's coming just straight from my collectors, right? Like my little web analytics engines.
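As a rough illustration of the flow described above (edge request, enrichment lookup, decision, response), here is a minimal Python sketch. It is not the code from the open-sourced project. `handle_request`, `fake_lookup`, and the redirect URL are hypothetical names chosen for illustration, and the lookup is injected so the sketch runs without network access. The only detail taken from the talk is the three-way classification of benign, malicious, or unknown.

```python
from typing import Callable, Dict

# Hypothetical stand-in for an enrichment API call: maps an IP to a
# classification string ("benign", "malicious", or "unknown").
Lookup = Callable[[str], str]

REDIRECT_URL = "https://example.com/not-for-crawlers"  # placeholder target


def handle_request(client_ip: str, lookup: Lookup) -> Dict[str, object]:
    """Decide what to serve based on the enrichment result."""
    classification = lookup(client_ip)
    if classification == "malicious":
        # Send flagged traffic elsewhere instead of serving the page.
        return {"status": 302, "location": REDIRECT_URL, "classification": classification}
    # Benign and unknown traffic is served normally in this sketch.
    return {"status": 200, "body": "Hello normal, nice human", "classification": classification}


def fake_lookup(ip: str) -> str:
    """Canned answers so the example is runnable offline."""
    return {"203.0.113.7": "malicious", "198.51.100.9": "benign"}.get(ip, "unknown")


if __name__ == "__main__":
    for ip in ("203.0.113.7", "198.51.100.9", "192.0.2.1"):
        print(ip, handle_request(ip, fake_lookup))
```

Injecting the lookup keeps the decision logic separate from the network call, which is also where the extra latency mentioned above would come from.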
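Since the lookup in this demo came back with "not a valid IPv4" for a v6 client, one option is to check the address family before spending time on the API call. The sketch below uses Python's standard `ipaddress` module. The function names are hypothetical, and whether a given enrichment service accepts IPv6 is an assumption to verify against its documentation.

```python
import ipaddress


def is_ipv4(address: str) -> bool:
    """Return True only for a syntactically valid IPv4 address."""
    try:
        return ipaddress.ip_address(address).version == 4
    except ValueError:
        # Not a valid IPv4 or IPv6 address at all.
        return False


def should_query_enrichment(address: str) -> bool:
    """Skip the lookup for IPv6 or malformed input (assumed IPv4-only service)."""
    return is_ipv4(address)


if __name__ == "__main__":
    for candidate in ("192.0.2.10", "2001:db8::1", "not-an-ip"):
        print(candidate, should_query_enrichment(candidate))
```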
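To make the topic-based pub/sub idea concrete without needing a Kafka cluster, here is a tiny in-memory stand-in. It is only a sketch of the model: `TopicBus`, the topic names, and the sample messages are all hypothetical, and a real deployment would use a Kafka client and a warehouse connector instead.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List

Handler = Callable[[Dict[str, object]], None]


class TopicBus:
    """Tiny in-memory pub/sub bus; a stand-in for a real streaming system."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, message: Dict[str, object]) -> None:
        # Deliver only to handlers registered for this topic.
        for handler in self._subscribers[topic]:
            handler(message)


# The list plays the role of the warehouse table that one topic sinks into.
warehouse_rows: List[Dict[str, object]] = []
bus = TopicBus()
bus.subscribe("greynoise", warehouse_rows.append)

bus.publish("greynoise", {"ip": "192.0.2.1", "classification": "unknown"})
bus.publish("ipinfo", {"ip": "192.0.2.1", "hostname": "example.invalid"})  # no subscriber yet
```

Only the message on the subscribed topic reaches the sink, which is the property that lets each enrichment have its own topic.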
So we have GreyNoise here. This is all GreyNoise data, you can see in this raw response, this is just from the API. So this is what happens when you sync all this to your own data warehouse, right? I'm no longer in the comfortability of the GreyNoise front end, which they've done an amazing job with. But rather, I'm in this kind of dungeon of my own design. So if you have data quality issues, yeah, you're gonna have a bad time. So the benefits here are pretty great. If you're really into analytics, first we can just say, “OK, get me everything where the classification was malicious.”
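The "get me everything where the classification was malicious" query can be sketched with Python's built-in `sqlite3` module standing in for the warehouse. The table name, column names, and sample rows are made up for illustration; the real schema in Snowflake is not shown in the talk.

```python
import json
import sqlite3

# In-memory database as a stand-in for the data warehouse.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE greynoise_enrichment (ip TEXT, classification TEXT, raw_response TEXT)"
)

# Hypothetical enrichment records; the raw API response is kept as JSON text.
sample = [
    {"ip": "203.0.113.7", "classification": "malicious"},
    {"ip": "198.51.100.9", "classification": "benign"},
    {"ip": "192.0.2.1", "classification": "unknown"},
]
conn.executemany(
    "INSERT INTO greynoise_enrichment VALUES (?, ?, ?)",
    [(r["ip"], r["classification"], json.dumps(r)) for r in sample],
)

# The single WHERE condition described in the talk.
malicious_ips = [
    row[0]
    for row in conn.execute(
        "SELECT ip FROM greynoise_enrichment WHERE classification = ? ORDER BY ip",
        ("malicious",),
    )
]
print(malicious_ips)
```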
OK, so here's a bunch of malicious traffic that my crawlers have received from this specific collector. And so, yeah, there's one where condition, whatever. But since I'm also collecting all this other data that is not GreyNoise and proprietary to my script, and I shouldn't say proprietary, but you know, I'm doing firsthand collections on my scripts. You know, I can start comparing like, “OK, well, let's look at what I saw. Let's look at some of the device interrogation I'm running. And then let's compare that to GreyNoise.” So some of you all on the GreyNoise Slack might have seen some of the things I was doing with proxies, for instance.
Let's go ahead and jump to that actually. It's probably gonna be under Hola VPN. Let's go to the GreyNoise Slack where I posted it...I'm talking about a post not too long ago that I did....almost all the posts I've been posting in the GreyNoise Slack, by the way, are directly from me just running these analytics. So if that kind of makes sense. So here's what I posted in the GreyNoise Slack, right? Let me just get the right query...
You could also see in the time that we've been talking, we've been getting the streaming requests. So you can see, again, this is Kafka, Confluent's the service, but you can run Kafka on your computer and have it be the streaming message bus. So in the time that we've been talking, there's been a couple of requests. Everything's been a noise false. So nothing too scary going on. So here's an example of like, “OK, I am collecting data, I know that I have my GreyNoise enrichment. Let's do a back to back.” So what I have here is a query that is very specifically looking for this type of VPN service. And I'm looking at all the things that I know to be Hola. And yet, the GreyNoise stuff wasn't flagged.
So that's the kind of thing where I brought it to the Slack being like, "Oh hey, here's how we can flag this type of proxy". The trick, by the way, for this specific one at least, is when you see this type of Luminati traffic, and Luminati, Hola, same thing. You're gonna see the proxy authorizations here, I'm not going to zoom in, or I'm not gonna scroll all the way because, you know, this is basically their username/passwords. But when you see that, you decode it, and it's like, Luminati in the username. OK, so this is someone using residential IP services, I'm probably not going to allow them to do that on my clients' websites.
So this is the kind of thing where, OK, it's in a data warehouse, I'm collecting data, I have this great enrichment from GreyNoise, I have this awesome enrichment from IPinfo, VirusTotal, etc. And I'm storing the classification here. So I can do the analysis on my own with my own data. You know, again, GreyNoise is a great interface for searching GreyNoise data. But really, where this magic happens is doing comparisons like this, where I'm looking at the data I collected, looking at the data GreyNoise has, and seeing, OK, where is a situation where I'm going to trust what I'm collecting maybe a little bit more in the risk confidence area, than what GreyNoise has for this particular session.
One of the great questions from Twitter, I think they're here on the call now. But they asked, “Should we always just be using GreyNoise as the Oracle for this sort of detection mechanism?” And the answer is ideally, you have some sort of confidence interval, based off a whole bunch of data. Use an awesome threat intel GitHub repo, use GreyNoise, use anything you can that is free, and build out a system. And hopefully, you know, if you want to give Cloudflare Workers a try, that GitHub thing I'm open sourcing today might be a good place to start. Because you can basically go into here and replace out this, maybe I shouldn't be advising people to not use GreyNoise, but you get to make an API request to anything here, right? Like it could go to Spamhaus (boo!) or it could go to VirusTotal, it could literally just go to anything. And you get that response of, should we block it?
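The decode step described here can be sketched in a few lines, assuming the value seen is a standard HTTP `Proxy-Authorization` header using the Basic scheme (base64 of `username:password`). The function names and the sample credentials are hypothetical, and the exact username format used by any particular proxy provider is not shown in the talk, so the substring check is only an illustration.

```python
import base64
import binascii
from typing import Optional


def proxy_auth_username(header_value: str) -> Optional[str]:
    """Extract the username from a Basic Proxy-Authorization header value."""
    scheme, _, token = header_value.strip().partition(" ")
    if scheme.lower() != "basic" or not token:
        return None
    try:
        decoded = base64.b64decode(token.strip(), validate=True).decode("utf-8")
    except (binascii.Error, UnicodeDecodeError):
        return None
    # Basic credentials are "username:password"; keep only the username.
    username, _, _password = decoded.partition(":")
    return username


def looks_like_luminati(header_value: str) -> bool:
    """Flag sessions whose proxy username mentions the provider by name."""
    username = proxy_auth_username(header_value)
    return username is not None and "luminati" in username.lower()


# Made-up credentials purely for demonstration.
example = "Basic " + base64.b64encode(b"luminati-example-user:not-a-real-password").decode("ascii")

if __name__ == "__main__":
    print(proxy_auth_username(example), looks_like_luminati(example))
```

Only the username is inspected; the decoded password is discarded rather than stored, in keeping with the caution about credentials above.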
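One way to picture combining several sources into a single block decision is a simple weighted score. This is only a sketch: the signal names, weights, and threshold below are invented for illustration, it is a plain additive score rather than a statistical confidence interval, and it is not how the open-sourced project works.

```python
from typing import Dict

# Hypothetical weights: how much each source's "bad" verdict contributes.
WEIGHTS: Dict[str, float] = {
    "greynoise_malicious": 0.6,
    "threat_feed_hit": 0.3,
    "residential_proxy": 0.4,
}
BLOCK_THRESHOLD = 0.5  # hypothetical cut-off


def risk_score(signals: Dict[str, bool]) -> float:
    """Sum the weights of the signals that fired, capped at 1.0."""
    score = sum(WEIGHTS[name] for name, fired in signals.items() if fired and name in WEIGHTS)
    return min(score, 1.0)


def should_block(signals: Dict[str, bool]) -> bool:
    """Answer the 'should we block it?' question from the combined score."""
    return risk_score(signals) >= BLOCK_THRESHOLD


if __name__ == "__main__":
    print(should_block({"greynoise_malicious": True}))
    print(should_block({"threat_feed_hit": True}))
    print(should_block({"threat_feed_hit": True, "residential_proxy": True}))
```

With these made-up weights, a malicious classification alone is enough to block, while a single feed hit is not; tuning that balance is the judgment call the talk describes.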
So yeah, hopefully this boilerplate kind of helped you out. And how you might use this in some sort of pipeline, where you're doing real-time enrichment in such a way. And of course, and hopefully you're storing it to a data warehouse where you can actually analyze it...
I just want to drop in and say a couple things that I think are important to highlight. Obviously, this session is called "How I Use GreyNoise". So we'd really like for you to use our data, and have a lot of confidence. But to your point, and I think, Adam, what you said on Twitter is really pertinent. And I think the core of working in intel is that your data will always be the top source; you can't get any better than your own telemetry. So of course, you know, with GreyNoise and other vendors like that, that type of data is secondary, but for instance, in your example, hopefully just adding some color and context is really valuable. And to double down, you can see a lot more fields once you go into the Visualizer. So I know we were looking at the Hola VPN too, and in the community Slack, there was a good conversation with Tom, who is the head of Spur. So I don't think they currently tag Hola, but I was just looking at some of the IPs and I know that we were able to identify some of them as VPNs. But this is a great example of community-driven research that is helping us internally at GreyNoise to go back and review some of our data, and add fields and continue to collect on that. So just wanted to highlight that they've been an invaluable asset in helping us do so.
Is this a good time to open up for questions? Aaron, did you have more stuff you wanted to cover?
That was the bulk of it. And again, if everyone wants to see something specific I can just go back and zoom in. It's all just easy to look up so...
Well, we can take a moment then to answer any questions. Feel free to unmute, or you can drop them in the chat. And I can dictate them. Whichever. I ask that you raise your hand.
I have a question about VPN, people coming in from a VPN. We have legitimate users coming into the corporate network from various VPNs. And we get alerts about these. And I think currently we're not doing as much for them as we can. We could have someone from the SOC reach out to that user and say, "Are you using a VPN?" But that gets kind of tedious. Do you have any ranking or insight, like a way of saying...is there something that you're aware of that you can do with GreyNoise to help evaluate the likelihood of the VPN ingress being harmful?
Yeah, so that's a great question, because there are tons of VPN tags on GreyNoise. So right now, I'm just like, if you haven't looked at this view in GreyNoise, by the way, it's super helpful. All those tags that the research team at GreyNoise is annotating on the traffic that they see are just on the tags view. And a lot of times I've referenced this to see if there's a certain VPN tag that's around. There's also the VPN service. And as it auto populates, you can see some of it, right? Let's look at...whatever. There's an anonymous VPN, right? That doesn't sound like a VPN I want to allow. So we have some fairly high confidence when GreyNoise is able to say, “It's not just a VPN, but THIS VPN, where we see malicious traffic from it all the time.” In this particular one, it looks like there's six sessions that are malicious, 24 unknown. And again, I'm seeing all this in real time. And then I also get it sunk to my data warehouse, so I can basically use any of these decisions from GreyNoise or tags from GreyNoise as part of my strategy, and in determining whether or not this is a malicious session that's about to happen.
So Supriya was just saying a lot of this comes from GreyNoise, a lot of it comes from Spur analytics (Spur.us, I believe), which is also doing some amazing...but GreyNoise, they are doing awesome work annotating the web. And, of course, we're not going to know every single VPN IP address out there, especially when you're talking about the ones I was talking about with Luminati, where they're basically hijacking other people's IPs, residential IPs. So the best that we can do is, you know, have some sort of strategy where we can evaluate the risk of the session and have some sort of threshold or confidence interval for it. The thing is, when GreyNoise is able to say, “Oh, this is a VPN, and it's classified malicious,” to me, that's all I need, right? It's a highly reliable source, it's consistent. And I don't need to do any further analysis if GreyNoise is telling me that. It's really those cases where, you know, GreyNoise might have seen it, and it's unknown, so I need to do a little bit extra digging to see if I want to allow it.
One of the enrichments I have here is IPinfo, which is just my favorite GeoIP service, but one of the best parts about it is their host name identification. So, Adam, you brought up corporate VPNs, right? Well, I can see the corporate VPN hosted on AWS for some of these services. And the way I can do that is by looking at what IPinfo is saying is the host name. So here we have a Pinterest crawler, right? And again, this is the raw response from IPinfo, not GreyNoise, but what I'm going to say is, “OK, well, it's AWS, and we know that because of the ASN IP, let's go ahead and see what GreyNoise says about that IP.” So that's why sources are also definitely useful here. But you know, it gives us a little bit of intelligence. It's like, in that hostname that we see from IPinfo, it's crawl pinterest.com, right? So, I don't need to do too much guessing to say, “OK, this is a crawler.”
So I think some of the things I worry about most are people running a self-hosted VPN on AWS that just looks like corporate traffic. And what that would look like is one of these sessions, right? We have an AWS ASN, the IP is definitely from EC2, we know that from the host name. But this could be somebody running WireGuard for themselves. And there's just next to nothing I know about that session from that sort of data to tell me whether or not this is a good actor or a bad actor. So, you know, that's probably a time where I would let the session run, look at the type of data I'm collecting, and look at what they're doing on the site. If this session is happening at the same time as some sort of DDoS, maybe they're part of that DDoS. If it's just scrolling around the website, and up and down, and going through the navigation, it's more than likely going to be some regular person just reading about the site. I can't make too much distinction on my own in that situation. And as security operators, ideally, we have data, either historical that we own, or enrichment, like GreyNoise, to lean on and help us make these decisions of whether or not this was a good session or a bad session.
But at the end of the day, it's usually not a nice binary. It's more just like a confidence interval of risk. So yeah, it's super interesting, because like corporate VPNs will always just look like this. And it's a big, “Well I don't know about it.” So it was a really good question that you asked, Adam. I do have the bonus for you in looking for rogue AWS devices...
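The hostname reasoning above can be sketched as a small classifier. It assumes the common public EC2 reverse-DNS shape (`ec2-<ip-with-dashes>.<region-part>.amazonaws.com`) and treats anything under `pinterest.com` as a crawler candidate. The function name and return labels are hypothetical, and reverse DNS is only a hint, not proof of who is behind the IP.

```python
import re
from typing import Optional

# Public EC2 hostnames generally look like ec2-203-0-113-7.compute-1.amazonaws.com
EC2_PATTERN = re.compile(r"^ec2-\d{1,3}(?:-\d{1,3}){3}\..*\.amazonaws\.com$")


def hostname_hint(hostname: Optional[str]) -> str:
    """Return a coarse label for a reverse-DNS hostname (labels are made up)."""
    if not hostname:
        return "no-hostname"
    host = hostname.lower().rstrip(".")
    if EC2_PATTERN.match(host):
        # Could be corporate infrastructure or a self-hosted VPN; the name alone can't say.
        return "aws-ec2"
    if host == "pinterest.com" or host.endswith(".pinterest.com"):
        return "pinterest-crawler-candidate"
    return "other"


if __name__ == "__main__":
    for name in (
        "ec2-203-0-113-7.compute-1.amazonaws.com",
        "crawl-203-0-113-8.pinterest.com",
        None,
    ):
        print(name, hostname_hint(name))
```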
Did we answer your question, Adam?
It did, it actually expanded on what I was wondering about. Our case is more people using ExpressVPN or whatever coming in to a corporate network. So they might actually be using their VPN, their sort of private or consumer VPN, to then come through our corporate VPN to get into the corporate network. And, of course, we're flagging it, DataCamp has been showing up a lot lately. And that tag that you put in, I mean, let's pretend that I'm at a very early stage, and I'm a wide-eyed neophyte when it comes to GreyNoise. I'm just trying to expand my learning here. But yeah, it did not only answer my question, because Supriya, you've added that tag into the chat. Very useful. And, Aaron, you've also pointed out a whole new set of questions that I could have asked about people setting up their own VPN in AWS. So yeah, I think I've got a lot to think about already. Thank you. And that does lead nicely to that bonus item that you talked about. So I'll be quiet now. Thank you.
It's all good. Yeah, so the query I just posted here is about rogue AWS. One of the questions from Twitter was about things like AWS sessions, not knowing whether this is going to be malicious or not. Let's actually just take out..I have the classification malicious here in this query...but let's just take that out and see, OK, what percentage of the time, how often is GreyNoise seeing AWS and it's not getting the malicious...so these are just two AWS ASNs. And you can see down here it's, OK, two are benign from what GreyNoise is seeing. We've got about 6,000 that are malicious, and the unknowns are a little under 113,000.
So more or less, if we understand GreyNoise's global scale, we can basically say, “OK, then this percentage of the time, like six in 113, we're going to see a malicious AWS session.” So that, again, adds to that risk tolerance sort of question of, “OK, we see it's AWS. According to GreyNoise it has this X in N chance of being malicious.” If it is actually getting the classification malicious flag from GreyNoise, then it's like, “OK, yeah, drop it, drop the packet.”
So, again, one of those things where it's like, if the classification "malicious" is there, I just absolutely trust GreyNoise. Like, why not? It's really for those unknowns that we've been talking about this past ~20 minutes, which is the edge case, adding more data as much as you can to better understand it, and hopefully, posting about it in a place like the GreyNoise Slack, where we can all be like, "Ah, yes, bad." and we can add it to our own internal tracking. But yeah, so the query I posted in the chat is the one that we're looking at here, it's classification malicious, and this one is unknown. But when you look at some of these unknowns, it's like, “OK, this one has a Paramiko tag, that's an SSH brute force.” Or maybe not, but likely. Some of these are not necessarily things I would want to allow. And when you look at some of your own data of what you've seen, maybe from that IP range, and it's all crawling, then it's like, “OK, yeah, maybe I shouldn't allow that.”
And again, host names are super useful. It's not going to happen on all of IPinfo's traffic or GreyNoise's traffic, but when it is there, it is very useful, because we can see that this IP, while classified as unknown, looks to be some sort of analytics subdomain of whatever service this is, perhaps it's also a crawler. We can make some sort of insights from how reverse DNS has mapped this IP to some domain somewhere. So yeah, always a super useful thing. Again, it's all about confidence intervals. Some of these might be just extremely non informational. And that's okay. You're just using the data as you get it to do your best.
I'll also point out that that data, if you go back to the IPinfo page, has a note that it's spoofable. So again, this is activity that we're seeing that may or may not be from this actual IP. Our credo at GreyNoise is to throw as much information as we can at analysts, to add any type of context and any information, and to be as helpful as we can with, as you brought up, that risk and decision matrix.
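The "six in 113" figure can be checked with quick arithmetic. The counts below are the approximate numbers read out from the query (2 benign, about 6,000 malicious, a little under 113,000 unknown), so the result is a rough estimate rather than an exact rate.

```python
# Approximate counts quoted from the query result.
counts = {"benign": 2, "malicious": 6_000, "unknown": 113_000}

total = sum(counts.values())

# Share of all observed AWS sessions that were classified malicious.
malicious_share = counts["malicious"] / total

# The "six in 113" framing: malicious sessions relative to unknown ones.
malicious_per_unknown = counts["malicious"] / counts["unknown"]

print(f"malicious share of all sessions: {malicious_share:.1%}")
print(f"malicious relative to unknown:   {malicious_per_unknown:.1%}")
```

Either way of counting lands at roughly 5 percent, which is the kind of base rate that can feed into a risk threshold.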
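The handling of unknowns described here (trust a malicious classification outright, but dig into tags when the classification is unknown) could be sketched as a small triage function. The watchlist contents, the function name, and the returned labels are all hypothetical; "Paramiko" is the only tag name taken from the talk.

```python
from typing import Iterable

# Hypothetical watchlist of tag names that warrant a closer look.
SUSPICIOUS_TAGS = {"paramiko"}


def triage(classification: str, tags: Iterable[str]) -> str:
    """Map a classification plus tags to a coarse next step."""
    if classification == "malicious":
        return "block"
    if classification == "benign":
        return "allow"
    # Unknown: look at the tags before deciding.
    if any(tag.lower() in SUSPICIOUS_TAGS for tag in tags):
        return "review"
    return "allow-and-observe"


if __name__ == "__main__":
    print(triage("unknown", ["Paramiko"]))
    print(triage("unknown", []))
    print(triage("malicious", []))
```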
Yeah. Any more questions? Anything else people want to see here? Has anyone visited my little domain and gotten anything other than, "Hello normal, nice human"?
I had some fun with it.
Oh yeah? Did you get the rickroll?
I got a redirect to a domain, but it failed to resolve because the public domain, the DNS record, wasn't resolvable. But I did get the redirect, and when you were querying the data, the top result was me. And it was like, "Haha! There I am."
Nice. Yeah. If you go into the code, what I'm doing is, I'm redirecting to my friend's site, and they have an ASCII rickroll thing, but I didn't actually look into how runnable that even is in the browser. So again, you could just drop packets, if you think that something is going to be a bad session. I mean, that's the whole point of this: using GreyNoise to understand, is the classification malicious? Should we drop this? So yeah. I guess I'll just continue posting some of these links so you have them handy, but yeah, that's about it.
Again, the whole point here was to really talk about integrating it directly with your data. That's only as good as if you have an interface where you enjoy being in it. I enjoy being in Snowflake. But GreyNoise has put a lot of work into making the front end very useful, and there's just the ease of going through here and being like, OK, organization:datacamp and having a lot of those fields suggest and populate for me. It's just super easy. So I really enjoy using this to start setting up what I'm about to code and integrate.
This is amazing. Aaron, will you do me a favor? Will you go up, scroll up and go to Resources, Documentation?
And I guess this is, this is where I started the project, I read this first.
Of course, as everyone does, read our docs. If you go to Integrations, you can kind of scroll to the bottom. But long story short, I just kind of wanted to show everyone that we obviously have a ton of integrations. GreyNoise has a fabulous UI, and I'm very proud of our Visualizer. But the idea is that you should be able to use GreyNoise wherever you live for most of the time. So for Aaron, if that's Snowflake, then by all means, Snowflake. And I think that there's at least 30 integrations on this list. And it just keeps growing. I think Brad Chiappetta is on the call too, and Brad is our integrations expert. But we'll be adding this Cloudflare integration as one of the community contributed integrations. If there is a project or an integration that you have developed, or you know someone that has, please let me know and I will be sure to add it to this list as well. We're trying to be as exhaustive as possible and make sure that we're giving proper credit to those who have worked hard to develop things like that.
And it also allows other people to go in and, you know, push contributions or, you know, make any changes or additions, especially as our API continues to develop, and our data continues to grow. So fantastic! Well, Aaron, thank you so, so much for this, this was really, really valuable and I think kind of shows yet another way that, you know, people can use GreyNoise. So again, I'm happy to take questions offline. You can join us on the community Slack, if you're not already there. I can drop the link right now.
Yeah, thanks all for attending and thanks GreyNoise for having me. Like, I know I could talk about some of the data very granularly so again, just ping me or ping Supriya. If there's anything here that you want to follow up on…we were having some of the conversation on Twitter, we were also in the GreyNoise Slack. The GreyNoise Slack is somewhere where there's just a channel of like, “Are you seeing this?” If you ever find something cool, just post it. It'll be a very fun convo.
Absolutely. So this has been yet another "How I Use GreyNoise" session. If you are actually using GreyNoise or someone you know is, and you want to do one of these sessions, you can reach out to me at email@example.com. My name is Supriya, by the way. I don't think I ever said that. But it's nice to meet all of you. And I hope you join us on Slack or on Twitter, or anywhere that you feel that you can be reached. I look forward to it. See you later.