It’s well known that the window between CVE disclosure and active exploitation has narrowed. But what happens before a CVE is even disclosed?
In our latest research “Early Warning Signals: When Attacker Behavior Precedes New Vulnerabilities,” GreyNoise analyzed hundreds of spikes in malicious activity — scanning, brute forcing, exploit attempts, and more — targeting edge technologies. We discovered a consistent and actionable trend: in the vast majority of cases, these spikes were followed by the disclosure of a new CVE affecting the same technology within six weeks.
This recurring behavior led us to ask:
Could attacker activity offer defenders an early warning signal for vulnerabilities that don’t exist yet — but soon will?
The Six-Week Critical Window
Across 216 spikes observed by our Global Observation Grid (GOG) since September 2024, we found:
80 percent of spikes were followed by a new CVE within six weeks.
50 percent were followed by a CVE disclosure within three weeks.
These patterns were exclusive to enterprise edge technologies like VPNs, firewalls, and remote access tools — the same kinds of systems increasingly targeted by advanced threat actors.
Why This Matters
Exploit activity may be more than what it seems. Some spikes appear to reflect reconnaissance or exploit-based inventorying. Others may represent probing that ultimately results in new CVE discovery. Either way, defenders can take action.
Blocking attacker infrastructure involved in these spikes may reduce the chances of being inventoried — and ultimately targeted — when a new CVE emerges. Just as importantly, these trends give CISOs and security leaders a credible reason to harden defenses, request additional resources, or prepare strategic responses based on observable signals — not just after a CVE drops, but weeks before.
What’s Inside the Report
The full report includes:
A breakdown of the vendors, products, and GreyNoise tags where these patterns were observed.
Analysis of attacker behavior leading up to CVE disclosure.
The methodology used to identify spikes and establish spike-to-CVE relationships.
Clear takeaways for analysts and CISOs on how to operationalize this intelligence.
This research builds on our earlier work on resurgent vulnerabilities, offering a new lens for defenders to track vulnerability risk based on what attackers do — not just what’s been disclosed.
Computers don’t understand words. They don’t understand verbs, nouns, prepositions, or even adjectives. They kinda understand conjunctions (and/or/not), but not really. Computers understand numbers.
To make computers do things with words, you have to make them numbers. Welcome to the wild world of text embedding!
In this blog I want to teach you about text embedding, why it’s useful, and a couple ways to do it yourself to make your pet project just a little bit better or get a new idea off the ground. I’ll also describe how we’re using it at GreyNoise.
Use Cases
With LLMs in the mix, modern use cases of text embedding are all over the place.
Can’t fit all your text into a context window? Try embedding and searching the relevant parts.
Working on sentiment analysis? Try building a classifier on embedded text.
Looking to find similar texts? Create a search index on embedded values.
Want to build a recommendation system? Figure out how things are similar without building a classification model.
In all of these, we’re encoding a piece of text into a numerical vector in order to do basic machine learning tasks against it, such as nearest neighbors or classification. If you’ve been paying attention in class, this is just feature engineering, but it’s unsupervised and on unstructured data, which has previously been a really hard problem.
How to method 1: Cheat mode
Lots of large models offer sentence-level embedding APIs. One of the most popular is OpenAI’s (https://platform.openai.com/docs/guides/embeddings). It doesn’t cost a ton, probably under $100 for most data sets, but you’re dependent on the latency of external API calls and the whims of another company. Plus, since it’s a GPT model, the embedding is based on the encoding of the last word in your text (with the cumulative words before it), which doesn’t feel as cohesive as what I’m going to suggest next. (This is foreshadowing a BERT vs. GPT discussion.)
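For illustration, here’s a minimal sketch of that API using OpenAI’s Python client (the model name and the sample input are assumptions, not a recommendation):

from openai import OpenAI

client = OpenAI()  ## reads OPENAI_API_KEY from the environment
response = client.embeddings.create(
    model="text-embedding-ada-002",
    input="GET /setup.cgi?next_file=netgear.cfg HTTP/1.1",
)
vector = response.data[0].embedding  ## a plain list of floats, ready for similarity math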
How to method 2: Build your own
Side quest: GPT vs BERT
GPT stands for Generative Pre-trained Transformer. It is built to predict the next word in a sequence, and then the next word, and then the next word. Alternatively, BERT stands for Bidirectional Encoder Representations from Transformers. It is built to predict any word within a set piece of text.
Because they both use Transformers, the main difference between them is where they mask data while training. During the training process, a piece of the text is masked, or obscured, from the model, and the model is asked to predict it. When the model gets it right, hooray! When the model gets it wrong, booo! These outcomes either reinforce or change the weights of the neural network to hopefully better predict in the future.
GPT models only mask the last word in a sequence. They are trying to learn and predict that word. This makes them generative. If your sequence is “The cat jumped” it might predict “over”. Then your sequence would be “The cat jumped over” and it might predict “the”, then “dog”, etc.
BERT models mask random words in the sequence, so they take the entire sequence and try to figure out the word based on what came before and after (bidirectional!!!). For this reason, I believe they are better for text embedding. Note: the biggest GPT models are orders of magnitude bigger than the biggest BERT models because there is more money in generation than encoding/translation, so it is possible GPT-4 does a better job at generic sentence encoding than a home-grown BERT, but let’s all collectively stick it to the man and build our own; it’s easy.
Figure 1: BERT based masking
Figure 2: GPT based masking
Main quest: Building a text encoder
If your data is not just basic English text, building your own encoder and model might be the right decision. At GreyNoise, we have a ton of HTTP payloads that don’t exactly follow typical English language syntax. For that reason, we decided to build our own payload model and wanted to share the knowledge.
There are two parts to an LLM, the same parts you’ll see in HuggingFace models (https://huggingface.co/models) and everywhere else: a Tokenizer and a Model.
Tokenizer
The tokenizer takes your input text and translates it to a base numerical representation. You can train a tokenizer to learn vocabulary directly from your dataset or use the one attached to a pre-trained model. If you are training a model from scratch you might as well train a tokenizer (it takes minutes), but if you are using a pre-trained model you should stick with the one attached.
Tokens are approximately words, but if a word is over 4-5 characters it might get broken up. “Fire” and “fly” could each be one token, but “firefly” would be broken into 2 tokens. This is why you might often hear that tokens are “about ¾ of a word”; it’s the average ratio of words to tokens. Once you have a tokenizer, it can translate a text into integers representing indices into the tokenizer’s vocabulary.
“The cat jumped over” -> 456, 234, 452, 8003
Later, supposing we have a model, if you have the output 8003, 456, 234, 452 (I reordered on purpose) you could translate that back to “over the cat jumped”
The tokenizer is the translation of a numeric representation to a word (or partial word) representation.
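In code, that round trip looks something like this (the token IDs are illustrative; a real tokenizer will produce different integers and may add special tokens):

ids = tokenizer.encode("The cat jumped over")  ## e.g. [456, 234, 452, 8003]
text = tokenizer.decode([8003, 456, 234, 452])  ## "over The cat jumped" (reordered on purpose)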
Model
With a tokenizer, we can pass numerical data to a model and get numerical data out, and then re-encode that to text data.
We could discuss the models, but others have done that before (https://huggingface.co/blog/bert-101). All of these LLM models are beasts. They have basic (kinda) components, but they have a lot of them, which makes for hundreds of millions to billions of parameters. For 98% of people, you want to know what a model does, its pitfalls, and how to use it without knowing the inner workings of how transformer layers are connected to embedding, softmax, and other layers. We’re going to leave that to another discussion. We’ll focus on what it takes to train and get a usable output.
The models can be initialized with basic configs and trained with easy prompts, thanks to the folks at Huggingface (you the real MVP!). For this we are going to use a RoBERTa model (https://huggingface.co/docs/transformers/model_doc/roberta). You could use a pre-trained model and fine-tune it; however, we’re just going to use the config and build the whole model from scratch with random weights. A very similar workflow works if you want to use a pre-trained model and tokenizer, though. I promise I won’t judge.
Create your own list of text you want to train the encoder and model on. It should be at least 100k samples.
If you have created your data set as `str_data` and set a model path as a folder where you want to save the model and tokenizer, you can just do:
tokenizer = create_tokenizer(model_path, str_data[0:50000]) ## you don’t really need more than 50k to train the tokenizer
model = create_model_from_scratch(model_path, tokenizer)
This will create the tokenizer and model. The tokenizer is usable at this state. The model is just random untrained garbage though.
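For reference, here’s a minimal sketch of what those two helpers might look like, built on HuggingFace’s tokenizers and transformers libraries (the function bodies and config sizes are assumptions, not the exact values GreyNoise used):

from tokenizers import ByteLevelBPETokenizer
from transformers import RobertaConfig, RobertaForMaskedLM

def create_tokenizer(model_path, texts):
    ## RoBERTa uses a byte-level BPE vocabulary, so train one on the raw samples
    tokenizer = ByteLevelBPETokenizer()
    tokenizer.train_from_iterator(
        texts,
        vocab_size=30_000,
        min_frequency=2,
        special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
    )
    tokenizer.save_model(model_path)  ## writes vocab.json and merges.txt
    return tokenizer

def create_model_from_scratch(model_path, tokenizer):
    ## Randomly initialized weights; layer/head counts here are placeholders
    config = RobertaConfig(
        vocab_size=tokenizer.get_vocab_size(),
        max_position_embeddings=514,
        num_attention_heads=12,
        num_hidden_layers=6,
        type_vocab_size=1,
    )
    model = RobertaForMaskedLM(config=config)
    model.save_pretrained(model_path)
    return model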
When you’re ready for the first training run, get into the habit of loading the tokenizer and model you created and then training them; these saved snapshots are what people call “checkpoints”.
tokenizer = RobertaTokenizerFast.from_pretrained(model_path, max_len=512)
model = RobertaForMaskedLM.from_pretrained(model_path)
model = train_model(tokenizer, model, model_path, str_data[0:100000]) ## train on however much you want at a time, there is a whole other discussion about this, but give it at least 100k samples.
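A plausible `train_model` sketch using HuggingFace’s Trainer (again, an assumption; the hyperparameters are placeholders):

import torch
from transformers import DataCollatorForLanguageModeling, Trainer, TrainingArguments

class TextDataset(torch.utils.data.Dataset):
    ## Wraps tokenized samples so the Trainer can iterate over them
    def __init__(self, encodings):
        self.encodings = encodings
    def __len__(self):
        return len(self.encodings["input_ids"])
    def __getitem__(self, idx):
        return {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}

def train_model(tokenizer, model, model_path, texts):
    encodings = tokenizer(texts, truncation=True, padding="max_length", max_length=512)
    collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer, mlm=True, mlm_probability=0.15  ## BERT-style random masking
    )
    args = TrainingArguments(
        output_dir=model_path,
        num_train_epochs=1,
        per_device_train_batch_size=16,
        save_steps=10_000,
    )
    trainer = Trainer(
        model=model, args=args, data_collator=collator, train_dataset=TextDataset(encodings)
    )
    trainer.train()
    trainer.save_model(model_path)  ## the checkpoint for your next fine-tuning round
    return model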
When you want to retrain or further train, which at this point is also called fine-tuning, just load it up and go again with new or the same data. Nobody is your boss and nobody really knows what is best right here.
Note: You’re going to want to use GPUs for training. Google Colab and Huggingface Notebooks have some free options. All in, this particular model will require 9-10GB of GPU memory, easily attainable by commodity hardware.
Evaluating
Large Language Models do not have a great list of sanity checks; ironically, most benchmarks are against other LLMs. For embeddings, we can do a little better for your personal model. Take two samples that you think are similar, run them through the model to get their embeddings, and calculate how far apart they are with either cosine or Euclidean distance. This gives you a sanity check of whether your model is performing as expected or just off the rails.
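To produce the embeddings in the first place, one common approach (an assumption here; the post doesn’t specify a pooling strategy) is to load the trained checkpoint as a plain encoder and average the last hidden state across tokens:

import torch
from transformers import RobertaModel, RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained(model_path, max_len=512)
encoder = RobertaModel.from_pretrained(model_path)  ## same checkpoint, minus the MLM head

def embed(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = encoder(**inputs)
    ## Mean-pool the token vectors into one fixed-length sentence vector
    return outputs.last_hidden_state.mean(dim=1).squeeze().numpy()

embedding1 = embed("GET /setup.cgi?next_file=netgear.cfg")
embedding2 = embed("GET /setup.cgi?next_file=netgear.cfg HTTP/1.1")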
For Euclidean distance use:
import numpy as np
euclidean_dist = np.linalg.norm(embedding1 - embedding2)
For cosine distance use:
from sklearn.metrics.pairwise import cosine_similarity
cos_sim = cosine_similarity(embedding1.reshape(1, -1), embedding2.reshape(1, -1))
How We’re Using it at GreyNoise
We’re early adopters of LLM tech at GreyNoise, but it is hard to put it in the hands of users responsibly. We basically don’t want to F up. We have an upcoming feature called NoiseGPT that takes natural language text and turns it into GNQL queries. Gone are the days of learning a new syntax just to figure out what the hell is going on.
We also have an in-development feature called Sift, a way to tease out the new traffic on the internet and describe it for users. It takes the hundreds of thousands of HTTP payloads we see every day, reduces them to the ~15 new and relevant ones, and describes what they are doing. EAP coming on that soon.
Plus, if you think of any great ideas we should be doing, please hit us up. We have a community slack and my email is below. We want to hear from you.
Fin
With these tips I hope you’re able to create your own LLM for your projects or at least appreciate those that do. If you have any questions please feel free to reach out to daniel@greynoise.io, give GreyNoise a try (https://viz.greynoise.io/), and look out for features using these techniques in the very near future.
On Monday, May 1, 2023, CISA added CVE-2021-45046, CVE-2023-21839, and CVE-2023-1389 to the Known Exploited Vulnerabilities (KEV) list. For all three CVEs, GreyNoise users had visibility into which IPs were attempting mass exploitation prior to their addition to the KEV list. GreyNoise tags allow organizations to monitor and prioritize the handling of alerts regarding benign and, in this case, malicious IPs.
CVE-2021-45046
Apache Log4j2 contains a deserialization of untrusted data vulnerability due to the incomplete fix of CVE-2021-44228, where the Thread Context Lookup Pattern is vulnerable to remote code execution in certain non-default configurations.
First observed by GreyNoise: December 9, 2021. Added to KEV: May 1, 2023.

CVE-2023-21839
Oracle WebLogic Server contains an unspecified vulnerability that allows an unauthenticated attacker with network access via T3 or IIOP to compromise Oracle WebLogic Server.
First observed by GreyNoise: March 6, 2023. Added to KEV: May 1, 2023.

CVE-2023-1389
TP-Link Archer AX-21 contains a command injection vulnerability that allows for remote code execution.
First observed by GreyNoise: April 25, 2023. Added to KEV: May 1, 2023.
Bonus Update:
On Thursday, April 27, 2023, GreyNoise released a tag for the critically scored CVE-2023-21554, QueueJumper, a Microsoft message queuing remote code execution vulnerability.
As of this publication, we have not observed mass exploitation attempts, but have observed >600 IPs that are attempting to discover Internet-facing Microsoft Windows devices that respond over Microsoft Message Queuing (MSMQ) binary protocol.
Crawlers that find public, unsecured environment files continue to be used to compromise organizations.
On Tuesday, April 25, 2023, GreyNoise is changing how we classify environment file crawlers from unknown intent to malicious intent. At the time of publication, this change will result in the reclassification of over 11,000 IPs as malicious. Users who use GreyNoise’s malicious tag to block IPs based on malicious intent will see an increase in blocked IPs.
Background
An environment file crawler is a bot that scours the internet for publicly available env files. The use of these files has been popular for over a decade; they are used to pass dynamic environment variables to software and services.
Environment files are dotfiles: files hidden from the user by default but editable in any text editor, containing configuration settings for various applications. An example of an environment file is:
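(An illustrative stand-in; every value below is invented:)

APP_ENV=production
APP_DEBUG=false
DB_HOST=db.internal.example.com
DB_USERNAME=appuser
DB_PASSWORD=hunter2
AWS_ACCESS_KEY_ID=AKIAEXAMPLEEXAMPLE
AWS_SECRET_ACCESS_KEY=wJalrEXAMPLEKEYEXAMPLEKEY
MAIL_PASSWORD=smtp-secret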
They almost always contain sensitive data such as authentication information (e.g., keys or passwords) and often the specific connection paths those credentials unlock. For this reason, env files should never be exposed publicly; anyone who obtains the file can potentially access sensitive information. Adding insult to injury, organizations are often unaware that they are exposing these files to the public, and these crawlers have historically been overlooked.
What is GreyNoise changing?
For years, GreyNoise has monitored env scanners and classified them as unknown intent. However, we continuously strive to enhance our datasets to safeguard organizations and increase the effectiveness of SOCs; thus, we have decided to reclassify these crawlers as malicious.
Click/tap here for more information on GreyNoise classifications.
The reclassification of intent will affect the following tags:
These files should never be publicly exposed since they typically contain sensitive information; the internet noise generated by the constant searching for these files is indicative of the scale of opportunistic attackers looking for credentials.
Using environment files to compromise organizations is a well-established tactic
Organizations should take proactive measures to regularly look for exposed .env files; scanning once won’t cut it as they can appear at any time. Searching for unsecured env files should be a part of an organization's vulnerability management program. If you do find a publicly available .env file for your organization, it is imperative that you immediately remediate the exposure and rotate any credentials that were leaked. GreyNoise will continue to review the classifications of our tags to ensure their efficacy.
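As a starting point, a minimal sketch of such a self-check (the candidate paths are common guesses, not an exhaustive list; only scan assets you own):

import requests

## Common locations where environment files end up exposed (illustrative)
CANDIDATE_PATHS = ["/.env", "/.env.backup", "/app/.env", "/config/.env"]

def check_host(base_url):
    for path in CANDIDATE_PATHS:
        resp = requests.get(base_url + path, timeout=5)
        ## A 200 response full of KEY=value lines is a strong hint of an exposed env file
        if resp.status_code == 200 and "=" in resp.text:
            print(f"Possible exposed env file: {base_url}{path}")

check_host("https://www.example.com")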
Sign up for a free GreyNoise account or request a demo to see how GreyNoise can help provide immediate protection from threats like these, especially when activity mutates from “unknown” or “benign” to “malicious.”
OpenAI recently released a ChatGPT feature that allows plugins to fetch live data from various providers. This feature has been designed with “safety as a core design principle,” which means the OpenAI team has taken steps to ensure that the data being accessed is secure and private.
However, there are some concerns about the security of the example code provided by OpenAI for developers who want to integrate their plugins with the new feature. Specifically, the code examples utilize a docker image for MinIO RELEASE.2022-03-17. This version of MinIO is vulnerable to CVE-2023-28432, which is a security vulnerability resulting in information disclosure of all environment variables, including MINIO_SECRET_KEY and MINIO_ROOT_PASSWORD.
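As an illustration, the disclosure can be checked with a single unauthenticated request (the endpoint is per the public advisory for CVE-2023-28432; the response matching is an assumption, and this should only be run against systems you own):

import requests

## Vulnerable clustered MinIO deployments answer this bootstrap endpoint with
## their environment, including MINIO_SECRET_KEY and MINIO_ROOT_PASSWORD.
resp = requests.post("http://127.0.0.1:9000/minio/bootstrap/v2/verify", timeout=5)
if resp.ok and "MinioEnv" in resp.text:
    print("Instance discloses environment variables; upgrade to a patched release")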
While we have no information suggesting that any specific actor is targeting ChatGPT example instances, we have observed this vulnerability being actively exploited in the wild. When attackers attempt mass-identification and mass-exploitation of vulnerable services, “everything” is in scope, including any deployed ChatGPT plugins that utilize this outdated version of MinIO.
To avoid any potential data breaches, it is recommended that users upgrade to a patched version of MinIO (RELEASE.2023-03-20T20-16-18Z) and integrate security tooling such as docker-cli-scan, or use GitHub’s built-in monitoring for supply chain vulnerabilities, which already includes a record referencing this vulnerability.
While the new feature released by OpenAI is a valuable tool for developers who want to access live data from various providers in their ChatGPT integration, security should remain a core design principle.
To ensure we have as much visibility into activity on the internet as possible, we regularly deploy new sensors in different “geographical” network locations. We’ve selected two sensors for a short “week in the life” series to give practitioners a glimpse into what activity new internet nodes see. This series should help organizations understand the opportunistic and random activity that awaits newly deployed services, plus help folks understand just how little time you have to ensure internet-facing systems are made safe and resilient.
We initially took a look at what the sources of "benign" traffic are slinging your way. Today, we're going to look at the opposite end of the spectrum. We stripped away all the benign sources from the same dataset and focused on incoming tagged traffic from malicious or unknown sources. After all, these detection rules are the heart and soul of the GreyNoise platform, and are also what our customers and community members depend upon to keep their organizations safe.
The "not-so-benign" perspective
If we hearken back to our previous episode, it took over an hour for even the best-of-the-best of the "benigns" to discover our freshly deployed nodes. This makes sense, since there aren't too many legitimate organizations conducting scans, and they do not have infinite resources. Sure, they could likely spare some change to scan more frequently, but they really don't need to.
In contrast, we tagged 8,697 incoming IP addresses in that first week, and the first packet of possible ill-intent appeared ten seconds after the nodes were fully armed and operational. However, the first tagged event — RDP Alternative Port Crawler — was seen three hours later. The difference between those two events lies in one of our core promises: our tags are 100% reliable. We don't just take every IP address hitting our unannounced sensor nodes and shove it into a list of indicators of compromise (IoCs).
Tagged Malicious Traffic Started Coming In As Soon As The Sensors Were Functional
217,852 total malicious/unknown events encountered during the ~7.8 day sampling period.
The above chart is the raw, non-benign connection data to those sensors for that week. You're likely wondering what those spikes are. We did too!
Let's look at some summary data by tallies of:
autonomous system organization (aso) — the name of the network the connections came from
geolocated country
destination port
source IPv4
total connections
The Four Largest "Spike" Hours Had Mostly Similar Characteristics
The August 28th malicious traffic spike (for an hour) focused mainly on SMB exploits, and originated from the "Data Communication Business Group" autonomous system in Taiwan. It is odd that we saw so little other activity, and that the port volume was an order of magnitude less. There are any number of reasons for this. Given where these sensors are (which we're not disclosing), it could have been a day of deliberate country network isolation. Or, it could just mean that the botnet herders were super-focused on SMB.
Over the course of those seven days, non-benign nodes hit over thirteen thousand ports, and you can likely guess which ones made the top of the list.
We Saw The Usual Suspects Rise To The Top Of 13,576 Ports
Port numbers associated with Telnet, RDP, SSH, and SMB were, by far, the most common.
Since we called out one autonomous system, we should be fair and call them all out.
If You've Ever Stared At An IPv4 IoC List, You Definitely Recognize These Folks
These are very common autonomous systems to see in malicious attack logs. What should concern you, however, is that "high reputation" sources such as Linode, OVH SAS, Google LLC, and Microsoft Corporation are all in the visible treemap cells. Even if you don't consider them high-reputation sources, you cannot permanently block communication to/from them: these are all hosting providers, and we see other hosting providers with decent IP reputation also hosting malicious traffic sources. The hourly updated nature of our API-downloadable block lists, combined with the scheduled roll-off of IP addresses once they stop exhibiting malicious activity, means you can keep your organization safe from these IPs while they are trying to do harm.
It's all about the tags
What you, and we, truly care about is the tagged, non-benign traffic.
The Tagged Traffic Distribution Takes A Familiar Shape
Of the 115 identified tags, those associated with RDP, SMB, and MS SQL attacks topped the list, along with scrapers looking for useful information. If you still haven't removed RDP from your perimeter, please stop reading and take the opportunity to do so now.
Head on over to the Observable notebook that houses all these charts and data for a more interactive version of the chart and data tables.
You can see the aforementioned yellow SMBv1 Crawler August 28th spike right after "12 PM".
Key Takeaways
When you deploy a new internet-facing system, you have only seconds until unwanted traffic comes knocking on your door. After that, there's a constant drumbeat, and sometimes even an entire off-key orchestra, of unsought after:
benign traffic attempting to maintain an inventory of internet-connected devices and services;
malicious traffic with a laundry list of goals to achieve; and,
unknown traffic that could be something we and the rest of the cybersecurity community have not yet identified as malicious, but is definitely something your apps and other services might not be able to handle.
You can sign up for a GreyNoise account to start exploring the tags and networks identified in this post, and check out some of our freshly minted new features such as IP Similarity — which lets you hunt for bad actors exhibiting behavior similar to ones we've tagged — and, IP Timeline — where you can see what sources have been up to, and how their behavior has changed over time.
Microsoft’s Patch Tuesday (Valentine’s Edition) released information on four remote code execution vulnerabilities in Microsoft Exchange, impacting the following versions:
Exchange Server 2019
Exchange Server 2016
Exchange Server 2013
Attackers must have functional authentication to attempt exploitation. If they are successful, they may be able to execute code on the Exchange server as SYSTEM, a mighty Windows account.
Exchange remote code execution vulnerabilities have a bit of a pattern in their history, which is notable because authentication is also a requirement for exploiting these newly announced vulnerabilities.
CVE-2023-21529, CVE-2023-21706, and CVE-2023-21707 are similar to CVE-2022-41082 (which GreyNoise covered back in September 2022) in that they all require authentication to achieve remote code execution. Readers may know those September 2022 vulnerabilities under the "ProxyNotShell" moniker, since an accompanying Server-Side Request Forgery (SSRF) vulnerability was leveraged to bypass the authentication constraint. "As per our last email," we noted this historical pattern of Exchange exploitation in prior blogs, and we track recent related activity under the Exchange ProxyNotShell Vuln Check tag, which sees regular activity.
Shadowserver, a nonprofit organization which proactively scans the internet and notifies organizations and regional emergency response centers of outstanding exposed vulnerabilities, noted that there were over 87,000 Exchange instances vulnerable to CVE-2023-21529 (the most likely vulnerability entry point of the four new weaknesses).
As of the publishing date of this post, there are no known, public proof-of-concept exploits for these new Exchange vulnerabilities. Unless attackers are attempting to bypass web application firewall signatures that protect against the previous server-side request forgery (SSRF) weakness, it is unlikely we will see any attempts to mass exploit these new weaknesses any time soon. Furthermore, determined attackers have been more stealthy when it comes to attacking self-hosted Exchange servers, amassing solid IP address and domain inventories of these systems, and retargeting them directly for new campaigns.
GreyNoise does not have a tag for any of the four new Exchange vulnerabilities, but is continuing to watch for emergent proof-of-concept code and monitoring activity across our multi-thousand-node sensor network for anomalous Exchange exploitation. Specifically, we are keeping a keen eye on any activity related to an SSRF bypass or Exchange credential brute-forcing meant to meet the authentication constraints an attacker needs to leverage these vulnerabilities.
GreyNoise researchers will update this post if and when new information becomes available.
Given the likely targeted nature of new, malicious Exchange exploit campaigns, you may be interested in how GreyNoise can help you identify targeted attacks, so you can focus on what matters to your organization.
UPDATE 2023-02-14: in response to an inquiry, GreyNoise researchers went back in time to see if there were exploit attempts closer to when CVE-2021-21974 was released.
From January 1, 2021 through June 1, 2021, two IPs were observed exploiting CVE-2021-21974, each active (as observed by GreyNoise sensors) for a single day.
Active 2021-05-25, 45[.]112[.]240[.]81
Active 2021-05-31, 77[.]243[.]181[.]196
In recent days, CVE-2021-21974, a heap-overflow vulnerability in VMware ESXi’s OpenSLP service, has been prominently mentioned in the news in relation to a wave of ransomware affecting numerous organizations. The relationship between CVE-2021-21974 and the ransomware campaign may be blown out of proportion. We do not currently know what the initial access vector is, and it is possible it could be any of the vulnerabilities related to ESXi’s OpenSLP service.
The security community seems to be focusing on a single vulnerability. GreyNoise believes that CVE-2021-21974 makes sense as an initial access vector, but we are not aware of any first-party sources confirming that to be the case. We encourage defenders to remain vigilant and not take every vendor at their word (including us).
The objective of the following document is to provide clarity to network defenders surrounding the ransomware campaign as it relates to the following items:
Attribution of exploitation vector to a specific CVE
Metrics of vulnerable hosts
Metrics of compromised hosts
How do you “GreyNoise” an unknown attack vector?
Attribution to a specific CVE
CVE-2021-21974 is a heap-overflow vulnerability in ESXi’s OpenSLP service.
CVE-2020-3992 is a use-after-free vulnerability in ESXi’s OpenSLP service.
CVE-2019-5544 is a heap overwrite vulnerability in ESXi’s OpenSLP service.
Back in October 2022, Juniper Networks wrote a blog regarding the potential usage of CVE-2019-5544 / CVE-2020-3992 as part of an exploitation campaign. Due to log retention limits on the compromised server, they were unable to confidently attribute which specific vulnerability resulted in successful exploitation. Instead, they focused their blog on the details of the backdoor that was installed post-exploitation.
On February 3rd, 2023, the cloud hosting provider OVH published a notice regarding an active ransomware campaign affecting many of their ESXi customers, hereafter referred to as “ESXiArgs” due to the ransomware creating files with an extension of .args. As part of their notice, they provided the following quote:
According to experts from the ecosystem as well as authorities, the malware is probably using CVE-2021-21974 as compromission vector. Investigation are still ongoing to confirm those assumptions.
On February 6th, 2023, VMware published a security blog acknowledging the “ESXiArgs” campaign and stated:
VMware has not found evidence that suggests an unknown vulnerability (0-day) is being used to propagate the ransomware used in these recent attacks.
In summary, while many third-party intelligence sources directly attribute this ransomware campaign to CVE-2021-21974, the first-party sources do not.
Tl;dr:
There are several high-profile OpenSLP vulnerabilities in various versions of ESXi
These vulnerabilities have been exploited in the past to install malicious backdoors
No CVE is being concretely attributed as the initial access vector for the ESXiArgs campaign by first-party sources
Metrics of Vulnerable Hosts
There are many companies that scan the internet with benign intentions for inventory, research, and actionable intelligence. GreyNoise sees these companies on a very regular basis, since we operate “sensors” similar to a honeypot. They scan, and we (GreyNoise) listen.
Without going into too much depth, there is a significant complexity jump between “determining if a port is open on a server” and “determining what protocol is operating on a port”.
For scanners, high interaction protocols such as those used by the ESXi OpenSLP service may be checked on a weekly/monthly basis, whereas more common protocols such as HTTP(s) on common ports like 80/443 may be checked nearly constantly.
Much like the variety of benign internet-wide scanning companies, GreyNoise is not the only organization operating honeypots on the internet. This causes biases in reported metrics of potentially vulnerable servers on the public internet.
Once an incident such as the ESXiArgs campaign has begun, “scanning” organizations will ramp up scanning, and “honeypot” organizations will ramp up honeypots. By that point, the ESXiArgs campaign is already underway, and more accurate metrics can be drawn from other attributes.
Tl;dr:
Metrics regarding vulnerable host counts have biases
Metrics regarding vulnerable host counts are scoped estimates
These metrics are still the most accurate reports available
Metrics of Compromised Hosts
One of the publicly visible aspects of the “ESXiArgs” campaign is that a ransom note is made available on a host’s public IP with a title of How to Restore Your Files.
By performing a query for How to Restore Your Files, we can generate a list of the autonomous system organizations and countries affected by this campaign, complete with a generated timestamp, since this number continually fluctuates and is only accurate as a point-in-time metric.
OVH is the predominantly affected hosting provider
France (where OVH is primarily located) is the most impacted region
The estimated count of compromised hosts at the time of writing is between 1,500 and 2,000 nodes. Censys noted that the initial high water mark of compromised nodes was over 3,500.
How do you “GreyNoise” an unknown attack vector?
GreyNoise has had a tag available for tracking and blocking CVE-2021-21974 since June 2021:
At this moment in time, we’re seeing a Log4j-style conundrum 😬; the majority of CVE-related activity comes from benign cybersecurity companies checking for the presence of the vulnerability.
As described above, there are no confirmed reports of the initial CVE exploit vector, so how can GreyNoise help defenders when the attack vector is unknown?
As we explained in our “How to know if I am being targeted” blog, IPs that return no results when searched on GreyNoise are traffic that is targeted at your organization.
If your organization observes a connection from an IP that returns no search results in GreyNoise, you should almost certainly prioritize that investigation, because it’s targeting your organization instead of the entire internet. If you find that an IP not known to GreyNoise was connecting to your organization’s VMware ESXi on TCP/427, you should definitely prioritize that investigation.
In cases where initial access vectors are unknown, using GreyNoise as a filter for internet background noise can help prioritize the things that matter the most, because absence of a signal is a signal itself.
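That triage can be automated; here’s a minimal sketch with the GreyNoise Python SDK (pip install greynoise; the response fields used below are assumptions based on the public API docs):

from greynoise import GreyNoise

gn = GreyNoise(api_key="YOUR_API_KEY")

for ip in ["203.0.113.10", "198.51.100.7"]:  ## IPs pulled from your own connection logs
    results = gn.quick(ip)  ## fast check: is this IP internet-wide background noise?
    if not results or not results[0].get("noise"):
        ## Not scanning the whole internet; it may be targeting you specifically
        print(f"{ip}: no GreyNoise results, prioritize this investigation")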
You may have noticed an anomalous uptick of PUT requests in the GreyNoise sensors these past couple of days (2023-01-22 → 2023-01-24). For those interested, we’ve put together a quick summary of the details for you to dive into.
The majority of PUT requests occurred from January 15, 2023, to January 26, 2023. During this period, 2,927 payloads were observed containing HTTP paths of randomly generated letters and either a “.txt” or “.jsp” file extension. Similarly, the body of the PUT requests contained one randomly generated string as plain text and another randomly generated string inside a Jakarta Server Pages (JSP) comment. We believe this to be a methodology for inserting a unique identifier into the target server to determine the potential for further exploitation, such as the ability to upload arbitrary files, retrieve the contents of the uploaded file, and determine whether the JSP page was rendered (indicative of possible remote code execution).
Sample of Decoded HTTP PUT Payload
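Based on the description above, an illustrative reconstruction of such a request (all values invented) looks like:

PUT /qWzXcVbN.jsp HTTP/1.1
Host: <target>
Content-Length: 41

tHiSiSrAnDoM<%-- aNoThErRaNdOmStRiNg --%>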
The remaining path counts can be seen here:
Table of HTTP PUT Path Counts Observed by GreyNoise Sensors
The most common is "/api/v2…", a path often found in FortiOS authentication bypass attempts. Check out our blog to learn more about tracking this exploit. We’ve also seen variations of "/FxCodeShell.jsp…", which is indicative of Tomcat backdoor usage. Each has its respective packet example below:
FortiOS Authentication Bypass Attempt
Tomcat Backdoor CVE-2017-12615
Inquiring into these paths led to a discovery for us as well! Having been formerly unfamiliar with the "/_users/org.couchdb.user:..." path, we did some digging, which led to a new signature for CVE-2017-12635.
Instances of Apache CouchDB Remote Priv Esc Attempts
This highlights novel ways attackers are attempting to fingerprint exposed services using known vulnerabilities, and is a starting point for hunting for additional malicious activity related to these requests.
When digging into these anomalies, GreyNoise researchers noticed a pattern of randomly generated JSP files being used to check whether attackers can upload, and then access, their uploaded files.
The FortiOS authentication bypass used the "/api/v2" HTTP path prefix along with a "User-Agent: Node.js" header.
To ensure we have as much visibility into activity on the internet as possible, we regularly deploy new sensors in different “geographical” network locations. We’ve selected two sensors for a short “week in the life” series to give practitioners a glimpse into what activity new internet nodes see. This series should help organizations understand the opportunistic and random activity that awaits newly deployed services, plus help folks understand just how little time you have to ensure internet-facing systems are made safe and resilient.
The “Benign” perspective
Presently, there are three source IPv4 classifications defined in our GreyNoise Query Language (GNQL): benign, malicious, and unknown. Cybersecurity folks tend to focus quite a bit on the malicious ones since they may represent a clear and present danger to operations. However, there are many benign sources of activity that are always on the lookout for new nodes and services, so we’re starting our new sensor retrospective by looking at those sources first. Don’t worry, we’ll spend plenty of time in future posts looking at the “bad guys."
While likely far from a comprehensive summary, there are at least 74 organizations regularly conducting internet service surveys of some sort (we’ll refer to them as ‘scanners’ moving forward):
AdScore
Ahrefs
Alpha Strike Labs
Ampere Innotech
ANT Lab
Applebot
Arbor Observatory
Archive.org
BinaryEdge.io
BingBot
Bit Discovery
Bitsight
BLEXBot
Caida
Censys
CERT-FR
Cloud System Networks
Cloudflare
Cortex Xpanse
CriminalIP
cyber.casa
CyberGreen
cymru
dataplane
DomainTools
Dutch Institute for Vulnerability Disclosure
errata
ESET
Facebook Crawler
FH Muenster University
GoogleBot
Internet Census
InterneTTL
Intrinsec
IPinfo.io
ipip.net
ipqualityscore
Knoq
LeakIX
Mail.RU
Max Planck Inst.
Moz DotBot
Net Systems Research
NetCraft
netsystems
ONYPHE
OpenIntel.nl
openportstats
Palo Alto Crawler
Petalbot
pnap
Project Sonar
Project25499
Quadmetrics.com
Qualys
Qwant
Recyber
RWTH AACHEN University
scorecardresearch
SecurityTrails
Seznam
ShadowServer.org
Shodan.io
Sogou
spyse
Stanford Univ.
stretchoid
Technical University of Munich
threatsinkhole
UMich
University of Colorado
VeriSign
WithSecure
Yandex Search Engine
We were curious as to how long it took these scanners to find our new nodes after they came online and were ready to accept connections. We capped our discovery period exploration at a week for this analysis but may dig into longer time periods in future updates.
Out of the 74 known scanners, only 18 (24%) contacted our nodes within the first week.
As the above chart shows, some of the more well-known scanners found our new sensor nodes within just about an hour after being booted up. A caveat to this data is that other scanners in the main list may have just tried contacting the IP addresses of these nodes before we booted them up.
One reason organizations should care about this metric is that some of these scanners are run by “cyber hygiene” rating organizations, and you only get one chance to make a first impression that could negatively impact, say, your cyber insurance premiums. So, don’t deploy poorly configured services if you want to keep the CFO happy.
Benign infrastructure
It’s pretty “easy” to scan the entire internet these days, thanks to tools such as Rob Graham’s masscan, provided you like handling abuse complaints and can afford the bandwidth costs on providers that allow such activity. We identified each of these scanning organizations via their published list of IPs. We decided to see just how many unique IPs of each scanner we saw within the first week:
Bitsight dedicates a crazy amount of infrastructure to poke at internet nodes. Same for the Internet Census. By the end of the week, we saw 346 unique benign scanner IPs contact our sensors, which means your internet-facing nodes likely did as well. While you may not want these organizations probing your perimeter, the reality is that, while you may be able to ask them to opt you out of scanning, you cannot do the same for attackers (abuse complaints aren’t a great solution either). Some organizations, ShadowServer in particular, are also there to help you by letting you understand your “attack surface” better, so you are likely better off using our benign classified IPs to help thin down the alerts these services likely generate (more on that in a bit).
The chart above also shows that some services have definite “schedules” for these scans, and others rarely make contact. Just how many contacts can you expect per day?
Hopefully, you are using some intelligent alert filtering to keep your defenders from being overloaded.
What are the scanners looking for?
Web servers may rule the top spot of deployed internet-facing services, but they aren’t the only exposed services and they aren’t just hosted on ports 443 and 80 anymore. Given how many IP addresses some scanners use and how many times the node in the above example was contacted by certain scanners, it’s likely a safe assumption that the port/service coverage was broad for some of them. It turns out, that assumption was spot-on:
At least when it comes to this observation experiment, Censys clearly has the most port/service coverage out of all the benign scanners. It was a bit surprising to see such a broad service coverage in the top seven providers, although most have higher concentrations below port 20000.
If you think you’re being “clever” by hosting an internet-facing service on a port “nobody will look at," think again. Censys (and others’) scans are also protocol-aware, meaning they’ll try to figure out what’s running based on the initial connection response. That means you can forget about hiding your SSH and SMB servers from their watchful eyes. As we’ll see in future posts, non-benign adversaries also know you try to hide services, and are just as capable of looking for your hidden high port treasure.
Going beyond benign
If we strip away all the benign scanner activity, we’re left with the real noise:
We’ll have follow-up posts looking at “a week in the life” of these sensors to help provide more perspectives on what defenders are facing when working to keep internet-facing nodes safe and sound.
Remember: you can use GreyNoise to help separate internet noise from threats as an unauthenticated user on our site. For additional functionality and IP search capacity, create your own GreyNoise Community (free) account today.
GreyNoise tags are described in the documentation as “a signature-based detection method used to capture patterns and create subsets in our data.” The GreyNoise Research team is responsible for creating tags for vulnerabilities and activities seen in the wild by GreyNoise sensors. GreyNoise researchers have two main methods for tagging: a data-driven approach, and an emerging threats-driven approach. Each of these approaches has three main stages:
Discovery
Research
Implementation
Data-driven approach
When using a data-driven approach, researchers work backward from the data collected by GreyNoise sensors. Researchers will manually browse data or create tooling to aid in finding previously untagged and interesting data. This method relies heavily upon intuition and prior expertise and has led to non-vulnerability-related discoveries such as a Holiday PrintJacking campaign. Using this approach, GreyNoise steadily works toward providing some kind of context for every bit of data opportunistically transmitted over the internet to our sensors.
During the discovery phase, researchers identify interesting data that does not appear to be tagged by manual or tool-assisted browsing of raw sensor data. Researchers will simply query the data lake for interesting words or patterns, using instinct to drive exploration of the data. Once they have identified and collected an interesting set of data, they begin the research phase.
During the research phase, the researcher works to identify what the data is. This could be anything from CVE-related traffic to a signature for a particular tool. They do this by scouring the internet for various paths, strings, and bytes to find documentation relating to the raw traffic. This often requires the researcher to be adept at reading formal standards, like Requests for Comments, as well as reading source code in a variety of programming languages. Once they have identified the data, the researcher will gather and document their findings before moving on to the implementation phase.
Using their research, the researcher will implement a tag by actually writing the signature and populating the metadata that makes it into a GreyNoise tag. Once complete, a peer will review the work, looking for errors in the signature and false positives in the results before clearing it for production.
Emerging threats approach
When using an emerging threats-driven approach, researchers seek out emerging threats observable by GreyNoise sensors. For the most part, GreyNoise only observes network-related vulnerability and scanning traffic. This rules out vulnerabilities like local privilege escalations. Using this method, GreyNoise can provide early warning for mass scanning or exploitation activity in the wild of things like CVE-2022-26134, an Atlassian Confluence Server RCE.
During the discovery phase, researchers monitor a wide variety of sources such as tech news outlets, social media, US CISA’s Known Exploited Vulnerabilities Catalog, and customer/community requests. Researchers identify and prioritize CVEs that customers and community members may be interested in due to their magnitude, targeted appliances, etc.
Similar to the data-driven approach, researchers will gather publicly available information regarding the emerging threat that will allow them to write a signature. Proof-of-Concept (PoC) code is often the most useful piece of information. On rare occasions, lacking a PoC, researchers will attempt to independently reproduce the vulnerability. Researchers will often attempt to validate vulnerabilities by setting up testbeds to better understand what elements of the vulnerability should be used to create a unique and narrowly scoped signature.
Finally, using all collected information, the researcher will seek to write the signature that becomes a tag. When doing this, researchers focus on eliminating false positives and tightly scoping the signature to the targeted data or vulnerability. When relevant for emerging threats, GreyNoise researchers will run this signature across all of GreyNoise’s historical data to determine the date of the first occurrence. This allows GreyNoise to publish information regarding when a vulnerability has first seen mass exploitation in the wild and, occasionally, if a vulnerability, like OMIGOD, was exploited before exploit details were publicly available.
How to use GreyNoise tags
GreyNoise provides insight into IP addresses that are scanning the internet or attempting to opportunistically exploit hosts across the internet. Tag data associated with a specific IP address provides an overview of the activity that GreyNoise has observed from a particular IP, as well as insight into the intention of the activity originating from it. For example, we can see that this IP is classified as malicious in GreyNoise because it is doing some reconnaissance but also has tags associated with malicious activity attempting to brute force credentials as well as traffic identifying it as part of the Mirai botnet.
GreyNoise tags are also a great way to identify multiple hosts that are scanning for particular items or CVEs. For example, querying for a tag and filtering data can show activity related to a CVE that is originating from a certain country, ASN, or organization. This gives a unique perspective on activity originating from these different sources.
Finally, tag data is accessible via the GreyNoise API and allows integrations to add this tag data easily into other products.
Practical takeaways from CISA's Cyber Safety Review Board Log4j report
The Cybersecurity and Infrastructure Security Agency (CISA)'s Cyber Safety Review Board (CSRB) was established in May 2021 and is charged with reviewing major cybersecurity incidents and issuing guidance and recommendations where necessary. GreyNoise is cited as a primary source of ground truth in the CSRB's first report (direct PDF), published on July 11, 2022, covering the December 2021 Log4j event. GreyNoise employees testified to the Cyber Safety Review Board on our early observations of Log4j. In this post, we'll examine key takeaways from the report with a GreyNoise lens and discuss what present-day Log4j malicious activity may portend.
Log4J retrospective
It may seem as if it's been a few years since the Log4j event ruined vacations and caused quite a stir across the internet. In reality, it's been just a little over six months since we published our "Log4j Analysis - What To Do Before The Next Big One" review of this mega-incident. Since that time, we've worked with the CSRB and provided our analyses and summaries of pertinent telemetry to help them craft their findings.
The most perspicacious paragraph in the CSRB's report may be this one:
Most importantly, however, the Log4j event is not over. The Board assesses that Log4j is an “endemic vulnerability” and that vulnerable instances of Log4j will remain in systems for many years to come, perhaps a decade or longer. Significant risk remains.
In fact, it reinforces their primary recommendation: "Organizations should be prepared to address Log4j vulnerabilities for years to come and continue to report (and escalate) observations of Log4j exploitation."(CSRB Log4j Report, pg. 6)
The CSRB further recommends (CSRB Log4j Report, pg. 7) that organizations:
develop strong configuration, asset, and vulnerability management practices
invest resources in the open-source ecosystems they depend upon
follow the lead of the Federal government and require and use Software Bill of Materials (SBOM) when sourcing software and components
This is all sound advice, but if your organization is starting from ground zero in any of those bullets, getting to a maturity level where they will each be effective will take time. Meanwhile, we've published three blogs on emerging, active exploit campaigns since the Log4j event:
Furthermore, CISA has added over 460 new Known Exploited Vulnerabilities (KEV) to their ever-growing catalog. That's quite a jump in the known attack surface, especially if you're still struggling to know if you have any of the entries in the KEV catalog in your environment.
While you're working on honing your internal asset, vulnerability, and software inventory telemetry, you can improve your ability to defend from emerging attacks by keeping an eye on attacker activity (something our tools and APIs are superb at), and ensuring your incident responders and analysts identify, block, or contain exploit attempts as quickly as possible. That's one area the CSRB took a light touch on in their report, but that we think is a crucial component of your safety and resilience practices.
Log4j today
Log4j is (sadly, unsurprisingly) alive and well:
This hourly view over the past two months shows regular, and often large, amounts of activity from a handful (~50) of internet sources. In essence, Log4j has definitely become one of the many permanent and persistent components of internet "noise," further confirming the CSRB's assessment that Log4j is here for the long haul. As if on cue, news broke of Iranian state-sponsored attackers using the Log4j exploit in very recent campaigns just as we were preparing this post for publication.
If we take a look at what other activity is present from those source nodes, we make an interesting discovery:
While most of the nodes are only targeting the Log4j vulnerability, some are involved in SSH exploitation, hunting for nodes to add to the ever-expanding Mirai botnet clusters, or focusing on a more recent Atlassian vulnerability from this year.
However, one node has merely added Log4j to the inventory of exploits it has been using. It's not necessary to see all the tag names here, but you can explore these IPs in-depth at your convenience.
Building on the conclusion of the previous section, you can safely block all IPs associated with the Apache Log4j RCE Exploit Attempt tag or other emerging tags to give you and your operations teams breathing room to patch.
You are always welcome to use the GreyNoise product to help you separate internet noise from threats as an unauthenticated user on our site. For additional functionality and IP search capacity, create your own GreyNoise Community (free) account today.
The GreyNoise research team has reviewed a ton of IPv6 research to provide a roadmap for the future of GreyNoise sensors and data collection. IPv6 is, without a doubt, a growing part of the Internet’s future. Google’s survey shows that adoption rates for IPv6 are on the rise and will continue to grow; the United States government has established an entire program and set dates for migrating all government resources to IPv6; and, most notably, the IPv4 exhaustion apocalypse continues to be an issue. As we approach a bright new future for IPv6, we must also expect IPv6 noise to grow. For GreyNoise, this presents a surprisingly difficult question: where do we listen from?
According to ZMap, actors searching for vulnerable devices can scan all 4.2 billion IPv4 addresses in less than an hour. Unlike IPv4 space, IPv6 is unfathomably large, weighing in at approximately 340 × 10^36 addresses. Quick math lets us estimate roughly 6.523 × 10^24 years to scan all of IPv6 space at the same rate one might use to scan IPv4 space. Sheer size prevents actors from surveying IPv6 space for vulnerabilities in the same way as IPv4.
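That back-of-the-envelope math, as a sketch (the one assumption being that a full IPv4 sweep takes an hour):

ipv6_addresses = 2 ** 128            ## ~3.4 x 10^38 addresses
rate_per_hour = 2 ** 32              ## one full IPv4-sized sweep per hour
hours = ipv6_addresses / rate_per_hour
years = hours / (24 * 365)
print(f"{years:.3e} years")          ## ~9.0e24, the same order of magnitude as the estimate above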
But there’s a Hitlist?
Since actors cannot simply traverse the entire address space as they can with IPv4 space, determining where responsive devices might reside in IPv6 space is a difficult and time-consuming endeavor – as demonstrated by the IPv6 Hitlist Project. Projects like the Hitlist are critical as they allow academic researchers to measure the internet and provide context for the environment of IPv6. Without projects like this, we wouldn’t know adoption rates or understand the vastness of the IPv6 space.
Research scanning is one of the internet’s most important types of noise. It also happens to be the only noise that GreyNoise marks as benign. Unfortunately, researchers aren’t the only ones leveraging things like the Hitlist to survey IPv6 space. Malicious actors also use these “found” responsive IPv6 address databases to hunt vulnerable hosts. To better observe and characterize the landscape of IPv6 noise, GreyNoise must ensure that our sensors end up on things like the IPv6 Hitlist.
One strategy is to place sensors inside reserved IPv6 space. IPv6 addresses can be up to 39 characters long, proving a challenge to memorize over IPv4’s maximum of 15. Reliance on DNS will become even more prevalent as more organizations adopt IPv6, exposing reverse DNS as a primary method for enumerating devices. Following the Nmap ip6.arpa scan logic, appending a nibble to an IPv6 prefix and performing a reverse DNS lookup will return one of two results: NXDOMAIN, indicating no entry under that prefix, or NOERROR, indicating a registered host. This method can efficiently reduce the number of hosts scanned in an IPv6 prefix, but has the prerequisite of knowing an appropriate IPv6 prefix to extend. Since GreyNoise already places sensors in multiple data centers and locations, any database, like the IPv6 Hitlist, will already include us.
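A minimal sketch of that enumeration step, assuming the dnspython library (the prefix below is illustrative documentation space):

import dns.resolver

def subtree_exists(arpa_name):
    ## NXDOMAIN means nothing lives under this prefix; NOERROR (even with
    ## no PTR answer) means the subtree exists and is worth descending into
    try:
        dns.resolver.resolve(arpa_name, "PTR")
        return True
    except dns.resolver.NXDOMAIN:
        return False
    except dns.resolver.NoAnswer:
        return True

## Check one nibble deeper under 2001:db8::/32 (nibbles are reversed in ip6.arpa)
base = "8.b.d.0.1.0.0.2.ip6.arpa"
live = [n for n in "0123456789abcdef" if subtree_exists(f"{n}.{base}")]
print(live)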
Another method is to reside inside of providers that are IPv6-routed. BGP announcements provide a direct route to IPv6 networks, but an enumeration of responsive hosts is still an undertaking. Scanners will need to find a way to catalog and call back to the responsive hosts since there could still be many results (and the size of the address is much larger). Providers with IPv6 routing are growing and affordable, making it worthwhile for us to deploy sensors and work with widely used providers to determine who is already getting scanned using this method.
Our current IPv6 status
What we currently see in our platform begins with reliable identification of IPv6-in-IPv4 encapsulation, commonly referred to as 6in4, which rides over IP protocol 41. None of our sensors currently sit on IPv6-only providers, so any IPv6 traffic we observe arrives IPv4-encapsulated.
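For illustration, here is a hedged sketch of spotting 6in4 packets in a capture using the third-party scapy library; the capture filename is a placeholder:

```python
# Flag 6in4 traffic: IPv4 packets whose protocol number is 41, meaning the
# payload is an encapsulated IPv6 packet. Requires scapy.
from scapy.all import rdpcap
from scapy.layers.inet import IP
from scapy.layers.inet6 import IPv6

for pkt in rdpcap("sensor.pcap"):        # placeholder capture file
    if IP in pkt and pkt[IP].proto == 41 and IPv6 in pkt:
        inner = pkt[IPv6]
        print(f"6in4: outer {pkt[IP].src} carries {inner.src} -> {inner.dst}")
```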
We also see users querying for IPv6 addresses in the GreyNoise Visualizer, but many of these queries are for addresses we could never observe, and GreyNoise can do a better job of explaining why. Users regularly query link-local addresses, which are meant only for communication within a local network. Other queries arrive in sets suggesting users are looking up IPv6 addresses within their own provider's prefix; they may be querying their own address, or nodes attempting neighbor discovery. We are looking at ways to educate and notify users when they input these types of addresses, to help them better understand the IPv6 landscape.
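A check like the following (a minimal sketch using only Python's standard-library ipaddress module; the function name and messages are ours, not the Visualizer's) shows how straightforward this kind of triage is:

```python
# Classify a queried address before searching for it: link-local and private
# space will never show up in internet-wide scan telemetry.
import ipaddress

def triage(raw: str) -> str:
    addr = ipaddress.ip_address(raw)
    if addr.is_link_local:
        return "link-local: only meaningful on your own network segment"
    if addr.is_private:
        return "private/ULA: not observable in internet-wide telemetry"
    return "globally routable: worth looking up"

print(triage("fe80::1"))                  # link-local
print(triage("2001:4860:4860::8888"))     # globally routable
```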
The future of IPv6
Though the technicalities of scanning IPv6 are more complicated than one might expect, GreyNoise looks to the academic research being done in the IPv6 field to inform future product strategy. As the attack landscape evolves, GreyNoise sensors placed in opportunistic paths will continue to gather and share meaningful IPv6 knowledge with researchers around the world.
This year marks the fifteenth anniversary of the Verizon Data Breach Investigations Report (DBIR). If you're not familiar with this annual publication, it is a tome produced by the infamous cyber data science team over at Verizon. Their highly data-driven approach (referencing 914,547 incidents and 234,638 breaches, plus 8.9 TB of cybersecurity data) helps practitioners understand malicious cyber activity across the industry. The Verizon DBIR shows how threats are trending and evolving, as well as the impact these malicious actions have on organizations of every shape and size.
This year, as in years gone by, GreyNoise researchers contributed our insights-infused, planetary-scale, opportunistic attacker sensor fleet data to the Verizon DBIR. This is the same data that fuels our platform and helps defenders mitigate threats, understand adversaries, and focus on what matters.
Let’s take a look at the key findings from the report, what our data has to say about the current threat landscape, and how you can use insights from our data to help keep your organization safe and resilient.
What Say You, DBIR?
The DBIR team provided five key elements in their overall summary, and we’ll take a modest dive into each of them.
Barbarians At The Gates
First up is how attackers breach your defenses. It will come as no shock to most readers that the use of credentials, phishing attacks, vulnerability exploitation, and botnets are all initial techniques that attackers use to breach the defenses of organizations.
Given the prevalence of credential use in initial access, you may wonder why attackers bother with other means of gaining a foothold. While there will always be internet-facing services with default credentials left intact, stolen user credentials do not age well and need to be replenished regularly (usually by breaching an organization and stealing them en masse). Make no mistake, though: they work far too often, especially against juicy services such as Microsoft's Remote Desktop Protocol, which is why they are used in the first place.
Creds are noisy, and phishing takes real effort to do well, even with phishing kits or phishing-as-a-service providers. Scouring the internet for vulnerable services, by contrast, is almost risk-free, relatively cheap, and can lead to remote code execution on a decent percentage of nodes, as Figure 43 of the DBIR shows (GreyNoise provided the data behind the chart for the DBIR team to work their magic on):
If you ensure you have safe and resilient configurations on your internet-exposed assets and mission-critical internal systems, plus have empowered your workforce to be co-defenders of your organization, you may avoid becoming a statistic, at least in this category.
Always Be enCrypting
Ransomware also plagued more organizations than ever this past year, with a 13% increase from 2020, as shown in our reimagining of Figure 6 in the DBIR. The DBIR’s ransomware corpus is far from complete, but aligns in proportion with statistics from other sources of ransomware incidents.
As noted in the text of the report, ransomware starts with some action, usually one of the initial access techniques noted above. Ransomware actors often take advantage of the latest and greatest exploits for recent CVEs, which is activity you can track in the GreyNoise platform (a sketch follows below) to help you frame the need for speed when it comes to mitigating and patching.
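For instance, a query like the following surfaces IPs currently observed exploiting a given CVE. This is a hedged sketch assuming the community pygreynoise SDK and its GNQL query interface; the API key and CVE are placeholders:

```python
# Pull currently-observed malicious IPs for a specific CVE via GNQL.
# Requires the pygreynoise package and a GreyNoise API key.
from greynoise import GreyNoise

api = GreyNoise(api_key="YOUR_API_KEY")                 # placeholder key
results = api.query("cve:CVE-2021-36260 classification:malicious")

for record in results.get("data", []):
    print(record["ip"], record.get("last_seen"))
```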
Prioritizing patching actively exploited vulnerabilities should be at the top of your to-do list.
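One hedged way to operationalize that advice is to cross-reference your own vulnerability list against CISA's Known Exploited Vulnerabilities (KEV) catalog; the sketch below uses only the standard library, and your_cves is a placeholder for your scanner output:

```python
# Intersect locally discovered CVEs with CISA's KEV feed to find the
# actively exploited ones that deserve the top of the patch queue.
import json
import urllib.request

KEV_URL = ("https://www.cisa.gov/sites/default/files/feeds/"
           "known_exploited_vulnerabilities.json")

with urllib.request.urlopen(KEV_URL) as resp:
    kev_ids = {item["cveID"] for item in json.load(resp)["vulnerabilities"]}

your_cves = {"CVE-2021-36260", "CVE-2021-22502"}        # placeholder list
print("patch first:", sorted(your_cves & kev_ids))
```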
Attackers Getting High On Your Supply [Chain]
Supply chain attacks have been making headlines ever since the highly disruptive SolarWinds incident came to light in late 2020 (though there were numerous documented supply chain attacks long before that mega-event). The DBIR documented over 3,400 "System Intrusion" events this year, showing you need to be as vigilant on the inside as you are on your internet-facing attack surface; this ensures you aren't a conduit to other organizations for criminals. Furthermore, you should have a solid third-party risk management program and some way to track software development dependencies, to help prevent a breach via those you trust.
A Bucketful Of Errors
In this year’s corpus, the DBIR team found that 13% of breaches were caused by errors, often when it comes to securing cloud storage. So, make sure you mind your buckets, but take some comfort: this particular disheartening statistic appears to be stabilizing.
To Err Is Still Human
Humans likely helped cause some (most?) of the aforementioned misconfigurations as well as many other incidents that ended up as breaches. As the DBIR researchers themselves note: “Use of stolen credentials, phishing, misuse, or simply an error, people continue to play a very large role in incidents and breaches alike.”
How To Avoid Being an Accidental DBIR Contributor Next Year
GreyNoise has tools, data, and insights that integrate easily into your comprehensive cybersecurity program to help keep your organization safe and resilient.
We’re almost halfway through the year, and if you’ve managed to avoid a major incident or breach so far, you’re doing pretty well. But (there’s always a “but”), we should note that we’re also likely to see more groups like LAPSUS$ pop up to use their smash-and-grab model. Plus, you’ve also got all the old-school attacks to worry about.
If you and your team can filter out the noise, figure out what you don't need to do, and get visibility into the areas you do need to focus on (while ensuring you have a spot-on incident response program), you may just make it another year without adding to the 8.9 terabytes of data the DBIR team already has to crunch for each report.
Apache Storm Supervisor RCE Attempt [Intention: Malicious]
CVE-2021-40865
This IP address has been observed attempting to exploit CVE-2021-40865, a pre-auth remote code execution vulnerability in the Apache Storm supervisor server.
Hikvision IP Camera RCE Attempt [Intention: Malicious]
CVE-2021-36260
This IP address has been observed attempting to exploit CVE-2021-36260, a remote command execution vulnerability in Hikvision IP cameras and NVR firmware.
SonicWall SMA100 Arbitrary File Delete Attempt [Intention: Malicious]
CVE-2021-20034
This IP address has been observed attempting to exploit CVE-2021-20034, an arbitrary file deletion vulnerability that allows performing a factory reset on SonicWall SMA100 devices.
Micro Focus Operations Bridge Reporter RCE Attempt [Intention: Malicious]
CVE-2021-22502
This IP address has been observed attempting to exploit CVE-2021-22502, a remote command execution vulnerability in Micro Focus Operations Bridge Reporter software.
Yealink Device Management Platform RCE Attempt [Intention: Malicious]
CVE-2021-27561
This IP address has been observed attempting to exploit CVE-2021-27561, a remote command execution vulnerability in the Yealink Device Management Platform.
This IP address has been observed scanning the internet for WSMan PowerShell providers without an Authorization header, attempting to exploit an unauthenticated root RCE in Azure's Open Management Infrastructure (OMI).
Sources: Wiz, Microsoft Security Response Center
This IP address has been observed scanning the internet for WSMan PowerShell providers without an Authorization header, but it has not provided a valid SOAP XML Envelope payload.
Sources: Wiz, Microsoft Security Response Center
Cisco IMC Supervisor and UCS Director Backdoor [Intention: Malicious]
CVE-2019-1935
This IP address has been observed attempting to authenticate via SSH using default credentials for Cisco IMC Supervisor and Cisco UCS Director products.
Our research team is always looking for ways to improve our tagging methodology to enable GreyNoise users to understand actor behavior and tooling. GreyNoise already identifies clients with JA3 and HASSH data.
To expand on this work, GreyNoise recently added three new tags to shed more light on how the HTTP clients behind internet background noise manage their internal state. The tags below improve client fingerprinting for HTTP-based protocols; a short client-side sketch follows the list.
Carries HTTP Referer: This tag identifies HTTP clients that include a "Referer" header, which indicates the page or site from which the HTTP request was referred.
Stores HTTP Cookies: This tag identifies HTTP clients that accept cookies set by the server, store them, and send them with subsequent requests.
Follows HTTP Redirects: This tag identifies HTTP clients that follow 301 (Moved Permanently) redirects to the page or site given in the "Location" header.
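Here is a minimal client-side sketch of the three behaviors using the third-party requests library; the httpbin.org endpoints are just a convenient echo service, not anything GreyNoise-specific:

```python
# Demonstrate the three stateful behaviors the new tags look for, from the
# perspective of an HTTP client. Requires the requests package.
import requests

s = requests.Session()

# Stores HTTP Cookies: the session keeps server-set cookies and replays them.
s.get("https://httpbin.org/cookies/set/seen/1", allow_redirects=False)
print("stored cookie:", s.cookies.get("seen"))                    # -> 1

# Follows HTTP Redirects: requests chases the Location header by default.
r = s.get("https://httpbin.org/redirect-to?url=/get&status_code=301")
print("redirect chain:", [hop.status_code for hop in r.history])  # -> [301]

# Carries HTTP Referer: only present if the client chooses to send it.
r = s.get("https://httpbin.org/headers",
          headers={"Referer": "https://example.com/"})
print("referer echoed:", r.json()["headers"].get("Referer"))
```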
On their own, each individual tag contributes a small indication of how the HTTP client manages its internal state. While that alone has value in helping to profile the actor behind the IP and possibly track them across IPs, the more interesting insights can be seen when these tags are viewed holistically.
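The overlaps in the figures below are just set intersections; a toy version with hypothetical tag-to-IP sets makes the idea concrete:

```python
# Toy Venn arithmetic over hypothetical per-tag IP sets (documentation IPs).
referer   = {"192.0.2.1", "192.0.2.2", "192.0.2.3"}
cookies   = {"192.0.2.2", "192.0.2.3", "192.0.2.4"}
redirects = {"192.0.2.3", "192.0.2.4", "192.0.2.5"}

print("all three (browser-like):", referer & cookies & redirects)
print("referer only (minimal tooling):", referer - cookies - redirects)
```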
Figure 1: Venn Diagram representing IPs that match each tag and their respective overlaps, data pulled on Aug. 25, 2021.
Figure 2: Venn Diagram representing IPs that match each tag and their respective overlaps, data pulled on Sep. 10, 2021.
As seen above, the tagged activity is not homogeneous, giving us a glimpse into the diversity of tooling and techniques used in scanning and opportunistic exploitation. While many actors may use the same exploit vector or payload, they may launch it from tools that support different HTTP features. These new tags can help an analyst determine whether two IPs appear to be using the same tooling.
Figure 3: IP Details page for 42.236.10.75. See it in the GreyNoise Viz.
For example, in Figure 3 we can determine with a high degree of confidence that the IP shown above is driving a full-featured web browser (for instance via an automation framework such as Puppeteer) to scan the internet. We see this because the IP exhibits browser-like behavior across all three tags: it carries a Referer header, accepts cookies, and follows redirects.
We hope these new tags offer our users greater insight into the tooling and libraries utilized by internet background noise-makers. Let us know what you think by sharing your feedback on the GreyNoise Community Slack channel (must have a GreyNoise account).