Capturing a clear image of a distant galaxy isn’t as simple as pointing a telescope and clicking the shutter. In every telescope, charge-coupled device (CCD) sensors introduce their own “noise” – fixed patterns and electronic read noise that can obscure faint starlight. To correct this, astrophotographers take bias frames: ultra-short exposures with the lens cap on, which record nothing but the camera’s inherent read noise.
By averaging many bias frames into a master bias and subtracting it from their actual images, they remove the sensor’s baseline noise, revealing the true celestial signal. The result is a much cleaner photograph of the night sky, where dim stars and galaxies pop out once the background noise is calibrated away.
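To make the analogy concrete, here’s a minimal sketch of that calibration step in Python with NumPy. The frames are simulated rather than loaded from real FITS files, and the frame size, bias offset, and noise level are all invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated stack of 32 bias frames: zero-signal exposures that capture only
# the sensor's fixed offset (~500 ADU here) plus random read noise.
bias_frames = 500 + rng.normal(0, 5, size=(32, 256, 256))

# Median-combine into a master bias; the median is robust to outliers
# such as a cosmic-ray hit in any single frame.
master_bias = np.median(bias_frames, axis=0)

# A raw "light" frame: the same offset and read noise, plus one faint
# source that is hard to see against the uncalibrated background.
sky = np.zeros((256, 256))
sky[128, 128] = 50.0  # the faint "star"
light_frame = 500 + rng.normal(0, 5, size=(256, 256)) + sky

# Subtracting the master bias removes the sensor's baseline,
# leaving just the signal and residual noise.
calibrated = light_frame - master_bias
print(f"Mean background before: {light_frame.mean():.1f} ADU, "
      f"after: {calibrated.mean():.1f} ADU")
```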
GreyNoise applies a similar idea – but instead of cleaning up starlight, we’re cleaning up Internet traffic. Security teams today face a deluge of “internet background noise”: endless connection attempts from botnets, benign research scanners, crawlers, and other opportunistic actors that scan the entire IPv4 space. This activity isn’t a targeted attack on your network; it’s the digital equivalent of the sensor noise in a camera.
GreyNoise captures that background noise so it can be subtracted from the view, allowing analysts to focus on the signals that matter. We maintain a global network of sensors across diverse providers, listening to the flood of mass-scanning traffic that blankets the internet. By soaking up all that unsolicited, untargeted traffic – the “noise floor” or “bias frame” of the internet – GreyNoise provides a reference of ambient noise to filter out.
When an IP address reaches out to you, GreyNoise can tell you if it’s just a benign scanner or something targeted and malicious. GreyNoise collects and analyzes untargeted, widespread, and opportunistic scan-and-attack activity across the entire IPv4 space, giving defenders the ability to filter this useless noise out.
In essence, we subtract the background noise of opportunistic scans to reveal the “stars” in our data – the genuine and targeted threats that deserve attention.
Global Observation Grid: Our Other Atmosphere Observatory
Just as astronomers build observatories in multiple locations or launch telescopes into space to get a more complete view of the sky, GreyNoise deploys sensors globally to gain a comprehensive picture of Internet-wide activity. We operate thousands of sensors across many countries and networks, continuously gathering data on incoming connection attempts.
Each sensor is like a tiny telescope pointed at the “digital sky” – what I call our “other atmosphere” – capturing whatever hits it. Even a single sensor immediately sees a barrage of traffic: a new GreyNoise sensor records scans from dozens of benign services (Shodan, Censys, etc.), opportunistic malware probes, and curious researchers within seconds of coming online.
But any one sensor only sees part of the big picture. Certain scanners might not reach a sensor due to network geography, IP range biases, or timing. That’s why breadth is key: by aggregating data from sensors in diverse autonomous systems and regions, GreyNoise’s platform can observe a far richer set of internet noise than any single point of view.
This global strategy is scientifically grounded in the idea of maximizing observation coverage. In astronomy, you wouldn’t survey the sky from just one telescope and assume you’ve seen everything – you’d miss most of the heavens. Likewise, one honeypot in one network will miss scans that simply never hit that network. Our team continually experiments with deploying sensors in new places (different countries, cloud providers, ISP networks, etc.) to minimize blind spots in our data.
The practical payoff for defenders is huge: with broad sensor coverage, GreyNoise can identify which scan or exploit campaigns are truly widespread versus those that might only be targeting a specific region, sector, or business vertical. It also means when GreyNoise says “we haven’t seen this IP or this exploit anywhere in our sensor grid,” that assertion carries weight – it’s akin to an astronomer saying a celestial object is truly unique after checking multiple telescopes. By investing in a sensor network that spans the globe, we ensure our survey of Internet noise is as complete as possible. And when gaps are suspected, we turn to science to guide our next moves.
Estimating the Unseen
Capturing data from worldwide sensors is only part of the challenge; another part is understanding what we are missing. Here, GreyNoise draws inspiration from ecology and biochemistry to quantify the unknown.
In ecology, researchers sample a forest for species, but they never catch every creature – so they use mathematical models to estimate how many species they didn’t observe. We do the same for scanning IPs and exploit behaviors on the internet.
One model we use comes from enzyme kinetics in biochemistry: the Michaelis–Menten equation. In its original context, Michaelis–Menten describes how the rate of a chemical reaction saturates as substrate concentration increases – a curve that rises quickly at first and then levels off at a maximum (Vmax), where adding more substrate no longer increases the reaction rate. That same saturating, asymptotic curve turns out to describe a lot of discovery processes, including finding new scanning threats. If we plot the cumulative number of unique scanner IPs or exploit techniques observed versus the amount of sensor data collected (over time or as we add more sensors), we often see a similar saturation pattern. Initially, each new sensor or each additional day of data collection yields a trove of “new” IP addresses we hadn’t seen before – the curve shoots upward. But over time, the curve starts to bend and approach an asymptote, indicating we’re catching most of what’s out there.
We can fit a Michaelis–Menten-like function to this curve to estimate its horizontal asymptote (Vmax), which in our world would be the total universe of scanner IPs or exploits active on the internet within a given period. Using this method, if our current observations are well below the asymptote, we know there are likely more actors or activities lurking beyond our view. This approach isn’t arbitrary – ecologists have widely used the Michaelis–Menten model to estimate the richness (S) of species pools, treating the asymptote of the curve as an estimator of total species in an environment. We’re simply treating internet scanners as the “species” and our sensor deployments as the sampling effort.
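As a sketch of what that fit looks like in practice, here’s a minimal Python example using SciPy’s curve_fit. The effort and unique-IP counts are invented for illustration, not real GreyNoise data:

```python
import numpy as np
from scipy.optimize import curve_fit

def michaelis_menten(effort, vmax, km):
    # v = Vmax * S / (Km + S): Vmax is the horizontal asymptote (the
    # estimated total population); Km is the effort at which half of
    # Vmax has been observed.
    return vmax * effort / (km + effort)

# Hypothetical data: sensor-days of collection vs. cumulative unique scanner IPs.
effort = np.array([1, 2, 5, 10, 20, 50, 100, 200], dtype=float)
unique_ips = np.array([1200, 2100, 3900, 5600, 7100, 8600, 9300, 9700], dtype=float)

(vmax, km), _ = curve_fit(michaelis_menten, effort, unique_ips, p0=[10_000, 10])

print(f"Estimated total scanner population (Vmax): {vmax:,.0f}")
print(f"Observed so far: {unique_ips[-1]:,.0f} "
      f"({unique_ips[-1] / vmax:.0%} of the estimated universe)")
```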
In addition to curve-fitting techniques, GreyNoise leverages classic non-parametric biodiversity estimators to gauge visibility gaps. An example is the Chao1 index, originally developed by ecologist Anne Chao. Chao1 looks at how many species were seen only once or twice in a sample and uses that to guess how many species were likely missed entirely.
If our sensors saw a lot of “singleton” IP addresses - scanners that hit just one sensor one time - or only a couple of instances of a particular exploit payload, Chao1 tells us that there are probably many more we haven’t observed yet. Formally, Chao1 is a non-parametric estimator of the number of unobserved species in a sample – it adds an estimate of unseen species to the observed count based on the count of singletons and doubletons.
For GreyNoise, a high Chao1 estimate (significantly larger than the number of unique IPs or exploits we’ve seen) is a red flag that our coverage may be incomplete.
Alongside Chao1, we also use the Abundance-based Coverage Estimator (ACE) index – another ecological richness estimator that focuses on sample coverage and rare occurrences. ACE is widely utilized in ecological research as an estimator of total species count, and in our context it provides a complementary way to assess whether our sensor data has “covered” the space of scanning activity or if we’re still missing some rare scanners. Both Chao1 and ACE give us concrete numbers to quantify the gap between what’s observed and what’s out there – scientific confidence intervals for our visibility.
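For the curious, here’s what the two estimators look like as a simplified sketch in Python. The abundance vector (how many times each unique scanner IP was seen) is invented, and production implementations add bias corrections and confidence intervals that we omit here:

```python
from collections import Counter

def chao1(abundances):
    # Chao1: S_obs + F1^2 / (2 * F2), where F1 and F2 are the numbers of
    # species seen exactly once (singletons) and exactly twice (doubletons).
    freq = Counter(abundances)
    s_obs = len(abundances)
    f1, f2 = freq.get(1, 0), freq.get(2, 0)
    if f2 == 0:  # bias-corrected form when there are no doubletons
        return s_obs + f1 * (f1 - 1) / 2
    return s_obs + f1 ** 2 / (2 * f2)

def ace(abundances, rare_cutoff=10):
    # ACE splits species into "abundant" (> cutoff) and "rare" (<= cutoff),
    # then scales the rare group by an estimate of sample coverage.
    rare = [a for a in abundances if a <= rare_cutoff]
    s_abund = sum(1 for a in abundances if a > rare_cutoff)
    s_rare, n_rare = len(rare), sum(rare)
    f1 = sum(1 for a in rare if a == 1)
    c_ace = 1 - f1 / n_rare  # estimated sample coverage
    if c_ace == 0:  # every rare species is a singleton; ACE is undefined
        return float("inf")
    fk = Counter(rare)
    gamma2 = max(
        (s_rare / c_ace)
        * sum(i * (i - 1) * fk.get(i, 0) for i in range(1, rare_cutoff + 1))
        / (n_rare * (n_rare - 1))
        - 1,
        0,
    )
    return s_abund + s_rare / c_ace + (f1 / c_ace) * gamma2

# Invented sightings: many singletons suggest a large unseen population.
sightings = [1] * 40 + [2] * 15 + [3] * 8 + [7] * 5 + [50] * 3
print(f"Observed unique IPs: {len(sightings)}")        # 71
print(f"Chao1 estimate:      {chao1(sightings):.0f}")  # ~124
print(f"ACE estimate:        {ace(sightings):.0f}")    # ~143
```

A gap that large between what was observed (71) and what is estimated to exist (~124–143) is exactly the kind of signal that tells us our coverage is incomplete.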
Another tool is the species accumulation curve, borrowed from field ecology. A species accumulation curve plots the cumulative number of species discovered against the effort or samples taken. Early in a survey, the curve climbs steeply with each new sample yielding new species; later on it levels off, indicating you’re approaching full coverage of species in that ecosystem.
The Internet is our ecosystem. By plotting the cumulative count of unique malicious IPs detected versus the number of sensors deployed or versus time, we get a visual sense of our coverage.
If the curve is still climbing sharply, it tells us that adding more sensors or running longer will catch many more new IPs or attack types – we are under-sampled and have visibility gaps. Conversely, if the curve starts to plateau, that suggests diminishing returns – our current sensor fleet is approaching comprehensive coverage for that specific region or software/hardware profile. This guides practical decisions: a steep curve might justify quickly rolling out sensors in new networks or regions to gather the missing data, whereas a flat curve might indicate that our existing deployment is sufficient or that we need a fundamentally different approach to find what remains unseen.
In practice, we often compare accumulation curves for different subsets of data – for example, one curve for scanning on common ports vs. another for obscure ports, or separate curves for different geographic regions or different providers – to pinpoint where the biggest unseen opportunities lie. If one region’s curve is flatter than another, it suggests our coverage in that region is more complete, whereas a region with a steep curve likely needs more attention.
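Here’s a compact sketch of how such a curve can be computed, with synthetic sensor data standing in for real observations – the population size, skew, and per-sensor sample counts are all made up:

```python
import numpy as np

rng = np.random.default_rng(7)

# Synthetic "population" of 5,000 scanner IPs with heavy-tailed visibility,
# so a few scanners are seen everywhere and most are seen rarely.
population = np.arange(5_000)
weights = 1 / (population + 1)
weights /= weights.sum()

# Each of 200 sensors records the set of unique IPs it happened to see.
sensors = [set(rng.choice(population, size=300, p=weights)) for _ in range(200)]

# Accumulation curve: shuffle sensor order, then track how the cumulative
# count of unique IPs grows as each additional sensor's data is added.
rng.shuffle(sensors)
seen, curve = set(), []
for sightings in sensors:
    seen |= sightings
    curve.append(len(seen))

# A steep tail means more sensors keep finding new IPs (under-sampled);
# a flat tail suggests coverage is approaching saturation.
print(f"After  10 sensors: {curve[9]:>6,} unique IPs")
print(f"After 100 sensors: {curve[99]:>6,} unique IPs")
print(f"After 200 sensors: {curve[199]:>6,} unique IPs")
print(f"New IPs from the last 50 sensors: {curve[-1] - curve[-51]:,}")
```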
Cause and Distribution
The most interesting actionable intelligence from our observations comes from mean shifts in the data. The question of coverage then becomes: if there is a particular anomalous activity, how many observations do we need to determine that this is the case?
This is a loaded and important question. Put more simply: how many sensors should I run in a given provider or region, and how do I know when something weird is happening on the internet or to my network?
Events and traffic on the internet are not evenly distributed, and they do not arrive on a normal Gaussian bell curve. Arrivals are bursty, with significant special cause variation, and are better modeled by a Poisson distribution. The goal is to suss out, rule out, or filter out some of that special cause variation.
How many sensors do I need to confirm that what I am seeing is normal for a region – enough to filter out special cause variation – and how many samples per factor do I need to detect a mean shift?
Breaking down these questions is like that scene from Arrival where the linguistics professor is explaining how to convey the actual concept of a question to the aliens. You have to boil down absolutely everything into the smallest possible atomic values.
Say you run a coffee shop and there is a competing coffee shop across the street. Over the course of the day, people will arrive on a Poisson distribution. By counting the people arriving over time, you can make a determination of how many coffee machines or baristas you need to prevent a line out the door and long wait times.
Common cause variation here is the weather, the time of day (morning versus lunch versus night), and people choosing one shop versus the other. Special cause variation is when a coffee machine breaks or if there was an event nearby that led to more customers.
Now imagine there is a movie theater down the street, and more customers come in when a movie finishes. This is an example of special cause variation. Counting people coming into the coffee shop, how many arrivals over how short a window do I need before I can say with 95% to 99% certainty that a movie just ended? How long do I have to wait to gain enough confidence to say that a mean shift occurred? How do I determine whether that mean shift was due to common cause variation or special cause variation? If I normally have four baristas on staff who can service 50 customers an hour and produce 100 drinks an hour, how many baristas do I need working when a movie finishes?
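One way to put numbers on those questions is an exact one-sided Poisson test: given a baseline arrival rate, how surprising is the count we just observed? A minimal sketch, with made-up shop numbers:

```python
from scipy.stats import poisson

def shift_detected(count, rate_per_min, window_min, alpha=0.05):
    # Exact one-sided Poisson test: the p-value is P(X >= count) under the
    # baseline model with expectation rate_per_min * window_min.
    expected = rate_per_min * window_min
    p_value = poisson.sf(count - 1, expected)
    return p_value, p_value < alpha

# Baseline: roughly 50 customers per hour (~0.83 per minute).
baseline = 50 / 60

# How many arrivals in a 5-minute window before we call a mean shift
# at 95% confidence?
for count in (6, 9, 12):
    p, verdict = shift_detected(count, baseline, window_min=5)
    print(f"{count:>2} arrivals in 5 min: p={p:.4f} -> "
          f"{'mean shift' if verdict else 'within normal variation'}")
```

Under those assumptions, six arrivals in five minutes is unremarkable, while nine or twelve are unlikely enough to call a shift – the sort of threshold that tells you a movie probably just ended. Distinguishing special cause from common cause still requires context (is there a theater nearby? what time is it?), which the test alone cannot supply.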
To tie it together – the coffee shop across the street is getting a lot more customers (attackers, scanning and exploitation traffic). They are running a special discount: free coffee with a movie ticket stub (software with a newly disclosed vulnerability). Customer traffic drops off at your shop and picks up at the other. You need to run the same discount (vulnerable software) to attract that traffic.
What if the competing coffee shop was seeing triple the number of customers yours was, without running an announced promotion? Could there be an undisclosed vulnerability – an 0day? Or is it just because the other shop is closer to the movie theater?
There are reasons why things happen, and those reasons are equally important to knowing that they happened.
Science-Guided Coverage and Continuous Tuning
All these methods – whether from astronomy, ecology, or chemistry – serve a common purpose: to ground GreyNoise’s threat intelligence in data and scientific rigor. Instead of guessing where to place new sensors or simply reacting to what we happen to catch, our team follows the evidence. The analogies to bias frames and species surveys aren’t just metaphorical; they actively shape how we operate.
When our analysis showed that adding sensors in certain under-monitored networks was yielding a disproportionately high number of new unique IPs (a steep accumulation curve), we knew those networks were fertile ground – and we invested in deploying more sensors there.
When our Chao1/ACE estimates indicate a large unseen population of exploitation traffic (a big gap between observed exploit count and estimated total exploit count), we dedicate effort to figure out what those unknown exploits might be – perhaps by analyzing one-off oddball traffic (the “singletons”) more closely, or by collaborating with partners to get data from other vantage points.
In one case, noticing a high estimated unseen count for a particular exploit prompted the team to simulate that exploit traffic to better understand its behavior, which in turn helped us tune our sensors to detect it more reliably going forward.
This work is driven by GreyNoise’s research and data science teams – the group of analysts, engineers, and scientists known as GreyNoise Labs who sit at the intersection of threat intelligence and analytics. The team’s diverse backgrounds are reflected in the approaches we use: it’s not every day that you find an IT or security team citing enzyme kinetics papers or biodiversity indices, but that’s exactly the kind of cross-disciplinary thinking that guides our strategy.
By experimenting, modeling, and validating with real data, the team ensures that GreyNoise’s sensor deployment strategies rest on a scientific foundation rather than vibe-based hunches. This scientific grounding also adds credibility to our insights. When we tell a customer that a surge in activity is just “noise” they can safely ignore, it’s because we’ve done the homework – much like an ecologist explaining that they’ve likely identified 95% of species in a habitat with a given confidence level. When we highlight an emerging threat, we do so with quantified estimates of how prevalent it is and how much more of it might be out there unseen.
GreyNoise’s approach to global internet traffic is part astronomy, part ecology, and all data-driven science.
From CCD bias frames we learned to subtract away the sensor noise to reveal faint signals, and we apply that daily by filtering out mass-scanning background noise to expose genuine threats. From ecology, we adopted tools to measure what we haven’t yet observed – treating IP addresses and exploits like species to be catalogued and estimated.
These analogies aren’t just clever comparisons; they’re working techniques that improve how we tune our sensors and interpret our data. By standing on the shoulders of scientific giants – be they astronomers or ecologists – GreyNoise is able to provide a clearer, more focused view of the threat landscape to security professionals. We believe that an accessible, scientifically-informed approach demystifies the internet’s noise and helps defenders make better decisions.
In the end, whether you’re looking at a photograph of a galaxy or an inbound firewall log, the goal is the same: remove the noise, enhance the signal, and find the truth that lies in the data.