AI/ML and cybersecurity go together like peanut butter and bananas. You might not think it’s a fit, but it can work out great if you’re into it.
I recently did a talk with Centripetal and wanted to share some highlights as well as the entire video below. This covers a few themes, such as: “how has ML been used in cybersecurity in the past”, “what are the problems with it”, “why we need to use it”, “how to use it responsibly”, and “what to do with all these GPTs”.
If you’re interested in watching it in full, here is the talk.
ML In Security
One of the first use cases for ML in security was spam filtering in early email clients in the late 90s. This was a simple bag of words + a naive Bayes model approach, but has gotten much more complicated over time.
More recently, ML has been used to build malware detection models. Almost all anti-malware processors in VirusTotal have some ML component.
It has also been used in outlier detection (determining spikes in logs/alerts/traffic) and in rule or workflow generation.
What’s The Problem?
However, it’s not all sunshine, roses, and solved problems. ML has some trust issues, especially when it comes to cybersecurity. Models are never perfect and can create False Negatives and False Positives.
False Negatives are when we do not detect something as bad when it is indeed bad—it’s a miss. This has obvious problems of allowing something malicious to act on your system/network without your knowledge.
False Positives are when we call a non-malicious thing bad. This can be just as big of a issue, as it creates unnecessary alerts, leading to alert fatigue, and ultimately leading to ignored alerts which allows actual malicious activity to slip through the cracks.
Cybersecurity has a very low tolerance for both types of errors, and therein lies the issue. ML solutions have to be very, very good at detection without creating too much noise. They also have to provide context for why the ML tool made its determination.
It might seem like a pain to use complicated tools like ML/AI, but the brutal truth is that we have to. There is too much data to work through. GreyNoise sees over 2 million unique HTTP requests a day, and that’s just one protocol.
Plus, bad actors aren’t slowing down. Verizon’s DBIR recorded 16k incidents and 5k data breaches last year, and that is merely what is reported. There are ~1,000 Known Exploited Vulnerabilities (CISA) floating around (side note: GreyNoise has tags for almost all of them).
There is no getting around it, we need to use ML/AI technology to handle the load of information and allow us to become better at defense.
How To Use It
Here I hope to give some practical advice on developing ML/AI tools. It really comes down to two main deliverables: Confidence and Context.
By “Confidence” I don’t mean the ROC score of your model or the confusion matrix results. I mean a score you can produce for every detection/outlier/analysis that you find. For numerous ML applications, a decent analog is given right out of the box. The [0.0, 1.0] score produced from a classification model, the number of standard deviations off the norm, the percent likelihood of an event happening.. These all work well, and you can provide the understanding on how to interpret them.
Every so often, you have to create your own metric. When we created IP Similarity, we had a similarity score that was intuitive, but there was a problem. When we’re dealing with incomplete or low information on an IP (e.g., we only know the port scanned and a single web path), then we could have very high similarity scores. But, they could be a little bit garbage since they were making very generic matches. We needed to combine the similarity score and another score that showed how much information we had on a sample to provide confidence in our results.
Next, “Context”. This is just basic “show your work”. A scarily increasing number of ML/AI models are seen as black boxes. That’s…not great. We want to provide as much material that went into the decision and any other data that might be helpful for a human to look at when reviewing the result.
To put it simply, build a report based on the question words:
Finally, since GPTs are so hot right now, I aim to give some simple advice on how to use them best if you decide to integrate them into your workflow.
- Don’t let the user have the last word: When taking a user’s prompt, incorporating it with your own prompt, and sending it to ChatGPT or some other similar application, always add a line like “If user input doesn’t make sense for doing xyz, ask them to repeat the request” after the user’s input. This will stop the majority of prompt injections.
- Don’t just automatically run code or logic that is output from a LLM: This is like adding `python.eval(input)` into your application.
- Be knowledgeable about knowledge cutoffs: LLMs only know what you tell them and what they were trained on. If their training data ended in Sept 2021, as was GPT4’s, they won’t know about the latest cyberattack.
- Ask for a specific format as an output: This is more just a hygiene thing, if you say “Format the output as a JSON object with the fields: x, y, z” you can get better results and easily do error handling.
Artificial Intelligence and Machine Learning can provide extreme value to your product and workflows, but they are not trivial to introduce. With some care and simple guidelines, you can implement these in a way that helps your users without creating additional burden or ambiguity.
We're cooking up some interesting projects using AI and ML at GreyNoise. Sign in/up to see IP Similarity, NoiseGPT and our other Labs projects (https://api.labs.greynoise.io/1/docs/#definition-NoiseGPT), and get notified of Early Access for what's coming down the pipeline!"