AI & Machine Learning 101-Part 5: Detecting Bot Attacks vs Human Attacks with ML

Unsupervised machine learning can help determine if a cyber attack is launched by a human or by bots.


Of machine learning’s many useful applications in security, the ability to differentiate between human and non-human behaviors is becoming a critical advantage. Much of the internet today is automated, running day in and day out with little to no supervision. While many of these automated “bots” make our lives easier (e.g., search engine bots), plenty of automated processes are behind cybersecurity attacks. Bot attacks, which have become worryingly frequent in recent years thanks to their ability to function at high speeds with relative anonymity, and malware that spreads pervasively within an organization and programmatically connects back to a remote server are just two examples of attacks that involve non-human behaviors.

Being able to discriminate between humans and non-humans also allows us to improve the efficacy of threat detection and reduce false positives. The behaviors we expect from humans differ greatly from the behaviors we expect from non-humans. For example, would you expect interactive logins to appear from a service account? These differences mean that you risk getting a lot of “noise” and false positives unless you apply a different model to each.

Effective analytics that work in real-world scenarios are not single models, but rather sequences of models chained together. Here, we would first classify an account as either human or non-human and then apply the appropriate set of downstream behavioral models, just like taking a fork in the road.
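To make the fork in the road concrete, here is a minimal Python sketch of a two-stage chain. Everything in it is an assumption for illustration: the `AccountActivity` fields, the threshold values, and the simple rules that stand in for what would really be learned models.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class AccountActivity:
    """Hypothetical per-account summary; the field names are illustrative."""
    account: str
    active_hours_per_day: float  # average number of distinct hours with activity
    interactive_logins: int      # interactive logins seen in the same period


def classify_account(activity: AccountActivity) -> str:
    """Stage 1: a placeholder rule standing in for a learned classifier."""
    return "non_human" if activity.active_hours_per_day >= 20 else "human"


def human_model(activity: AccountActivity) -> List[str]:
    """Stage 2a: checks that only make sense for accounts used by people."""
    alerts = []
    if activity.active_hours_per_day > 16:
        alerts.append("unusually long active hours for a human account")
    return alerts


def non_human_model(activity: AccountActivity) -> List[str]:
    """Stage 2b: checks that only make sense for service accounts and scripts."""
    alerts = []
    if activity.interactive_logins > 0:
        alerts.append("interactive login from a non-human account")
    return alerts


# The fork in the road: the stage 1 label selects the downstream model.
DOWNSTREAM: Dict[str, Callable[[AccountActivity], List[str]]] = {
    "human": human_model,
    "non_human": non_human_model,
}


def analyze(activity: AccountActivity) -> Tuple[str, List[str]]:
    """Run stage 1, then only the downstream model that matches its label."""
    label = classify_account(activity)
    return label, DOWNSTREAM[label](activity)
```

With these assumed rules, an interactive login raises an alert only when the account has already been classified as non-human, which is how chaining helps keep noise down.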

An intelligent human, let’s call her Suzie, can complete this process manually by looking at the log files. For example, Suzie might notice that a source code repository log shows a user checking out source code nearly every hour of every day for the past seven days. She concludes that the “user” can’t be a real human and that a script or some other automated process must be at work. In another case, Suzie might see that a certain set of NetFlow records shows a lot of TCP activity on port 465 at 3:00 a.m. She knows from experience that TCP traffic on port 465 is likely outgoing email, and this volume at this hour is not what she expects from an actual user sending emails. Suzie identifies the source as a spambot.
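Suzie’s first observation can be written down as a simple check. The sketch below is only an illustration: the function names and the 0.9 threshold are assumptions, and it expects `window_start` to fall on an hour boundary. It measures what fraction of the hourly slots in a seven-day window contain at least one event.

```python
from datetime import datetime, timedelta
from typing import Iterable


def hourly_coverage(timestamps: Iterable[datetime],
                    window_start: datetime,
                    days: int = 7) -> float:
    """Fraction of hourly slots in the window that contain at least one event."""
    window_end = window_start + timedelta(days=days)
    occupied_slots = {
        (ts.date(), ts.hour)
        for ts in timestamps
        if window_start <= ts < window_end
    }
    return len(occupied_slots) / (days * 24)


def looks_automated(timestamps: Iterable[datetime],
                    window_start: datetime,
                    days: int = 7,
                    threshold: float = 0.9) -> bool:
    """Flag activity in nearly every hour of every day (threshold is an assumed value)."""
    return hourly_coverage(timestamps, window_start, days) >= threshold


# Illustrative data: a script that runs every hour, and a person working 9 to 5.
start = datetime(2024, 1, 1)
script_events = [start + timedelta(hours=h, minutes=5) for h in range(7 * 24)]
office_events = [start + timedelta(days=d, hours=h)
                 for d in range(5) for h in range(9, 17)]
```

Here the hourly script fills every slot in the window, while the office worker fills 40 of the 168, so only the script is flagged.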

Suzie is very capable of this task, but, unfortunately, the task is much bigger than she can handle alone. Think about it: how many humans would it take to thoroughly look at every source code log file and every NetFlow record, for every possible combination of user, port and time, for Suzie and her team to be 100 percent certain that no threat has been missed? It would simply take too many humans and too many hours‒and this case involves just two data sources. Imagine if there were a dozen data sources, or even a hundred. This is where machine learning makes a difference. Machine learning can automate what Suzie does, turning it into a highly scalable and cost-effective way for her team to analyze billions of events every day and identify the small handful of actual threats.

Statistical and probabilistic methods are very good at both detecting patterns and noticing the unusual and abnormal. By observing your company’s data, these methods can measure normal behavior patterns and detect when a user account deviates from them. They can also identify whether a user’s activity is being carried out by an actual person. With machine learning, you can evaluate network records and other datasets to separate humans from machines automatically.
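One of the simplest ways to measure deviation from normal behavior is a z-score: how many standard deviations an observation sits from an account’s own history. The sketch below is a deliberately basic stand-in for the richer probabilistic models a real system would use; the threshold of 3 and the sample numbers are assumptions.

```python
from statistics import mean, stdev
from typing import Sequence


def z_score(history: Sequence[float], observed: float) -> float:
    """Distance of `observed` from the historical mean, in standard deviations."""
    mu = mean(history)
    sigma = stdev(history)  # sample standard deviation; needs at least two points
    if sigma == 0:
        # A perfectly constant history: any change at all is a deviation.
        return 0.0 if observed == mu else float("inf")
    return (observed - mu) / sigma


def is_anomalous(history: Sequence[float], observed: float,
                 threshold: float = 3.0) -> bool:
    """Flag observations far from the account's own baseline (assumed threshold)."""
    return abs(z_score(history, observed)) > threshold


# Illustrative data: files accessed per day by one account over a week.
daily_file_accesses = [10, 12, 11, 9, 13, 10, 11]
```

Against this baseline, a day with 12 accesses is unremarkable, while a day with 40 is flagged.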

There are many types of machine learning, as we learned earlier in this guide. For this particular job, we can choose unsupervised, online machine learning to reduce the burden on security teams further. Unsupervised machine learning means that you don’t have to rely on a human to tell the security analytics system which data records belong to human activity and which belong to bots. This type of labeling exercise would clearly take a lot of time and, more importantly, the right labels would likely vary between users (e.g., for some users, sending emails at 3 a.m. might be perfectly normal given their local time zone).
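As a small illustration of learning without labels, the sketch below groups accounts by a single feature using one-dimensional k-means with two clusters. The choice of k-means, the feature (the fraction of hours in which an account was active), and the account names are all assumptions for the example. Nobody tells the algorithm which accounts are bots; the two groups emerge from the data.

```python
from typing import Dict, Sequence, Tuple


def two_means_1d(values: Sequence[float], iterations: int = 20) -> Tuple[float, float]:
    """1-D k-means with k=2, seeded at the minimum and maximum values."""
    low, high = min(values), max(values)
    for _ in range(iterations):
        low_group = [v for v in values if abs(v - low) <= abs(v - high)]
        high_group = [v for v in values if abs(v - low) > abs(v - high)]
        new_low = sum(low_group) / len(low_group)
        # high_group is empty only when every value is identical.
        new_high = sum(high_group) / len(high_group) if high_group else high
        if (new_low, new_high) == (low, high):
            break  # assignments have stopped changing
        low, high = new_low, new_high
    return low, high


def cluster_accounts(coverage: Dict[str, float]) -> Dict[str, int]:
    """Assign each account to cluster 0 (lower center) or 1 (higher center)."""
    low, high = two_means_1d(list(coverage.values()))
    return {
        account: 0 if abs(value - low) <= abs(value - high) else 1
        for account, value in coverage.items()
    }


# Illustrative data: fraction of hours in a week with activity, per account.
coverage = {"alice": 0.22, "bob": 0.25, "carol": 0.19,
            "svc-build": 0.97, "svc-mail": 0.99}
clusters = cluster_accounts(coverage)
```

In this toy data the two service accounts land in one cluster and the three people in the other. Deciding which cluster corresponds to "non-human" is a separate interpretation step that the clustering itself does not perform.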

Online machine learning means that the models learn in situ, from your data within your environment (offline machine learning, in contrast, means that your models learn from a fixed set of data taken as a historical snapshot). Online learning tends to be more flexible and makes fewer assumptions about your environment. Ultimately, it does a better job of keeping up with your environment as it changes and reduces the need for you to regularly re-tune the models. As you can imagine, the humans and automated scripts and services in your environment are unique to your company and change all the time, so online machine learning avoids the continuous, time-consuming adjustments that a more traditional rules-based approach would demand.
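The core idea of online learning, updating a model one observation at a time rather than retraining on a stored snapshot, can be shown with Welford’s algorithm for a running mean and variance. This is a sketch of the general technique, not a description of any particular product; the class name is hypothetical.

```python
class RunningStats:
    """Online mean and variance (Welford's algorithm): no stored history needed."""

    def __init__(self) -> None:
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        """Fold one new observation into the baseline."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        """Sample variance of everything seen so far (0.0 until two points arrive)."""
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0


# Illustrative use: the baseline adapts as each day's count arrives.
baseline = RunningStats()
for daily_count in [10, 12, 11, 9, 13]:
    baseline.update(daily_count)
```

Because each update touches only three numbers, a baseline like this can be kept per account and per data source and will drift along with the environment, instead of being frozen at whatever the historical snapshot looked like.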

Ultimately, machine learning reduces the burden on security teams while increasing the accuracy of threat detection. Enterprises today simply have too many data sources producing too much information to analyze it all effectively by hand. Unsupervised, online machine learning makes the impossible much more practical.

Stephan Jou is Chief Technology Officer at Interset.