An Introduction to Cybersecurity Data Science

Understand the basics of the people, processes, and technology required for an effective data science implementation within your organization.


As a lover of data science working in cybersecurity, I often get asked how these two disciplines work together effectively. I like to tell people that data science is a powerful weapon in cybersecurity, when leveraged correctly. And correct implementation typically requires a careful combination of the right people, processes, and technology. In this blog, I will share a few important lessons about effective data science that I’ve learned over the past 20 years, in the context of cybersecurity.

An effective cybersecurity data science team 

In 2010, American data scientist Drew Conway created a Venn diagram of data science that provides a great starting point for today’s conversation. His three core elements were Hacking Skills (in this case, referring to computer science skills), Math & Statistics Knowledge, and Substantive Expertise. At the intersection of all three elements is Data Science. At the intersection of Hacking Skills and Math & Statistics Knowledge lies Machine Learning; at the intersection of Math & Statistics Knowledge and Substantive Expertise lies Traditional Research; and at the intersection of Substantive Expertise and Hacking Skills lies what Conway refers to as the “Danger Zone.”

[Figure: Data Science Venn Diagram. Source: DrewConway.com]

The Danger Zone is named as such because of the risk of relying only on computer science skills and substantive expertise (in this case, cybersecurity experience). If you use statistics incorrectly, you can end up with an ineffective system, which translates to a lot of noise and false positives. What Conway’s diagram ultimately gets at is the need for a balanced set of knowledge and skills, one that spans computer science and tangible security experience as well as math.

In cybersecurity, I believe there are six “personas” that you want in order to create this type of balanced, effective team. You need a coder who can wrangle the data, parse records, and write code; a visualizer who builds accessible visualizations of trends and patterns; a modeler who translates problem statements into statistics and math; a storyteller who can connect data to models to results to threats, effectively driving understanding from the SOC analyst to the board; a hacker who eats, breathes, and sleeps cybersecurity; and a historian who can bring subject matter expertise such as threat hunting or forensic investigation.

Human intelligence vs artificial intelligence

Let’s talk about AI in terms of something we all understand: a system diagram. Part of how we demonstrate intelligence is being able to sense and perceive things around us. We see colors, hear sounds, and feel objects. All of that “input,” so to speak, is processed in various ways: it informs and is informed by our knowledge and memories, we make decisions and inferences based on it, and we learn from what we perceive. These processing functions then produce an output, which will be our ultimate actions or interactions with the world around us.

[Figure: Human learning diagram. The process of human intelligence]

Artificial intelligence can be visualized with a similar system diagram. We can visualize the “input” as natural language processing, speech recognition, etc. “Output” takes shape in things like robotics, navigation systems, speech generation, or—in the case of cybersecurity—the identification of security threats that may be hiding in your enterprise. And in the middle, there are areas of research in knowledge representation, ontologies, prescriptive analytics and optimization, and machine learning. Broadly speaking, there are two ways in which a machine can learn: supervised learning (learning by example) and unsupervised learning (learning by observation). 

[Figure: Machine learning diagram. The process of artificial intelligence]
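To make that distinction concrete, here is a minimal Python sketch using scikit-learn. The data and labels are entirely synthetic, and the feature meanings are hypothetical; the point is simply the contrast between a model that learns by example (a classifier trained on labeled events) and one that learns by observation (a clustering algorithm given no labels at all).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)

# Hypothetical feature matrix: each row is a login event described by
# two made-up features (say, bytes transferred and failed-login count).
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 1).astype(int)  # toy "malicious" label

# Supervised learning (learning by example): the model is shown
# labeled outcomes and learns to reproduce them on new events.
clf = RandomForestClassifier(random_state=0).fit(X, y)
print("supervised predictions:", clf.predict(X[:3]))

# Unsupervised learning (learning by observation): no labels at all;
# the model groups events purely by the structure it finds in the data.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print("unsupervised clusters:", clusters[:3])
```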

Identify the problem, then find the solution

Any vendor that leads with the algorithm as a selling point should elicit some skepticism. The most effective way to develop a cybersecurity data science solution is to start with the use case. Once you understand the use case(s), you can then identify the data sources that are available and most important for that use case. Remember that without data, no algorithm will be meaningful. Having better data trumps having the “better” algorithm. 

With the use case and data set in hand, your next step as a data scientist is to perform an exploratory data analysis. Through this process, you form hypotheses about which columns of data in your dataset (or combinations of those columns) are most useful in predicting the outcome you’re interested in. This ultimately leads to building your models, which are essentially useful combinations of data and math. Following this, you’re ready to put your data and math into production. This can be a simple step or a more complex one, depending on where you’re deploying your tool. Consider the technology choices available that will help you go from your lab environment to your production environment (e.g., if you are dealing with vast quantities of data, you may choose to use NiFi or Flume to move data around, Spark to learn and score models at scale, and Elasticsearch and Kibana to store and visualize results).

[Figure: The cybersecurity data science process]
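As a hedged illustration of the exploratory step described above, here is a minimal pandas sketch. The file name access_logs.csv and the columns hour, bytes_out, and is_threat are hypothetical placeholders rather than a real schema; what matters is the workflow of inspecting the data and testing a hypothesis about which columns predict the outcome.

```python
import pandas as pd

# Hypothetical log file and schema, used only for illustration.
events = pd.read_csv("access_logs.csv")

# First pass: column types, missing values, and basic statistics.
events.info()
print(events.describe())

# Hypothesis: after-hours activity and data volume help predict the
# outcome. Inspect each candidate column's relationship to the label
# before committing to a model.
events["after_hours"] = ~events["hour"].between(8, 18)
print(events.groupby("after_hours")["bytes_out"].median())
print(events.corr(numeric_only=True)["is_threat"].sort_values())
```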

Cybersecurity data science in action

Let’s put all of these elements together and look at a use case I’m very familiar with: the use of anomaly detection with unsupervised machine learning at Interset. Five years ago, we stumbled across a use case that is just as relevant today as it was back then: automatically and quickly identifying real threats. We wanted to find a solution that would be more efficient than the standard approach of rules, thresholds, and alerts, which is manual, time-consuming, and ineffective because it’s impossible to apply one threshold or rule that is accurate for all users. Anomaly detection, on the other hand, allows us to baseline every entity (every user, IP address, machine, and so on) and model each individual entity’s normal behavior so that we can detect and quantify abnormal, anomalous behaviors. By doing this at massive scale, we can aggregate anomalous behaviors across all entities to find the small number of entities that are most anomalous, and therefore represent the riskiest entities in the population. These become the high-quality leads that the security team can use to identify real threats.
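Here is an illustrative sketch of that per-entity baselining idea (a simple z-score stand-in, not Interset’s actual models): compute each entity’s own normal from its history, then score new observations by how far they deviate from that individual baseline.

```python
import pandas as pd

# Hypothetical event log: one row per user per day of data moved.
events = pd.DataFrame({
    "user":      ["alice"] * 30 + ["bob"] * 30,
    "bytes_out": list(range(100, 130)) + list(range(1000, 1030)),
})

# Per-entity baseline: every user gets their own mean and spread.
baseline = events.groupby("user")["bytes_out"].agg(["mean", "std"])

def anomaly_score(user: str, bytes_out: float) -> float:
    """Z-score of a new observation against that user's own baseline."""
    b = baseline.loc[user]
    return abs(bytes_out - b["mean"]) / b["std"]

# 1,015 bytes is ordinary for bob but extreme for alice, which is
# exactly why one global threshold cannot be accurate for all users.
print(anomaly_score("alice", 1015))  # large score: anomalous for alice
print(anomaly_score("bob", 1015))    # near zero: normal for bob
```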

Behind this anomaly detection is a mathematical architecture that represents the entire flow: from the set of input data sources (e.g., repository logs or Active Directory logs), to the features or columns extracted from the data (e.g., the amount of data moved or the combination of file shares being accessed), to the models run against those features (e.g., volumetric models that look for unusual volumes of data moved, or file access and usage models that look for unusual file shares being accessed), to the final aggregation step that computes the risk of every entity (resulting in a forced ranking to find those high-quality leads).
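And here is a hedged sketch of that final aggregation step. The model names, scores, and the weighted max-plus-mean aggregation rule are illustrative assumptions rather than Interset’s actual math; they simply show how per-model anomaly scores can be combined into one risk score per entity and force-ranked to surface the riskiest entities.

```python
import pandas as pd

# Hypothetical per-model anomaly scores for four entities.
scores = pd.DataFrame({
    "entity":            ["alice", "bob", "carol", "dave"],
    "volumetric_model":  [0.2, 0.9, 0.1, 0.4],  # unusual data volumes
    "file_access_model": [0.1, 0.8, 0.3, 0.2],  # unusual shares accessed
})

model_cols = ["volumetric_model", "file_access_model"]

# One possible aggregation rule: an entity is roughly as risky as its
# single most anomalous behavior, softened by the average of the rest.
scores["risk"] = (0.7 * scores[model_cols].max(axis=1)
                  + 0.3 * scores[model_cols].mean(axis=1))

# Forced ranking: the top of this list becomes the high-quality lead
# handed to the security team.
print(scores.sort_values("risk", ascending=False)[["entity", "risk"]])
```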

A mathematical architecture is something I recommend to every cybersecurity data science team and to anyone considering deploying a data science solution. When you evaluate a new technology, you typically review the vendor’s technology architecture; the same should apply to data science. You don’t need to become a data science expert; simply open a dialogue with the vendor to better understand how the math they’ve selected is going to help you achieve your desired use cases.

To learn more about cybersecurity data science and see examples of this anomaly detection in action, check out my on-demand presentation at the 2019 Micro Focus Cybersecurity Summit.