Six Personas for an Effective Cyber Data Science Team

Our CTO’s recent experience at a hackathon illustrates how the most potent data science in cybersecurity comes from an intriguing array of skill sets.

I was recently invited by an Interset customer to help with their hackathon event. This was an awesome event, where teams of employees gather to learn, brainstorm, invent, and build technology solutions to solve problems. I jumped at the opportunity to show the amazing things that can be accomplished with effective cyber data science, something that Interset’s R&D teams do each day, when we build that expertise into our technology. Our customers don’t need to have data science teams or 24-hour-long hackathons to take advantage of this bleeding-edge approach: Our data science teams have done the hard work and automated this for you!

At this event, we had more than 100 hackathon participants over three R&D sites. For 24 hours, in an awesome party-like atmosphere, we got together in the cafeteria to hack out solutions to some really tough and interesting problems.

Our team of eight people worked on a cool, important hackathon challenge. Given some VPN and authentication data sets that contained evidence of a (caught) attacker's activities, could we design and build models to detect and prevent similar attacks in the future?

We also wanted to do this using statistical and machine-learning models, so that the models could generalize to detect attackers that used a similar infiltration style. This avoids the “hard-coding” that a more traditional, rules-based approach might take, as that would be too noisy and ineffective in practice.

In other words: data science to the rescue!

As the 24 hours progressed (and the pizza and coffee flowed), it struck me that the eight people around the table were fantastic representations of the combination of skills and personas required for effective cyber data science. As originally proposed by Drew Conway’s Venn Diagram, data science is not a single discipline, but a confluence of disciplines that span mathematics and statistics, computer science, and subject-matter expertise.

Below, I’ve listed the personas that make up an effective cybersecurity data-science team, based on the actual people who were on our hackathon team. But they are (in my opinion) the same personas that you want on any team, if you intend to build effective data-science systems for cybersecurity threat detection.

The Hacker: These are the security experts, pen testers, the white hats who can think like the threat actor and translate what shows up in digital-event log files into human behaviors. They’ve seen it before, know how it’s done, and how it will be done again. On our team, we had a CISO and a security analyst who were familiar with the specific breach in the log files, and also had years of experience defending this company against all sorts of hacks and nation-state attacks. They fit the (white-hat) Hacker persona perfectly.

The Historian: We were able to bring in subject-matter experts on the data files when we had very specific questions. In this case, these IT and firewall experts were able to explain the different configurations used by the authentication system, explain a gap in the data that was the result of an outage, describe why we were seeing three completely different formats in a single data set when the last system first came online, and so on. The Historians are important in understanding the semantics and context behind the digital bits.

The Coder: A large amount of time in real-world data science is spent on data wrangling: wrestling the data out of the source system, parsing out the records, writing code to intelligently filter out noisy records, cleaning the columns of data, and generating more predictive columns of data from them. A lot of time here is spent developing code, everything from Python scripts to NiFi processors to SQL. On our team, the Coders were engineers, data scientists, and interns who bashed out a collection of Python, Haskell, and a boatload of regular expressions, turning large volumes of ridiculously messy data into tidy data for analysis.
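To make the Coder's wrangling steps concrete, here is a minimal Python sketch of parsing, filtering, cleaning, and deriving a column. The log format, the field names, and the helper names (parse_line, tidy) are all invented for illustration; they are not the customer's data or the code our team wrote.

```python
import re
from datetime import datetime

# Hypothetical VPN log format, assumed purely for this example.
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})\s+"
    r"user=(?P<user>\S+)\s+"
    r"src=(?P<src>\d{1,3}(?:\.\d{1,3}){3})\s+"
    r"bytes=(?P<bytes>\d+)\s+"
    r"duration=(?P<duration>\d+)s$"
)


def parse_line(line):
    """Return a tidy record dict, or None if the line does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None  # noisy or malformed record
    rec = match.groupdict()
    # Clean the columns: convert strings to proper types.
    rec["ts"] = datetime.strptime(rec["ts"], "%Y-%m-%dT%H:%M:%S")
    rec["bytes"] = int(rec["bytes"])
    rec["duration"] = int(rec["duration"])
    # Derived column: average transfer rate (0.0 for zero-length sessions).
    rec["bytes_per_sec"] = (
        rec["bytes"] / rec["duration"] if rec["duration"] else 0.0
    )
    return rec


def tidy(lines):
    """Parse every line and drop the ones that could not be parsed."""
    records = (parse_line(line) for line in lines)
    return [r for r in records if r is not None]


RAW = [
    "2018-03-01T08:15:00 user=alice src=10.0.0.5 bytes=2048 duration=16s",
    "### heartbeat ###",
    "2018-03-01T08:20:30 user=bob src=10.0.0.9 bytes=500 duration=0s",
]

rows = tidy(RAW)
```

In this toy input the heartbeat line is discarded and the two session lines become typed records. Real log files are rarely this regular, which is why the Historians' knowledge of format changes and outages matters so much at this step.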

The Visualizer: When exploring data to look for trends and patterns, there is an almost constant stream of visualizations being produced. What is the relationship between the bytes transferred and the session duration? Where across the globe are the incoming connections coming from? Are there clusters of file-access patterns that emerge from the population that may be related to the organizational structure? These questions and more are best and most quickly answered by tapping into our natural ability to visually spot trends and patterns in charts and visualizations. Our team’s Visualizers were two data scientists and one BI expert who were able to quickly build visual insights for us using R and Excel.

The Modeler: The Modeler can take sentences with words like “weird,” “unusual,” and “anomalous,” and turn them into math. This math can become statistical tests to see if sentences that start out as hypotheses about how to detect a threat are statistically valid. They are then transformed into machine-learning models that can ultimately be deployed into a production environment. Our two data scientists used R and their background in quantitative sciences to iterate and try out different statistical models.
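As one illustration of turning a word like "unusual" into math, the sketch below scores each observation with a modified z-score based on the median and the median absolute deviation (MAD). This is a generic textbook technique chosen for the example, not the model our team built at the hackathon; the function names, the sample values, and the 3.5 cutoff (a common rule of thumb) are assumptions.

```python
from statistics import median


def robust_z_scores(values):
    """Modified z-scores using the median and MAD instead of mean and stdev."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # No spread to measure against, so nothing can be called unusual.
        return [0.0 for _ in values]
    # 0.6745 rescales the MAD so scores are comparable to standard z-scores
    # when the data are roughly normal.
    return [0.6745 * (v - med) / mad for v in values]


def flag_unusual(values, threshold=3.5):
    """Return the indexes whose modified z-score exceeds the threshold."""
    scores = robust_z_scores(values)
    return [i for i, z in enumerate(scores) if abs(z) > threshold]


# Hypothetical megabytes transferred per day by a single user.
daily_mb = [120, 135, 110, 128, 131, 125, 2400]
unusual_days = flag_unusual(daily_mb)
```

Here only the last day is flagged. The median and MAD are used rather than the mean and standard deviation because a single extreme value would otherwise inflate the very baseline it is being compared against.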

The Storyteller: The result of a cyber data-science team is a set of models that needs to be described to a stakeholder—justified and explained, and put into production. This almost always involves a communication step, where the Storyteller can explain (almost always to a non-technical audience) the data set, the behaviors that are important to detect, what the model does, the tests they have done, and the visualizations that show the model results and demonstrate its effectiveness. In real life, this is a presentation to the Board or the CISO. At the hackathon, this was a presentation to the judges. We did well!

One person may have multiple personas. For example, I was a Coder, Modeler, and Visualizer at different phases of the hackathon. However, ultimately, I believe you need access to all of these personas to do effective cybersecurity data science.

Also note that effective cybersecurity data science is very much a team sport. During the exploratory data analysis, for example, it almost felt like we were all playing “math jazz.” In a highly ad-hoc and interactive manner, the Coders bring in data, the Historians explain what they are seeing, the Hackers brainstorm potential indicators, the Visualizers determine whether or not the indicator shows a trend, the Modelers turn it into math, and the Storytellers assemble and communicate the results. The whole process is one glorious, iterative, messy team collaboration, and it is what drives real-world data science.

The hackathon was an exciting and fantastic example of the type of collaborative innovation and human engineering that goes on daily for us at Interset HQ. We understand that finding and building a team with the six key personas of Hacker + Historian + Coder + Visualizer + Modeler + Storyteller is not easy. Our engineering team reflects what an effective cyber data-science team needs to be. That’s how we are able to bring embedded self-learning and pre-built models to our customers through our threat-detection platform. It’s what we love to do. I had a fantastic time at this company’s hackathon, and thank them for inviting me!

Learn More: Check out “Machine Learning: A Primer for Security,” also penned by Interset CTO Stephan Jou.