How Cybersecurity Can Leverage Big Data

A summary of insights into how big data enables faster threat detection and improved SOC efficiency


In the age of digital transformation, big data is advancing into multiple aspects of enterprise operations, cybersecurity included. But what is actually needed to make it work? Magical data scientists who can wave wands to suddenly solve all your problems? Vast server farms storing terabytes of data? There is more than meets the eye when it comes to accelerated threat detection and response with big data and machine learning.

Many people think Equifax/SEC/Deloitte/Whole Foods should have done better. Yet we understand it’s not so easy to juggle resources, expertise, time, and technology.

And yet there is no choice. As Gartner states, “By 2020, 60% of businesses will suffer major service failures due to IT security teams’ inability to manage digital risk.”

The only way to get out of that game of Whac-a-Mole is to deploy security analytics.

But not just any security analytics. To be successful and scalable, security analytics needs to be self-learning, automated, and integrated. Without self-learning, you can’t scale. Without automation, you can’t scale. And if it’s not integrated with your existing security tools, it’s going to leave you with blind spots in your visibility.

Along these lines, I want to expand on one aspect of big data and machine learning: the difference between big data storage and big data compute. Big data storage means you have the ability to store vast amounts of data. This is the data lake, much like a library full of books. Big data compute is something different. It relies on that data being available, but it provides a layer on top that makes the data useful.

A book on a shelf, or a data file stored on a server, is of no use unless someone knows to look for it or to connect it, in context, to another piece of stored data. Meaning and value come from the big data compute that connects the dots. And I’m not just talking about an indexing system. There is far too much data, in too many dynamic and variable types, to process and extract meaning from with just an index or a digital Dewey Decimal System. You still need metadata and tagging, but on their own they will not give you the scalability of data processing that you need.
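To make the storage-versus-compute distinction concrete, here is a minimal Python sketch. Everything in it is an illustrative assumption: the record layouts, the field names, the sample events, and the ten-minute window are hypothetical and not taken from any particular product. The two lists stand in for files sitting on the shelf in a data lake; the `correlate` function plays the role of the compute layer that connects them by user and time.

```python
from datetime import datetime, timedelta

# Hypothetical records standing in for two separate files in a data lake.
LOGIN_EVENTS = [
    {"user": "alice", "time": datetime(2017, 10, 2, 3, 5), "source": "unfamiliar-host"},
    {"user": "bob", "time": datetime(2017, 10, 2, 9, 0), "source": "office-laptop"},
]
FILE_EVENTS = [
    {"user": "alice", "time": datetime(2017, 10, 2, 3, 9), "path": "/finance/payroll.xlsx"},
    {"user": "bob", "time": datetime(2017, 10, 2, 14, 30), "path": "/wiki/onboarding.md"},
]


def correlate(logins, file_events, window=timedelta(minutes=10)):
    """Pair each login with file accesses by the same user shortly afterwards."""
    # Group file events by user so each login only scans relevant records.
    by_user = {}
    for event in file_events:
        by_user.setdefault(event["user"], []).append(event)

    linked = []
    for login in logins:
        for access in by_user.get(login["user"], []):
            delay = access["time"] - login["time"]
            # Keep only accesses that happen within the window after the login.
            if timedelta(0) <= delay <= window:
                linked.append((login["user"], login["source"], access["path"]))
    return linked


if __name__ == "__main__":
    for user, source, path in correlate(LOGIN_EVENTS, FILE_EVENTS):
        print(f"{user}: login from {source} followed by access to {path}")
```

Neither list says much on its own. The link between an odd-hours login and a sensitive file access only appears once something computes across both, which is the point of the library analogy.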

What you really need is the big data compute to be self-learning, automated, and integrated, so it can keep pace with the Whac-a-Mole game that’s in play.

Figure: Big data storage vs. big data compute (Interset Security Analytics)

This requires machine learning, and it’s important to know that not all machine learning is the same.

There are two major categories of machine learning: supervised and unsupervised. Supervised machine learning is where data is well-labeled, organized, almost tidy. It’s when you have a vast amount of big data, with labels. (Malware is an example of a domain where we have a lot of labeled data.) Unsupervised machine learning is where data is more sparse, less labeled, and you need to infer meaning from looking at the data, rather than being told what the data is about. This is where the machine can collect data, recognize patterns that are emerging, and bring them to your attention. It is more suitable for use cases like Advanced Persistent Threats (APTs) or insider threats.
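The contrast between the two categories can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not how any specific security product works: the features (file entropy and a count of suspicious API imports), the sample values, and the three-standard-deviation threshold are all hypothetical. The supervised half uses a simple nearest-centroid classifier trained on labeled samples; the unsupervised half has no labels and flags values that deviate from a user’s own baseline.

```python
from statistics import mean, stdev

# Supervised: labeled examples are given up front (hypothetical features).
# Each sample is (file entropy, count of suspicious API imports) plus a label.
LABELED_SAMPLES = [
    ((7.8, 12.0), "malware"),
    ((7.5, 9.0), "malware"),
    ((4.1, 1.0), "benign"),
    ((3.9, 0.0), "benign"),
]


def train_centroids(samples):
    """Average the feature vectors of each label (nearest-centroid classifier)."""
    grouped = {}
    for features, label in samples:
        grouped.setdefault(label, []).append(features)
    return {
        label: tuple(mean(column) for column in zip(*rows))
        for label, rows in grouped.items()
    }


def classify(centroids, features):
    """Return the label whose centroid is closest to the new sample."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(centroid, features))
    return min(centroids, key=lambda label: squared_distance(centroids[label]))


# Unsupervised: no labels, only one user's own history (hypothetical data).
DAILY_MB_UPLOADED = [20, 25, 22, 19, 24, 21, 23]


def is_anomalous(history, observed, threshold=3.0):
    """Flag a value more than `threshold` standard deviations from the baseline."""
    baseline, spread = mean(history), stdev(history)
    return abs(observed - baseline) > threshold * spread


if __name__ == "__main__":
    centroids = train_centroids(LABELED_SAMPLES)
    print("new file classified as:", classify(centroids, (7.2, 8.0)))
    print("900 MB upload anomalous?", is_anomalous(DAILY_MB_UPLOADED, 900))
```

The supervised classifier can only answer with labels it was trained on, which suits well-labeled problems such as malware. The baseline check needs no labels at all, which is why that style of approach fits insider threats, where nobody can say in advance what the bad behavior will look like.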

Unsupervised machine learning can infer insight from the data itself. Because it’s a machine, it can take in many more pieces of data than a person could and compare them against the inferred hypothesis, which increases accuracy.

Figure: Supervised vs. unsupervised machine learning (Interset)

To test your knowledge on what’s true, what’s a lie, and what’s a fairy tale (something based in truth, but not 100% true), check out these slides. Tell us how you do, and contact us with any questions you may have!