Transcript: Will the Real AI Please Stand Up [Webinar]

Supervised vs Unsupervised Machine Learning in Cybersecurity


This is the transcription of the first portion of the webinar, “Will the Real AI Please Stand Up?” You can view the webinar slides online or watch the full webinar on demand.

Start transcript

Stephan: All right, let’s do it. Thank you, everyone, for joining this webinar. We’re going to talk about “Will the Real AI Please Stand Up?”

It’s great to have you here. Just quick introductions: I’m Stephan Jou and I’m the CTO here at Interset; background in analytics; lots of fun and excitement implementing analytical systems to solve some really hard and really important problems. We are very excited to talk about the types of AI that can be applied to solve some really hard, important problems in cybersecurity, and how to tell the difference between effective AI, so-called “real” AI, and what may not be quite as effective as it could be. With me is Mario Daigle.

Right at the top I want to make one thing clear. This could be a really technical talk, but it’s my intention not to make it very technical. I don’t think we need a lecture in artificial intelligence—as much fun as that would be for me. We’re also going to keep the scope of this to cybersecurity. There’s a lot to AI: there are a lot of techniques, there are a lot of technologies, and there is certainly a lot of really impressive work going on across the board in many areas of the industry. But we want to focus on cybersecurity specifically because that’s what our day jobs are.

I also want to touch a little bit on different types of AI and different types of machine learning, in particular, but I want to frame it in the context of hopefully useful information that will help you evaluate what type of AI a particular vendor is using, what type of technology is applicable to a specific problem, and what type of technology might not be applicable.

I’ll also mention that we are happy to take whatever questions you might have at the end of the presentation. We’re going to give you some information and show you a live demonstration of AI and how that looks inside our Interset product. Whenever something comes to mind, if there’s something we mentioned that you would like additional information or detail on, please feel free to enter it into the interface and we’ll address it at the end of the demo.

With that said, it’s really no surprise that there’s this huge amount of interest in AI. Every week there’s some major announcement that involves AI, and whether it’s AI for good (like Google and self-driving cars) or AI for less good reasons (like Cambridge Analytica), there is certainly no shortage of progress coming in from all corners of the universe related to AI. And there are certainly a lot of reasons to think about how AI can be applied to cybersecurity. Indeed, if you think about everything from Siri and Google Voice Assistant to some of the things you might have heard about AI generating pieces of art or music or film scores, you know it’s pretty impressive what really good AI might be able to offer us inside cybersecurity.

With that said, the unfortunate side effect is that it feels like machine learning and artificial intelligence are almost everywhere. All you need to do is go to RSA, for example, and you’ll see no shortage of technologies and products (of which I confess we are one, for sure) that will use claims like “we use artificial intelligence” or “our machine learning has 100% no false positives.” They’ll assault you with different algorithm names like recursive Bayesian estimation, and I can truly sympathize with you if you are in the process of evaluating different technologies to find the right fit for your organization and for the cybersecurity challenges in front of you. It truly is confusing.

So, as I said earlier, there’s a lot to AI that I could talk about, but in the context of cybersecurity—in the context of the use cases that we can solve effectively with AI today—there are two main categories of machine learning specifically that we should keep in mind. These two categories are widely known as supervised and unsupervised. So, if there’s one thing that you take away from this webinar today, I’m hoping you develop an intuitive understanding of what these two classes of machine learning are and what they’re good at, because I think that is the most important question you can ask of yourself, the vendor, and the technology to see if it’s a good fit. Because, to be honest, one is not better than the other. One just happens to be more applicable to one set of use cases, and the other technique is applicable to a second set of use cases.

To make matters worse, there’s a lot of pressure on people to jump right to algorithms. It’s very impressive when you say, “we do recursive Bayesian estimation” or “we do a linear regression” and so on. It sounds very impressive, and don’t get me wrong—it’s probably important for you to ask questions about the underlying algorithms. But even the diagram that you see here is already a simplification of the algorithms and how they relate to supervised versus unsupervised.

Machine Learning Categories

It is not the case that if you are using ensemble methods, which you see in the middle of the green column, you are only looking at a regression-based system that exclusively uses supervised approaches. The reality is that different algorithms are applicable to different use cases, including the scenario where an unsupervised problem can be solved using supervised methods and vice versa. So, don’t get me wrong: it’s useful to get an understanding of the algorithms because it’s an easy question to ask, but it is not quite so cut-and-dried to be able to say, “OK, here are the algorithms. They all fall under supervised, therefore this is a supervised technology.” Unfortunately, I wish it were that simple, but it’s not.

So, what I want to do instead is go up a level and I want you to just temporarily forget about these algorithms. I mean, I can talk about these all day, but for the time being, let’s just try and get an intuitive understanding of supervised versus unsupervised and what they mean. So, instead of the word supervised, I want us to think about it as “learning by example,” and then instead of the word unsupervised I want us to think about that as “learning by observation.” Let’s drill into those two categories a little bit and then at the end sort of map it back to cybersecurity and, hopefully, that’ll be helpful.

When I think about supervised learning, this is learning by example, and it is very similar to the way most of us were taught in school, or the way we learned the names of things as we grew up and learned to speak English. Basically, when something is a supervised problem that needs to be solved using supervised methods, what you typically have is a set of data and an associated set of answers. By looking at the answers (the examples that you have) and by understanding the relationship between the raw data and the answer column, you’re able to develop “learning” that allows you to see an example in the future and give it the right name. An example here, which you see in the illustration, is a bunch of items, some of which are animals and some of which are food. As you can see, sometimes it’s actually quite difficult to know the difference between a dog and a muffin purely visually. But, guess what? Supervised machine learning approaches actually can make that determination.

Supervised Learning

With enough examples, it can actually do a pretty good job of learning the fine differences between things that look very similar. That’s the promise of machine learning. But it is all predicated on having enough data. If you do not have enough examples (i.e., if all you had in the third column here were these four examples of a blueberry muffin versus a dog versus a chihuahua), then any machine learning algorithm will struggle—just like any child looking at these pictures for the first time might take a little while to learn the differences between a dog and a muffin. In fact, the nature of the data, the amount of data, and the number of correct labels in that data (in other words, the number of examples you have) are so important that the data you have, and the amount of it, typically matters much more than the actual algorithm you choose to perform your supervised machine learning. It’s a well-understood result that the data actually trumps the algorithm that you choose. So with enough data, you can do amazing stuff.
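To make “learning by example” concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the feature values are invented stand-ins for image measurements (not a real dog-versus-muffin dataset), and it assumes scikit-learn is installed.

```python
# Minimal sketch of supervised learning: raw data plus an "answer column".
# Feature values are invented stand-ins for image measurements.
from sklearn.ensemble import RandomForestClassifier

X = [
    [0.9, 0.2, 0.1],  # round shape, little fur, no ears -> labeled "muffin"
    [0.8, 0.3, 0.2],
    [0.3, 0.9, 0.8],  # lots of fur, pointy ears -> labeled "dog"
    [0.2, 0.8, 0.9],
]
y = ["muffin", "muffin", "dog", "dog"]  # the answers (labels) we supply

model = RandomForestClassifier(random_state=0).fit(X, y)
print(model.predict([[0.85, 0.25, 0.15]]))  # expected: ['muffin']
```

The shape of the problem is the point: a data matrix plus an answer column. With enough labeled rows, the particular algorithm matters far less than the data, which is exactly the “data trumps the algorithm” result.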

So, if we go on to the next slide, the nice thing about the supervised approach is that it allows you to categorize things once you have enough examples. And when you start thinking about the data set and the problem, you need to think about how generalizable the data is, because that’s what’s going to make the models (the algorithms) work inside your environment. Here’s a good example. A problem like this is a good fit for supervised machine learning: given this face, what emotion is that person displaying? The reason is twofold.

Reason number one is that we actually have a lot of data out there. You can imagine trawling YouTube or looking for images on the web and, if you have enough time, labeling them “happy” or “sad.” Then there is no shortage of information, and the data is sufficiently large for a machine learning algorithm to learn that if the mouth is turned up a certain way, and the eyebrows are angled a certain way, and the cheek structures are positioned a certain way, then this person is probably happy. On the other hand, if they’re positioned this other way, then this person is probably sad. Just like how, with enough data, we can learn the difference between a blueberry muffin and a chihuahua, with enough data a machine learning model using supervised methods can learn, without being told, to tell the difference between the emotions people display. That’s reason one: we’ve got lots of data.

Reason number two is that the problem is actually very generalizable. It doesn’t matter that the machine learning algorithm is seeing a particular human face for the first time. The way the human race displays happiness versus sadness versus anger is almost universal. So you can build a model on a million faces and then apply that model to a million new faces the algorithm has never seen before, and the answers will be correct. That’s what I mean by generalizable. That’s a really useful property, and it makes this problem perfect for supervised approaches.
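Here is a small sketch of that generalization test: train on one set of labeled examples, then score on a held-out set the model has never seen. It assumes scikit-learn and uses its bundled digits dataset purely as a stand-in for labeled face images.

```python
# Sketch: does a supervised model generalize to examples it has never seen?
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)  # stand-in for a labeled face dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=5000).fit(X_train, y_train)
# Accuracy on never-seen examples is the generalization check.
print("held-out accuracy:", model.score(X_test, y_test))
```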

So, in contrast with learning by example, let’s talk about learning by observation. This is different. The main difference is that we are not giving the algorithm answers and we are not giving it specific labels to learn. Instead, we’re just giving the algorithm a bunch of data and we’re saying, “Here is a bunch of unlabeled data, and I want you to find interesting combinations. I want you to find things that are similar to each other. I want you to find groups, and find things that are normally associated with each other.” It’s a little bit more difficult to define in words, but the idea is that, with enough data and with enough observations, you can learn that certain things are very similar to each other. This is just like how a child can learn that there is one group of animals that they see all the time on the street and, while they might not know that they’re called dogs, the child would learn that they all have long hair and long tails and they all make a barking sound. On the other hand, there’s another separate group of animals and, without being told that they’re called cats, the child might learn that they make this meowing sound, have short hair, and typically have shorter tails. So basically it’s unguided in that respect. There are algorithms that are really good at that and, again, with enough data, you don’t have to have all the answers. You don’t need labels. That’s the advantage: you’re still able to derive value out of the data.
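A minimal code sketch of that idea: hand an algorithm unlabeled observations and let it find the groups on its own. The features (hypothetically fur length, tail length, and pitch of the sound the animal makes) are invented, and scikit-learn is assumed.

```python
# Sketch of "learning by observation": no labels, just grouping similar items.
from sklearn.cluster import KMeans

observations = [
    [0.9, 0.90, 0.20],  # long hair, long tail, low-pitched (bark-like) sound
    [0.8, 0.85, 0.10],
    [0.3, 0.40, 0.90],  # short hair, shorter tail, high-pitched (meow-like) sound
    [0.2, 0.35, 0.95],
]
# The algorithm is never told "dog" or "cat"; it just discovers two groups.
groups = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(observations)
print(groups)  # e.g. [1 1 0 0] -- two discovered clusters, unnamed
```

Notice that the output is just group numbers; nothing in the data ever says “dog” or “cat.”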

The example is this: instead of trying to look at a face and understand whether that face is happy or sad or angry, I want you to think about the problem of trying to understand, when this person is happy, what context actually makes that person happy. One possible set of clues might be a set of circumstances like this. If you look at the A versus B column, perhaps this person is happy when they are wearing bright clothing or when they are messing around with props. That’s one possibility given the data that I see here. If I look at the second column, perhaps this person is happy when they are cleaning the kitchen or cooking a meal, or perhaps this person is happy when they just left the coffee shop with their favorite coffee or they are making coffee at home. So perhaps they are a coffee lover, and so on and so forth.

Unsupervised ML

So with enough clues, we can identify these patterns, these combinations, that are normal for a specific user. We’re not explicitly defining a specific set of things for the algorithm to look at, like we were when we were saying this is happy, this is sad, or this is angry. Instead, we’re giving the algorithm a lot of data and letting it learn the context for normal. The reason this matters is that building an understanding of normal patterns of behavior is very important for anomaly detection. And anomaly detection within the space of unsupervised learning is probably the most important technique for cybersecurity today.
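Since anomaly detection is the key technique here, a minimal sketch may help. The idea is to fit a detector on unlabeled observations of normal context and then ask it to flag anything that doesn’t fit; the two-feature observations are invented and scikit-learn is assumed.

```python
# Sketch of unsupervised anomaly detection: learn "normal", flag deviations.
from sklearn.ensemble import IsolationForest

# Unlabeled observations of normal context, e.g. (hour of activity, MB sent).
normal = [[9, 1], [10, 1], [9, 2], [11, 1], [10, 2], [9, 1], [10, 1], [11, 2]]
detector = IsolationForest(random_state=0).fit(normal)

print(detector.predict([[10, 1]]))  # expected:  1 (fits the learned normal)
print(detector.predict([[3, 90]]))  # expected: -1 (anomalous, doesn't fit)
```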

So one way of thinking about this is to see “learning by example” as a classroom environment and “learning by observation” as learning by being out in the real world. In a classroom, you’re told the answers and you’re learning based on information the teacher gives you; you’re building up your learning model from that. That’s different from just wandering around the streets picking up observations and learning, “Hey, this is the normal social behavior for Harry. This is what makes Harry happy. He likes to drink coffee, and Harry enjoys cleaning.” That’s the difference between supervised and unsupervised.

Another way of thinking about it is to map it back to the original dataset. Let’s imagine that we had the same set of images as before (dogs, cats, and food products) but we didn’t give any labels, so there are no names and nothing that tells the algorithm that this is a food and this is an animal. If, instead of looking at this as a supervised learning problem like the original slide, we thought of this as unsupervised, what are the things that an algorithm might automatically define and discover on its own? This is what I mean by learning by observation. An unsupervised algorithm will learn these patterns of normal behavior. You can see we’ve sort of limited it to the visual domain (all of these clusters are visually similar), but you can imagine that if there were other inputs, related to the sound an object makes or whether the object moves or not, it might come up with different clusters as well. But that’s the idea. It’s coming up with patterns of common behaviors in order to build an understanding of what the normal behavioral patterns are for each and every item inside the dataset.

Unsupervised Patterns

So, why is this important? It’s important for cybersecurity because the technique and the set of AI technologies that you choose matter. They matter a lot, and they matter a lot based on the use case that you’re trying to solve. Think about malware. Malware is perfect for supervised machine learning methods. When you evaluate a company that uses AI to solve a problem like malware, you should think about the data, or ask about the data set. In this case, the data set for malware is huge. We’ve got decades of malware and decades of binaries that we can learn from, and we know whether something is malware or not. So you can derive information that can be learned in a model on a separate, very large set of malware examples, build those models in the lab, and then deploy them super effectively inside your environment. In other words, the data set is big and it generalizes really well. If the malware looks the same in your environment as it does in this data set, then the model works perfectly.
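As a sketch of that “build in the lab, deploy in the field” pattern: train a classifier on a labeled corpus, serialize it, and ship the frozen model. The features (hypothetically file entropy, import count, and a packed flag) and labels are invented, not a real malware corpus; scikit-learn is assumed.

```python
# Sketch: supervised malware detection trained in the lab, deployed as-is.
import pickle
from sklearn.ensemble import RandomForestClassifier

# Invented features per binary: [entropy, import count, packed flag]
X = [[7.9, 3, 1], [7.5, 2, 1], [4.1, 40, 0], [3.8, 55, 0]]
y = [1, 1, 0, 0]  # 1 = known malware, 0 = known benign (the labels)

lab_model = RandomForestClassifier(random_state=0).fit(X, y)

# Serialize in the lab, load in the customer's environment unchanged.
deployed = pickle.loads(pickle.dumps(lab_model))
print(deployed.predict([[7.7, 4, 1]]))  # expected: [1], malware-like
```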

However, when you think about other use cases like insider threat, that is a perfect example of unsupervised machine learning. That is because one nice way of identifying insider threats, or attacks that might be external (like compromised accounts), without having inside evidence, is to look for things like an account that is normally never active after midnight but all of a sudden is, which is unusual, or an account that just sent 500 megabytes in an email but has never done that before, and so on. These are all examples of patterns of normal behavior that can be derived from data but don’t require any labels, because there are no labels in insider threat. No one has a large labeled data set like we have with malware, where someone hands you 30 days of Active Directory logs and says, “there are exactly five insider threats, and here they are.” Those data sets are very, very difficult to come by. So, problem number one is that we don’t have a large dataset with a large number of labels. That already tells us we should probably think about unsupervised methods.
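A minimal sketch of that kind of label-free baselining, using only the Python standard library (the traffic numbers and the three-standard-deviation threshold are invented for illustration):

```python
# Sketch: per-account baseline with no labels, flagging deviations from normal.
import statistics

# Megabytes emailed per day by one account over recent history (invented).
history = [2.1, 3.0, 1.8, 2.5, 2.2, 2.9, 2.4, 3.1, 2.0, 2.6]
mean = statistics.mean(history)
stdev = statistics.stdev(history)

def is_unusual(megabytes: float, threshold: float = 3.0) -> bool:
    """Flag a value more than `threshold` standard deviations away from
    this account's own learned normal."""
    return abs(megabytes - mean) / stdev > threshold

print(is_unusual(2.7))    # False: within this account's normal pattern
print(is_unusual(500.0))  # True: the 500 MB email that's never happened before
```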

Problem number two is that even if we did have such a data set, would the models that learn from it generalize to your environment? For pretty much all insider threats, the answer is no. The person who stole data from a bank or from Sony is going to manifest differently inside your employee base. Models trained on one dataset aren’t going to generalize inside your department, so you need a technique that leverages the data you’ve already got but doesn’t require labels. Thinking about the dataset and thinking about the use case is really the best way to evaluate the real, effective AI versus the AI that won’t be as effective for your particular use case.

That’s the general idea. There are lots of other things that you can drill into. Of course, you can ask about algorithms, and you should be nervous about anyone who is unwilling to share openly about the algorithms they use. Of course, you should ask about reference customers and how widely deployed the technology is. That’s very important, because nothing speaks like data and technology that are already proven. But from an AI perspective, my suggestion is that the best thing you can do is ask about the use case: ask how the particular type of AI or ML in this technology applies to the use case you have in mind. Then, with an understanding of supervised and unsupervised, ask about the data set that is tied to it. What data do these models learn from, and how does that apply to my environment? How do you refresh that information? Just asking those two questions about the data (comparing that to your use case) and about supervised versus unsupervised will already put you ahead of the vast majority of people who are struggling in this space.