Using Apache Spark to predict attack vectors among billions of users and trillions of events

The O’Reilly Data Show podcast: Fang Yu on data science in security, unsupervised learning, and Apache Spark.

Subscribe to the O’Reilly Data Show Podcast to explore the opportunities and techniques driving big data and data science: Stitcher, TuneIn, iTunes, SoundCloud, RSS.


In this episode of the O’Reilly Data Show, I spoke with Fang Yu, co-founder and CTO of DataVisor. We discussed her days as a researcher at Microsoft, the application of data science and distributed computing to security, and hiring and training data scientists and engineers for the security domain.

DataVisor is a startup that uses data science and big data to detect fraud and malicious users across many different application domains in the U.S. and China. Founded by security researchers from Microsoft, the startup has developed large-scale unsupervised algorithms on top of Apache Spark, to (as Yu notes in our chat) “predict attack vectors early among billions of users and trillions of events.”

Several years ago, I found myself immersed in the security space, and at that time tools that employed machine learning and big data were still rare. More recently, with the rise of tools like Apache Spark and Apache Kafka, I’m starting to come across many more security professionals who incorporate large-scale machine learning and distributed systems into their software platforms and consulting practices.

Below are some highlights from our conversation:

Unsupervised learning for detecting fraudulent users and behavior

Let me step back a little bit and explain how traditional solutions identify bad accounts or bad behavior. Traditionally, the typical solution is rule-based. For example, a user may not be allowed to register and then immediately start transferring money or sending a lot of email. That behavior is bad, so you write a rule based on that. But a rule-based solution is very reactive. You need to observe what attackers are doing and then, based on that, derive expert rules. Rule-based systems are hard to maintain and are always late because a human needs to observe the bad behavior before writing the rules. Nowadays, a rule-based system is one solution, but a lot of online services are moving to a machine learning-based solution. They have some labeled examples of bad behavior and then they train a model on them.
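To make the rule-based approach concrete, here is a minimal Python sketch of the kind of hand-written rule described above. It is an illustration only, not code from DataVisor or any client: the `Event` fields, the action names, and the one-hour threshold are all hypothetical, and the rule ignores volume (for example, how many emails were sent) to stay short.

```python
from dataclasses import dataclass

# Hypothetical threshold: accounts younger than this are considered "new".
MIN_ACCOUNT_AGE_SECONDS = 3600

# Hypothetical action names for the behaviors mentioned in the interview.
RISKY_ACTIONS = {"transfer_money", "send_email"}


@dataclass
class Event:
    user_id: str
    action: str          # e.g. "transfer_money", "send_email", "login"
    registered_at: int   # account registration time, Unix seconds
    timestamp: int       # time of this event, Unix seconds


def violates_new_account_rule(event: Event) -> bool:
    """Flag a risky action taken too soon after the account was registered."""
    account_age = event.timestamp - event.registered_at
    return event.action in RISKY_ACTIONS and account_age < MIN_ACCOUNT_AGE_SECONDS
```

A rule like this only catches the pattern someone has already observed and encoded. An attacker who waits slightly longer than the threshold slips past it until a human notices and rewrites the rule, which is the reactive cycle described above.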


Discover unknown attacks without requiring labels or training data. Source: Fang Yu, used with permission.

At DataVisor, we developed a brand new solution, which is unsupervised. We do not require clients to give us labeled data. In our approach, we do not only look at a single user’s behavior. We put all the users together and study the correlations between them: how users link to each other and how similar their actions are. Nowadays, attackers do not have a single bad account. They usually have tens of accounts, hundreds, even millions of accounts. Using these accounts, they can send spam, they can generate “likes,” they can make transactions. These accounts usually have high correlations among them because they’re controlled by robots or by trained people. So we look at the user-user correlation.
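The interview does not describe DataVisor's actual algorithm, so the following Python sketch is only a toy illustration of the general idea of looking at user-user correlation without labels. It assumes each account is summarized as a set of hypothetical feature strings (such as an IP address or an action type), links two accounts when the Jaccard similarity of their feature sets passes a hypothetical threshold, and reports connected groups of linked accounts. The function names, the threshold, and the minimum group size are all assumptions made for this example.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def correlated_groups(features: dict, threshold: float = 0.8, min_size: int = 3) -> list:
    """Group accounts whose feature sets are highly similar.

    `features` maps an account ID to a set of feature strings.
    Returns a list of sets of account IDs, one per suspicious group.
    """
    # Union-find structure: every account starts in its own group.
    parent = {user: user for user in features}

    def find(user):
        # Follow parent links to the group root, compressing the path as we go.
        while parent[user] != user:
            parent[user] = parent[parent[user]]
            user = parent[user]
        return user

    # Link every pair of accounts that behaves almost identically.
    for u, v in combinations(features, 2):
        if jaccard(features[u], features[v]) >= threshold:
            parent[find(u)] = find(v)

    # Collect accounts by group root and keep only the larger groups.
    groups = {}
    for user in features:
        groups.setdefault(find(user), set()).add(user)
    return [group for group in groups.values() if len(group) >= min_size]
```

No single account in a returned group needs to look bad on its own; the signal is that several accounts behave almost identically. Comparing every pair of accounts is quadratic in the number of users, so this toy version would not scale to billions of users, where a distributed engine and a smarter candidate-pair strategy would be needed.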

An ecosystem that supports attacks across different industrial sectors

Because we look at the account level and how users behave, our engine generalizes well across different sectors. We have clients in social media, mobile gaming, and we’re also working with a client in financial services. The reason that our engine can work across different sectors is that we look at the notion of accounts and the underground ecosystem that supports massive attacks on different services [and which can] have the same set of people. Some people specialize in registering bad accounts, some people specialize in stealing credit cards, and some people specialize in writing templates, etc. So, there is an underground ecosystem in the tools they use, the data centers that they use, the VPNs they use. There are a lot of commonalities across different sectors.

Apache Spark

We have clients that send us billions of events per day, so it’s a huge amount of data, and you want to find a small number of bad users. It’s like finding a needle in a haystack without any labels. It’s very challenging. There are also a lot of social network elements associated with security. Some attackers want to actively friend other users because the more users they friend, the more they can spam, etc. The resulting graphs can be massive.

One of our founding members also came from Berkeley and he used Spark before; when we wanted to scale the system, Spark was a very natural choice. We have had a very positive experience. Spark is very easy to use and it has a great community; it helped us scale our system pretty well.

Note: Fang Yu’s frequent collaborator and DataVisor co-founder Yinglian Xie will speak about “Leveraging Apache Spark to analyze billions of user actions to reveal hidden fraudsters” at Strata + Hadoop World in San Jose this March.


Main art image by Wald1siedel on Wikimedia Commons.
