How to implement a security data lake

Practical tips for centralizing security data.

Information security has been dealing with terabytes of data for more than a decade, almost two. The benefits of having more data available span many use cases, from forensic investigations to proactively finding anomalies and stopping adversaries before they cause harm.

But let’s be realistic. You probably have numerous repositories for your security data. Your Security Information and Event Management (SIEM) solution doesn’t scale to the volumes of data that you would really like to collect. This, in turn, makes it hard to use all of your data for any kind of analytics. It’s likely that your tools have to operate on multiple, disconnected data stores that have very different capabilities for data access and analysis. Even worse, during an incident, how many different consoles do you have to touch before you get the complete picture of what has happened? I would guess probably at least four (I would have said 42, but that seemed a bit excessive).

When talking to your peers about this problem, do they tell you to implement Hadoop to deal with the huge data volumes? But what does that really mean — is Hadoop really the solution? After all, Hadoop is a pretty complex ecosystem of tools that requires skilled and expensive people to implement and maintain.

How to centralize all of your security data

A central goal of the data lake is to get rid of data duplication by collecting, cleaning, and enriching data just once — and then making it available through a standard interface to all the tools and products that need it. Of course, this is all much easier said than done.

In the latest O’Reilly report, The Security Data Lake: Leveraging Big Data Technologies to Build a Common Data Repository for Security, I address how to use modern big data technologies to centralize all of your security data in one place. I explore the issues and approaches surrounding the lake, and what it means for third parties to leverage this shiny new data store. This free report also explains how Hadoop plays with other technologies to enable big data analytics and the discovery of insights.


Components of a data lake. Image courtesy of Raffael Marty.

Download the free report to learn more, including:

  • Five questions to consider before choosing an architecture for your back-end data store
  • Four methods for embedding your SIEM into a data lake
  • Options for storing context and unstructured log data
  • Processes necessary for ingesting data into a data lake, including parsing, enrichment, and aggregation
  • Data access use cases, covering both search and analytical queries via SQL
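The ingestion steps listed above — parsing, enrichment, and aggregation — can be sketched in a few lines of Python. This is only an illustration: the log format, field names, and the asset-owner lookup table below are all invented, not taken from the report.

```python
import re
from collections import Counter

# Hypothetical raw log lines, as they might arrive from a source system.
RAW_LOGS = [
    "2015-04-01T10:00:01 10.0.0.5 LOGIN_FAILED alice",
    "2015-04-01T10:00:03 10.0.0.5 LOGIN_FAILED alice",
    "2015-04-01T10:00:07 10.0.0.9 LOGIN_OK bob",
]

# Hypothetical context data used for enrichment.
ASSET_OWNER = {"10.0.0.5": "finance", "10.0.0.9": "engineering"}

LOG_RE = re.compile(
    r"(?P<ts>\S+)\s+(?P<src_ip>\S+)\s+(?P<event>\S+)\s+(?P<user>\S+)"
)

def parse(line):
    """Turn one raw log line into a structured record."""
    m = LOG_RE.match(line)
    return m.groupdict() if m else None

def enrich(record):
    """Attach context (asset ownership) to a parsed record."""
    record["owner"] = ASSET_OWNER.get(record["src_ip"], "unknown")
    return record

def aggregate(records):
    """Count events per (event type, owner) pair."""
    return Counter((r["event"], r["owner"]) for r in records)

records = [enrich(r) for r in (parse(line) for line in RAW_LOGS) if r]
summary = aggregate(records)
print(summary[("LOGIN_FAILED", "finance")])  # 2
```

The point of doing this work once, at ingestion time, is that every downstream tool sees the same cleaned, enriched records instead of re-parsing the raw logs itself.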

How to choose an architecture for your back-end data store

Understanding how your data is going to be used is critical to choosing the right architecture for the back-end data store. The report walks through each of the following five key questions to consider:

  1. How much data do we have in total?
  2. How fast does the data need to be ready?
  3. How much data do we query at a time, and how often do we query?
  4. Where is the data located, and where does it come from?
  5. What do we want to do with the data, and how do we access it?
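For the first question, a back-of-envelope sizing calculation is often enough to rule architectures in or out. All the input numbers below are invented; substitute your own measurements.

```python
# Assumed inputs -- replace with figures measured in your environment.
events_per_second = 20_000   # peak EPS across all sources
avg_event_bytes = 500        # average raw event size
retention_days = 365

# Daily ingest volume in gigabytes, and total retained volume in terabytes.
daily_gb = events_per_second * avg_event_bytes * 86_400 / 1e9
total_tb = daily_gb * retention_days / 1000

print(round(daily_gb), round(total_tb, 1))  # 864 315.4
```

Even this rough arithmetic shows why question 1 matters: hundreds of terabytes of retained data push you toward a distributed store, while a few terabytes may not.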

The discussion of access paradigms is organized into groups, including search, analytics (record-based, relationships, and data mining), raw data access, real-time statistics, and real-time correlation.
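To make two of these paradigms concrete, the sketch below contrasts a search-style lookup (find everything about one entity) with an analytical aggregation (summarize across all records), both expressed as SQL against an in-memory SQLite table. The schema and data are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (ts TEXT, src_ip TEXT, action TEXT, bytes INTEGER)"
)
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?, ?)",
    [
        ("2015-04-01T10:00:01", "10.0.0.5", "DENY", 0),
        ("2015-04-01T10:00:02", "10.0.0.5", "ALLOW", 1200),
        ("2015-04-01T10:00:03", "10.0.0.9", "ALLOW", 300),
    ],
)

# Search: needle-in-haystack lookup for a single source IP.
hits = conn.execute(
    "SELECT ts, action FROM events WHERE src_ip = ?", ("10.0.0.5",)
).fetchall()

# Analytics: scan the whole table and aggregate bytes per source IP.
totals = dict(
    conn.execute("SELECT src_ip, SUM(bytes) FROM events GROUP BY src_ip")
)
print(len(hits), totals["10.0.0.5"])  # 2 1200
```

The two query shapes stress a back end very differently — search engines excel at the first, columnar stores at the second — which is why the access paradigm drives the architecture choice.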

The future of security monitoring

Collecting data in a central repository is a great starting point for moving your security monitoring program into the future, to find threats and attacks before they cause harm. But it's just the first step. How you actually process the data afterward, and the methods you use to gain insights, are the far more exciting endeavor.


Example visual analytics display. Image courtesy of pixlcloud.

Visual analytics is a method to help analysts quickly understand the data and find areas of interest. It's a process that requires a scalable and flexible data back end: it is not uncommon for a specific analytics task to query gigabytes, or even terabytes, of data. Related questions you might find yourself asking include: What are the data access requirements? How can we run data mining algorithms, such as clustering, across all of the data? What kind of data store do we need for that? Do we need a search engine as a back end, or a columnar data store?

Within the report, I cover the concept of a data back end and how it enables a variety of processing and access use cases. By no means is the data lake a replacement for your SIEM, but it's a great addition to help move your security organization into the future.
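As a toy illustration of the clustering question, the pure-Python sketch below groups hosts by two invented features (connection count, bytes transferred) with a minimal k-means loop. A real deployment would run such algorithms inside the data lake over far larger volumes; this only shows the idea of letting an outlier fall into its own cluster.

```python
import random

random.seed(0)

# Hypothetical per-host feature vectors: (connection count, bytes transferred).
hosts = {
    "10.0.0.5": (5, 100.0),
    "10.0.0.6": (6, 120.0),
    "10.0.0.9": (400, 90_000.0),  # stands out from the other two
}

def kmeans(points, k, iters=10):
    """A minimal k-means: assign points to nearest center, recompute means."""
    centers = random.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda i: (p[0] - centers[i][0]) ** 2
                + (p[1] - centers[i][1]) ** 2,
            )
            clusters[nearest].append(p)
        centers = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
    return centers, clusters

centers, clusters = kmeans(list(hosts.values()), k=2)
sizes = sorted(len(c) for c in clusters)
print(sizes)  # [1, 2] -- the outlier host ends up alone
```

In practice you would reach for a library and a distributed execution engine rather than hand-rolled k-means, but the data-access pattern is the same: the algorithm needs to scan all of the feature data, which is exactly what the back-end store must support.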

You can download the free report here.

How have you implemented a security data lake? I’d love to hear your experience. Email me at
