"Big Data Tools and Pipelines" entries

Using Apache Spark to predict attack vectors among billions of users and trillions of events

The O’Reilly Data Show podcast: Fang Yu on data science in security, unsupervised learning, and Apache Spark.

Subscribe to the O’Reilly Data Show Podcast to explore the opportunities and techniques driving big data and data science: Stitcher, TuneIn, iTunes, SoundCloud, RSS.


In this episode of the O’Reilly Data Show, I spoke with Fang Yu, co-founder and CTO of DataVisor. We discussed her days as a researcher at Microsoft, the application of data science and distributed computing to security, and hiring and training data scientists and engineers for the security domain.

DataVisor is a startup that uses data science and big data to detect fraud and malicious users across many different application domains in the U.S. and China. Founded by security researchers from Microsoft, the startup has developed large-scale unsupervised algorithms on top of Apache Spark, to (as Yu notes in our chat) “predict attack vectors early among billions of users and trillions of events.”

Several years ago, I found myself immersed in the security space and at that time tools that employed machine learning and big data were still rare. More recently, with the rise of tools like Apache Spark and Apache Kafka, I’m starting to come across many more security professionals who incorporate large-scale machine learning and distributed systems into their software platforms and consulting practices.

Read more…


Building systems for massive scale data applications

The O’Reilly Data Show podcast: Tyler Akidau on the evolution of systems for bounded and unbounded data processing.

Subscribe to the O’Reilly Data Show Podcast to explore the opportunities and techniques driving big data and data science.


Many of the open source systems and projects we’ve come to love — including Hadoop and HBase — were inspired by systems used internally within Google. These systems were described in papers and implemented by people who needed frameworks that could comfortably scale to massive data sets.

Google engineers and scientists continue to publish interesting papers, and these days some of the big data systems they describe in publications are available on their cloud platform.

In this episode of the O’Reilly Data Show, I sat down with Tyler Akidau one of the lead engineers in Google’s streaming and Dataflow technologies. He recently wrote an extremely popular article that provided a framework for how to think about bounded and unbounded data processing (a follow-up article is due out soon). We talked about the evolution of stream processing, the challenges of building systems that scale to massive data sets, and the recent surge in interest in all things real time:

On the need for MillWheel: A new stream processing engine

At the time [that MillWheel was built], there was, as far as I know, literally nothing externally that could handle the scale that we needed to handle. A lot of the existing streaming systems didn’t focus on out-of-order processing, which was a big deal for us internally. Also we really wanted to hit a strong focus on consistency — being able to get absolutely correct answers. … All three of these things were lacking in at least some area in [the systems we examined].

Read more…


Specialized and hybrid data management and processing engines

A new crop of interesting solutions for the complexity of operating multiple systems in a distributed computing setting.


The 2004 holiday shopping season marked the start of Amazon’s investigation into alternative database technologies that led to the creation of DynamoDB — a key-value storage system that went onto inspire several NoSQL projects.

A new group of startups began shifting away from the general-purpose systems favored by companies just a few years earlier. In recent years, we’ve seen a diverse set of DBMS technologies that specialize in handling particular workloads and data models such as OLTP, OLAP, search, RDF, XML, scientific applications, etc. The success and popularity of such systems reinforced the belief that in order to scale and “go fast,” specialized systems are preferable.

In distributed computing, the complexity of maintaining and operating multiple specialized systems has recently led to systems that bridge multiple workloads and data models. Aside from multi-model databases, there are an emerging number of storage and compute engines adept at handling different workloads and problems. At this week’s Strata + Hadoop World conference in NYC, I had a chance to interact with the creators of some of these new solutions.

OLTP (transactions) and OLAP (analytics)

One of the key announcements at Strata + Hadoop World this week was Project Kudu — an open source storage engine that’s good at both table scans (analytics) and random access (updates and inserts). Its creators are quick to point out that they aren’t out to beat specialized OLTP and OLAP systems. Rather, they’re shooting to build a system that’s “70-80% of the way there on both axes.” The project is very young and lacks enterprise features, but judging from the reaction at the conference, it’s something the big data community will be watching. Leading technology research firms have created a category for systems with related capabilities:  HTAP (Gartner) and Trans-analytics (Forrester).

Read more…

Comment: 1

Build better machine learning models

A beginner's guide to evaluating your machine learning models.


Everything today is being quantified, measured, and tracked — everything is generating data, and data is powerful. Businesses are using data in a variety of ways to improve customer satisfaction. For instance, data scientists are building machine learning models to generate intelligent recommendations to users so that they spend more time on a site. Analysts can use churn analysis to predict which customers are the best targets for the next promotional campaign. The possibilities are endless.

However, there are challenges in the machine learning pipeline. Typically, you build a machine learning model on top of your data. You collect more data. You build another model. But how do you know when to stop?

When is your smart model smart enough?

Evaluation is a key step when building intelligent business applications with machine learning. It is not a one-time task, but must be integrated with the whole pipeline of developing and productionizing machine learning-enabled applications.

In a new free O’Reilly report Evaluating Machine Learning Models: A Beginner’s Guide to Key Concepts and Pitfalls, we cut through the technical jargon of machine learning, and elucidate, in simple language, the processes of evaluating machine learning models. Read more…


Three best practices for building successful data pipelines

Reproducibility, consistency, and productionizability let data scientists focus on the science.

Piping_EngineeringBuilding a good data pipeline can be technically tricky. As a data scientist who has worked at Foursquare and Google, I can honestly say that one of our biggest headaches was locking down our Extract, Transform, and Load (ETL) process.

At The Data Incubator, our team has trained more than 100 talented Ph.D. data science fellows who are now data scientists at a wide range of companies, including Capital One, the New York Times, AIG, and Palantir. We commonly hear from Data Incubator alumni and hiring managers that one of their biggest challenges is also implementing their own ETL pipelines.

Drawn from their experiences and my own, I’ve identified three key areas that are often overlooked in data pipelines, and those are making your analysis:

  1. Reproducible
  2. Consistent
  3. Productionizable

While these areas alone cannot guarantee good data science, getting these three technical aspects of your data pipeline right helps ensure that your data and research results are both reliable and useful to an organization. Read more…

Comments: 6

Showcasing the real-time processing revival

Tools and learning resources for building intelligent, real-time products.

Earth orbiting sun illustration

Register for Strata + Hadoop World NYC, which will take place September 29 to Oct 1, 2015.

A few months ago, I noted the resurgence in interest in large-scale stream-processing tools and real-time applications. Interest remains strong, and if anything, I’ve noticed growth in the number of companies wanting to understand how they can leverage the growing number of tools and learning resources to build intelligent, real-time products.

This is something we’ve observed using many metrics, including product sales, the number of submissions to our conferences, and the traffic to Radar and newsletter articles.

As we looked at putting together the program for Strata + Hadoop World NYC, we were excited to see a large number of compelling proposals on these topics. To that end, I’m pleased to highlight a strong collection of sessions on real-time processing and applications coming up at the event. Read more…


The world beyond batch: Streaming 101

A high-level tour of modern data-processing concepts.


Editor’s note: This is the first post in a two-part series about the evolution of data processing, with a focus on streaming systems, unbounded data sets, and the future of big data. See part two.

Streaming data processing is a big deal in big data these days, and for good reasons. Amongst them:

  • Businesses crave ever more timely data, and switching to streaming is a good way to achieve lower latency.
  • The massive, unbounded data sets that are increasingly common in modern business are more easily tamed using a system designed for such never-ending volumes of data.
  • Processing data as they arrive spreads workloads out more evenly over time, yielding more consistent and predictable consumption of resources.

Despite this business-driven surge of interest in streaming, the majority of streaming systems in existence remain relatively immature compared to their batch brethren, which has resulted in a lot of exciting, active development in the space recently.

As someone who’s worked on massive-scale streaming systems at Google for the last five+ years (MillWheel, Cloud Dataflow), I’m delighted by this streaming zeitgeist, to say the least. I’m also interested in making sure that folks understand everything that streaming systems are capable of and how they are best put to use, particularly given the semantic gap that remains between most existing batch and streaming systems. To that end, the fine folks at O’Reilly have invited me to contribute a written rendition of my Say Goodbye to Batch talk from Strata + Hadoop World London 2015. Since I have quite a bit to cover, I’ll be splitting this across two separate posts:

  1. Streaming 101: This first post will cover some basic background information and clarify some terminology before diving into details about time domains and a high-level overview of common approaches to data processing, both batch and streaming.
  2. The Dataflow Model: The second post will consist primarily of a whirlwind tour of the unified batch + streaming model used by Cloud Dataflow, facilitated by a concrete example applied across a diverse set of use cases. After that, I’ll conclude with a brief semantic comparison of existing batch and streaming systems.

So, long-winded introductions out of the way, let’s get nerdy. Read more…

Comments: 17

6 reasons why I like KeystoneML

The O'Reilly Data Show Podcast: Ben Recht on optimization, compressed sensing, and large-scale machine learning pipelines.


As we put the finishing touches on what promises to be another outstanding Hardcore Data Science Day at Strata + Hadoop World in New York, I sat down with my co-organizer Ben Recht for the the latest episode of the O’Reilly Data Show Podcast. Recht is a UC Berkeley faculty member and member of AMPLab, and his research spans many areas of interest to data scientists including optimization, compressed sensing, statistics, and machine learning.

At the 2014 Strata + Hadoop World in NYC, Recht gave an overview of a nascent AMPLab research initiative into machine learning pipelines. The research team behind the project recently released an alpha version of a new software framework called KeystoneML, which gives developers a chance to test out some of the ideas that Recht outlined in his talk last year. We devoted a portion of this Data Show episode to machine learning pipelines in general, and a discussion of KeystoneML in particular.

Since its release in May, I’ve had a chance to play around with KeystoneML and while it’s quite new, there are several things I already like about it:

KeystoneML opens up new data types

Most data scientists don’t normally play around with images or audio files. KeystoneML ships with easy to use sample pipelines for computer vision and speech. As more data loaders get created, KeystoneML will enable data scientists to leverage many more new data types and tackle new problems. Read more…

Comments: 2

The original big data industry

Oil and gas exploration have long been at the forefront of data collection and analysis.

Download our new free report, “Oil, Gas, and Data: High-Performance Data Tools in the Production of Industrial Power,” looking at the role of data, machine learning, and predictive analytics in oil and gas exploration.

Petroleum extraction is an industry marked by price volatility and high capital exposure in new ventures. Big data is reducing risk, not just to capital, but to workers and the environment as well, as Dan Cowles explores in the new free report Oil, Gas, and Data.

At the Global Petroleum Show in Calgary, exhibiting alongside massive drill heads, chemical analysts, and the latest in valves and pipes are companies with a decidedly more virtual product: data. IBM’s Aspera, Abacus Datagraphics, Fujitsu, and Oracle’s Front Porch Digital are pitching data intake, analysis, and storage services to the oil industry, and industry stalwarts such as Halliburton, Lockheed Martin, and BP have been developing these capacities in-house.

The primary benefits of big data occur at the upstream end of petroleum production: exploration, discovery, and drilling. Better analysis of seismic and other geological data allows for drilling in more productive locations, and continual monitoring of equipment results in more uptime and better safety for both workers and environment. These marginal gains can be enough to keep an entire region competitive: the trio of cheap sensors, fast networks, and distributed computation that we’ve so often seen in other industries is the difference-maker keeping the North Sea oilfields productive in sub-$100/barrel market. Read more…


Why data preparation frameworks rely on human-in-the-loop systems

The O'Reilly Data Show Podcast: Ihab Ilyas on building data wrangling and data enrichment tools in academia and industry.


As I’ve written in previous posts, data preparation and data enrichment are exciting areas for entrepreneurs, investors, and researchers. Startups like Trifacta, Tamr, Paxata, Alteryx, and CrowdFlower continue to innovate and attract enterprise customers. I’ve also noticed that companies — that don’t specialize in these areas — are increasingly eager to highlight data preparation capabilities in their products and services.

During a recent episode of the O’Reilly Data Show Podcast, I spoke with Ihab Ilyas, professor at the University of Waterloo and co-founder of Tamr. We discussed how he started working on data cleaning tools, academic database research, and training computer science students for positions in industry.

Academic database research in data preparation

Given the importance of data integrity, it’s no surprise that the database research community has long been interested in data preparation and data wrangling. Ilyas explained how his work in probabilistic databases led to research projects in data cleaning:

In the database theory community, these problems of handling, dealing with data inconsistency, and consistent query answering have been a celebrated area of research. However, it has been also difficult to communicate these results to industry. And database practitioners, if you like, they were more into the well-structured data and assuming a lot of good properties around this data, [and they were also] more interested in indexing this data, storing it, moving it from one place to another. And now, dealing with this large amount of diverse heterogeneous data with tons of errors, sidled across all business units in the same enterprise became a necessity. You cannot really avoid that anymore. And that triggered a new line of research for pragmatic ways of doing data cleaning and integration. … The acquisition layer in that stack has to deal with large sets of formats and sources. And you will hear about things like adapters and source adapters. And it became a market on its own, how to get access and tap into these sources, because these are kind of the long tail of data.

The way I came into this subject was also funny because we were talking about the subject called probabilistic databases and how to deal with data uncertainty. And that morphed into trying to find data sets that have uncertainty. And then we were shocked by how dirty the data is and how data cleaning is a task that’s worth looking at.

Read more…

Comment: 1