The Commoditization of Massive Data Analysis

Big Data is a major theme on the O’Reilly Radar, so we’re delighted to welcome guest blogger Joe Hellerstein, a Professor of Computer Science at UC Berkeley whose research focuses on databases and distributed systems. Joe has written a whitepaper with more detail on this topic.

There is a debate brewing among data systems cognoscenti as to the best way to do data analysis at this scale. The old guard in the Enterprise IT camp tends to favor relational databases and the SQL language, while the web upstarts have rallied around the MapReduce programming model popularized at Google, and cloned in open source as Apache Hadoop. Hadoop is in wide use at companies like Yahoo! and Facebook, and gets a lot of attention in tech blogs as the next big open source project. But if you mention Hadoop in a corporate IT shop you are often met with blank stares — SQL is ubiquitous in those environments. There is still a surprising disconnect between these developer communities, but I expect that to change over the next year or two.

We are at the beginning of what I call The Industrial Revolution of Data. We’re not quite there yet, since most of the digital information available today is still individually “handmade”: prose on web pages, data entered into forms, videos and music edited and uploaded to servers. But we are starting to see the rise of automatic data generation “factories” such as software logs, UPC scanners, RFID, GPS transceivers, video and audio feeds. These automated processes can stamp out data at volumes that will quickly dwarf the collective productivity of content authors worldwide. Meanwhile, disk capacities are growing exponentially, so the cost of archiving this data remains modest. And there are plenty of reasons to believe that this data has value in a wide variety of settings. The last step of the revolution is the commoditization of data analysis software, to serve a broad class of users.

To get a glimpse at what that software might look like, consider today’s high-end deployments. There are a few different solutions, but they typically share the core technique of dataflow parallelism. Legions of disk drives are set spinning at once, pumping data through high-speed network interconnects to racks of CPUs, which crunch the text and numbers as they flow by. High-end relational database systems like Teradata have been using this approach for decades, and in the last few years companies like Google and Yahoo! have cranked up new tools to bring this process to a scale never seen before.

There has been a good deal of cheerleading and smoke-blowing about these two approaches to massive data parallelism in recent months. Setting aside the trash talk, the usual cases made for the two technologies can be summarized as follows:

Relational Databases MapReduce (Hadoop)
  • multipurpose: useful for analysis and data update, batch and interactive tasks
  • high data integrity via ACID transactions
  • lots of compatible tools, e.g. for loading, management, reporting, data visualization and mining
  • support for SQL, the most widely-used language for data analysis
  • automatic SQL query optimization, which can radically improve performance
  • integration of SQL with familiar programming languages via connectivity protocols, mapping layers and user-defined functions
  • designed for large clusters: 1000+ computers
  • very high availability, keeping long jobs running efficiently even when individual computers break or slow down
  • data is accessed in “native format” from a filesystem — no need to transform data into tables at load time
  • no special query language; programmers use familiar languages like Java, Python, and Perl
  • programmers retain control over performance, rather than counting on a query optimizer
  • the open-source Hadoop implementation is funded by corporate donors, and will mature over time as Linux and Apache did

This roughly captures the positive arguments being made today. Most of the standard critiques of each side are implicit in the bragging points on the other side. Neither side’s argument is strengthened by their leading player. Hadoop is still relatively young, and by all reports much slower and more resource intensive than Google’s MapReduce implementation. Meanwhile, many people equate high-end relational databases with Oracle, which has developed a reputation for being expensive and difficult to manage, and which does not run on massive clusters of commodity computers the way that many of its high-end competitors do.

Nobody in the data management industry is taking this debate lightly, and the safest prediction is that this landscape will change quickly. The lines between SQL and MapReduce have already begun to blur with the announcement from two database startup companies — Greenplum and Aster Data — that each released a massively parallel system integrating a full-featured SQL engine with MapReduce. Each touts a different mix of features from the MapReduce column: Greenplum supports filesystem access and removes any SQL requirements; Aster emphasizes their availability and scaling features. Meanwhile, the Hive open-source project from Facebook provides a SQL-like programming language layered on top of Hadoop. There are also research projects at both IBM and Microsoft in this space, which could evolve into products or open-source initiatives.

So where is all this headed? In the short term, the churn in the marketplace should drive a much faster pace of innovation than traditional database vendors provided over the last decade. The technical advantages of Hadoop are not intrinsically hard to replicate in a relational database engine; the main challenge will be to manage the expectations of database users when playing tricks like trading off data integrity for availability on certain subsets of the database. Greenplum and Aster will undoubtedly push to stay one step ahead of the bigger database companies, and it would not surprise me to see product announcements on this topic from the more established database vendors within the year.


As part of that evolution, I expect Greenplum, Aster and others to press the case for combining SQL and MapReduce within a single system image. In my own experience helping O’Reilly and LinkedIn use Greenplum MapReduce, I found it handy to have both SQL and MapReduce available. SQL’s built-in support for scalably joining tables makes it an easier and more robust choice for combining multiple data sets, and for writing code to follow links in a graph (e.g., to find related people in a social network). By contrast, MapReduce provides effortless access to open-source libraries from languages like Python, Perl and Java, which are much nicer than native SQL for handling text and arrays.

In the longer term, if MapReduce becomes an important programming interface for data analysis, there will be a need to standardize its programming interfaces, to provide compatibility and smooth the road to wider adoption. Alternatively, it is plausible that MapReduce is too low-level and restrictive for widespread use, and it will be wrapped into a higher-level query language like Hive or Yahoo’s Pig that will need the same kinds of standardization.

An interesting question is how this will interact with the push toward data-centric web services and Cloud computing. Will users stage massive datasets of proprietary information within the Cloud? How will they get petabytes of data shipped and installed at a hosting facility? Given the number of computers required for massive-scale analytics, what kinds of access will service providers be able to economically offer? It’s not clear that anybody has the answers yet in this department.

On a separate note, Hadoop has captured the enthusiasm of educators as a “gateway” to parallel programming. Every computer science major at Berkeley, for example, will write Hadoop MapReduce programs in at least one class (though currently they do so in the educational Scheme language.) Other leading schools have similar programs. Meanwhile, declarative languages — of which SQL is the most successful example — are experiencing a renaissance in computer science research, as I’ve written about elsewhere. So in addition to the changes in the market, the production line for these technologies is being filled at the source. This suggests the possibility of ongoing change over the next many years.

tags: ,