The Commoditization of Massive Data Analysis

Big Data is a major theme on the O’Reilly Radar, so we’re delighted to welcome guest blogger Joe Hellerstein, a Professor of Computer Science at UC Berkeley whose research focuses on databases and distributed systems. Joe has written a whitepaper with more detail on this topic.

There is a debate brewing among data systems cognoscenti as to the best way to do data analysis at massive scale. The old guard in the Enterprise IT camp tends to favor relational databases and the SQL language, while the web upstarts have rallied around the MapReduce programming model popularized at Google, and cloned in open source as Apache Hadoop. Hadoop is in wide use at companies like Yahoo! and Facebook, and gets a lot of attention in tech blogs as the next big open source project. But mention Hadoop in a corporate IT shop and you are often met with blank stares — SQL is ubiquitous in those environments. There is still a surprising disconnect between these developer communities, but I expect that to change over the next year or two.

We are at the beginning of what I call The Industrial Revolution of Data. We’re not quite there yet, since most of the digital information available today is still individually “handmade”: prose on web pages, data entered into forms, videos and music edited and uploaded to servers. But we are starting to see the rise of automatic data generation “factories” such as software logs, UPC scanners, RFID, GPS transceivers, video and audio feeds. These automated processes can stamp out data at volumes that will quickly dwarf the collective productivity of content authors worldwide. Meanwhile, disk capacities are growing exponentially, so the cost of archiving this data remains modest. And there are plenty of reasons to believe that this data has value in a wide variety of settings. The last step of the revolution is the commoditization of data analysis software, to serve a broad class of users.

To get a glimpse at what that software might look like, consider today’s high-end deployments. There are a few different solutions, but they typically share the core technique of dataflow parallelism. Legions of disk drives are set spinning at once, pumping data through high-speed network interconnects to racks of CPUs, which crunch the text and numbers as they flow by. High-end relational database systems like Teradata have been using this approach for decades, and in the last few years companies like Google and Yahoo! have cranked up new tools to bring this process to a scale never seen before.

There has been a good deal of cheerleading and smoke-blowing about these two approaches to massive data parallelism in recent months. Setting aside the trash talk, the usual cases made for the two technologies can be summarized as follows:

Relational Databases:
  • multipurpose: useful for analysis and data update, batch and interactive tasks
  • high data integrity via ACID transactions
  • lots of compatible tools, e.g. for loading, management, reporting, data visualization and mining
  • support for SQL, the most widely-used language for data analysis
  • automatic SQL query optimization, which can radically improve performance
  • integration of SQL with familiar programming languages via connectivity protocols, mapping layers and user-defined functions

MapReduce (Hadoop):
  • designed for large clusters: 1000+ computers
  • very high availability, keeping long jobs running efficiently even when individual computers break or slow down
  • data is accessed in “native format” from a filesystem — no need to transform data into tables at load time
  • no special query language; programmers use familiar languages like Java, Python, and Perl
  • programmers retain control over performance, rather than counting on a query optimizer
  • the open-source Hadoop implementation is funded by corporate donors, and will mature over time as Linux and Apache did
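For readers new to the MapReduce side of the comparison, the programming model itself is simple enough to sketch in a few lines. The following is a single-process illustration of the map, shuffle, and reduce phases (counting words) — not how Hadoop is implemented, but the contract a programmer writes against; the framework's job is to run these same phases in parallel across thousands of machines.

```python
from collections import defaultdict

# A minimal, single-process sketch of the MapReduce programming model.
# Real Hadoop jobs distribute each phase across a cluster of machines.

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every input record."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework does
    between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate the values for each key -- here, a word count."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data big analysis", "data analysis at scale"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts["data"])  # -> 2
```

Because the map and reduce functions are independent per record and per key, the framework can partition the work freely — which is where the fault tolerance and 1000+ machine scaling in the list above come from.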

This roughly captures the positive arguments being made today. Most of the standard critiques of each side are implicit in the bragging points on the other side. Neither side’s argument is strengthened by its leading player. Hadoop is still relatively young, and by all reports much slower and more resource intensive than Google’s MapReduce implementation. Meanwhile, many people equate high-end relational databases with Oracle, which has developed a reputation for being expensive and difficult to manage, and which does not run on massive clusters of commodity computers the way that many of its high-end competitors do.

Nobody in the data management industry is taking this debate lightly, and the safest prediction is that this landscape will change quickly. The lines between SQL and MapReduce have already begun to blur with announcements from two database startups, Greenplum and Aster Data, each of which has released a massively parallel system integrating a full-featured SQL engine with MapReduce. Each touts a different mix of features from the MapReduce column: Greenplum supports filesystem access and removes any SQL requirements; Aster emphasizes its availability and scaling features. Meanwhile, the Hive open-source project from Facebook provides a SQL-like programming language layered on top of Hadoop. There are also research projects at both IBM and Microsoft in this space, which could evolve into products or open-source initiatives.

So where is all this headed? In the short term, the churn in the marketplace should drive a much faster pace of innovation than traditional database vendors provided over the last decade. The technical advantages of Hadoop are not intrinsically hard to replicate in a relational database engine; the main challenge will be to manage the expectations of database users when playing tricks like trading off data integrity for availability on certain subsets of the database. Greenplum and Aster will undoubtedly push to stay one step ahead of the bigger database companies, and it would not surprise me to see product announcements on this topic from the more established database vendors within the year.


As part of that evolution, I expect Greenplum, Aster and others to press the case for combining SQL and MapReduce within a single system image. In my own experience helping O’Reilly and LinkedIn use Greenplum MapReduce, I found it handy to have both SQL and MapReduce available. SQL’s built-in support for scalably joining tables makes it an easier and more robust choice for combining multiple data sets, and for writing code to follow links in a graph (e.g., to find related people in a social network). By contrast, MapReduce provides effortless access to open-source libraries from languages like Python, Perl and Java, which are much nicer than native SQL for handling text and arrays.
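To make that trade-off concrete, here is roughly what joining two data sets looks like when written by hand in the MapReduce style: the programmer tags each record by its source, groups by key, and pairs records up in the reducer. In SQL this is a single JOIN clause, which is why joins are called out above as SQL's strength. The data sets here are hypothetical, and this is a single-process sketch of the "reduce-side join" pattern, not any particular product's implementation.

```python
from collections import defaultdict

# A reduce-side join written by hand, as one would in raw MapReduce.
# In SQL this is one JOIN clause; here we tag, shuffle, and pair up
# records ourselves. Hypothetical inputs: (person_id, name) records
# and (person_id, friend_id) records.

people  = [(1, "ada"), (2, "ben"), (3, "cal")]
friends = [(1, 2), (1, 3), (2, 3)]

def map_tagged():
    """Map: tag each record with its source table so the reducer
    can tell the two inputs apart after the shuffle."""
    for pid, name in people:
        yield (pid, ("person", name))
    for pid, fid in friends:
        yield (pid, ("friend", fid))

def reduce_join(pairs):
    """Shuffle by key, then pair every person record with every
    friend record sharing that key -- the join itself."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    for pid, records in groups.items():
        names = [v for tag, v in records if tag == "person"]
        fids  = [v for tag, v in records if tag == "friend"]
        for name in names:
            for fid in fids:
                yield (name, fid)

joined = sorted(reduce_join(map_tagged()))
print(joined)  # -> [('ada', 2), ('ada', 3), ('ben', 3)]
```

A parallel SQL engine generates, optimizes, and distributes this plumbing automatically — and can pick a better join strategy than the naive one sketched here — while MapReduce leaves it in the programmer's hands, for better and worse.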

In the longer term, if MapReduce becomes an important programming interface for data analysis, there will be a need to standardize its programming interfaces, to provide compatibility and smooth the road to wider adoption. Alternatively, it is plausible that MapReduce is too low-level and restrictive for widespread use, and it will be wrapped into a higher-level query language like Hive or Yahoo’s Pig that will need the same kinds of standardization.

An interesting question is how this will interact with the push toward data-centric web services and Cloud computing. Will users stage massive datasets of proprietary information within the Cloud? How will they get petabytes of data shipped and installed at a hosting facility? Given the number of computers required for massive-scale analytics, what kinds of access will service providers be able to economically offer? It’s not clear that anybody has the answers yet in this department.

On a separate note, Hadoop has captured the enthusiasm of educators as a “gateway” to parallel programming. Every computer science major at Berkeley, for example, will write Hadoop MapReduce programs in at least one class (though currently they do so in the educational Scheme language). Other leading schools have similar programs. Meanwhile, declarative languages — of which SQL is the most successful example — are experiencing a renaissance in computer science research, as I’ve written about elsewhere. So in addition to the changes in the market, the production line for these technologies is being filled at the source. This suggests the possibility of ongoing change over the next many years.


  • MapReduce is not the only approach.
    check out

  • Tom

    What I’ve noticed that’s interesting at Yahoo! is that people do things with data they never would have bothered to do before because it’s easy and available.

    Yahoo! has a number of large grids preloaded with lots of data from the company’s web servers and other sources. Having such a system available means that people are creating reports that they never would have conceived previously.

    MapReduce, and Hadoop/Pig specifically, have really helped us lower the cost of analyzing data to the point where big data isn’t a barrier any more, and people are running the reports they want to run instead of worrying as much about the time/CPU cost.

    I really hope that by contributing to Apache Hadoop we can make this possible for other people, and I look forward to seeing more tools like Hive from some of the other great companies contributing to the project.

  • urca braz

    for a company doing commercial SaaS cloud-based massive parallel processing, see

    similar to Vertica, mentioned in a prior comment, SenSage developed a column-oriented, distributed, redundant DB in 2001-2002. Their tech adviser Stonebraker ripped off the idea to form Vertica.

  • @urca braz sez:
    Their tech adviser Stonebraker ripped off the idea to form Vertica.

    Er, that statement shows a remarkable ignorance of who Mike Stonebraker is and his contributions to the field since the 1970s. A simple search in Google Scholar brings up 1,010 articles he’s listed as a coauthor on.

    Calling Mike a “tech adviser” is like calling Tim O’Reilly a “tech writer” or Larry Lessig a “paralegal.” Have some respect.

  • Jim Stogdill

    Great post Joseph. I’ve been feeling like MapReduce / Hadoop would eventually be important in the enterprise space but was having trouble putting intuition to words. Now I don’t have to. I’ve just been sending this post to everyone I know.

  • Tom’s thoughtful comment deserves a response.

    It’s interesting to think about fostering data-driven culture, both technologically and organizationally. Avinash Kaushik has written and spoken on this topic.

    In terms of technology, a big question is how an organization moves in that direction — especially at massive scale — without the kind of engineering team you have at Yahoo or Google, and without a Fortune 500-quality IT department. My experience talking to customers is that the ability to wrangle Big Data falls off pretty fast when you leave the top tier of employers.

  • Urca Braz

    @Carl Malamud: see page

    “Michael Stonebraker, Ph.D. is a co-founder and serves as Chief Technology Officer of Vertica Systems Inc. and StreamBase Systems, Inc. Mr. Stonebraker serves as Technology Advisor of SenSage, Inc.”

    I apologize, maybe “ripped off” is too strong a word, but SenSage pioneered (many aspects of) Vertica’s platform.

  • Interesting comments on related companies.

    As you may know, Vertica’s solutions are based on storing tables by column, which can lead to compression benefits and reduce I/O costs. That is independent of parallelism and scalability: you can have parallelism without columnar storage, columnar storage without parallelism, or you can have both. Compression can be useful, but it’s a one-time win: it can buy you a nice I/O reduction factor when you implement them (often in the 5-10x range on real-world data), but that’s the end of the fun. Moore’s Law cannot improve your compression rates over time. (Shannon trumps Moore!)

    By contrast, parallelism scales with data growth: Moore’s Law will let you buy 2x the number of machines for the same price 18 months from now, which you can harness if you’re (scalably) parallel. Parallelism is a gift that keeps on giving. So if you’re a forward-looking consumer, demand parallelism, and ask for compression too!
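The arithmetic behind that argument can be sketched with assumed numbers — an 8x one-time compression win (within the 5-10x range mentioned above), a 10-machine starting cluster, and a cluster size that doubles every 18 months at constant budget. None of these figures come from a benchmark; they are illustrative only.

```python
# Back-of-the-envelope comparison (assumed numbers, not benchmarks):
# compression buys a fixed one-time I/O reduction, while parallelism
# compounds as Moore's Law doubles the machines a fixed budget buys.

compression_factor = 8   # assumed one-time I/O reduction
machines = 10            # assumed starting cluster size

def effective_speedup(years, doubling_period_months=18):
    """Relative scan throughput if a budget-constant cluster
    doubles in size every 18 months."""
    doublings = (years * 12) / doubling_period_months
    return machines * (2 ** doublings)

for years in (0, 3, 6):
    parallel = effective_speedup(years)
    combined = parallel * compression_factor
    print(f"year {years}: parallelism x{parallel:.0f}, "
          f"with compression x{combined:.0f}")
```

The compression column multiplies every row by the same constant, while the parallelism column keeps growing — which is the sense in which parallelism is "a gift that keeps on giving" and the two are complementary rather than competing.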

    If you’re interested in database compression techniques, IBM’s Blink research project absolutely knocked my socks off when I saw it demonstrated last year. They’re getting to the theoretical entropy bounds of compression in a working system, focused on main-memory databases. It’s not column-based, by the way. Columns are not the point, compression is — and you should take it however you can get it.

    For what it’s worth, I can also shine some light on the personal relationships here — the database world is pretty small, and a large segment of it centers around Mike Stonebraker and his graduate students and companies over 30+ years. So there tend to be a lot of connections, and they’re mostly friendly, at least in that circle.

    Addamark (now Sensage) was founded by Adam Sah and Mark Searle. Adam was one of Mike Stonebraker’s graduate students at that time at Berkeley. Both Stonebraker and I were Addamark technical advisors. They had some neat technology, and Adam was definitely driving a lot of it. Adam put many of those ideas into the public domain in a paper at Lisa 2002. Actually, the general idea of column storage goes back at least to Sybase IQ in the 1990’s, and likely earlier.

    Meanwhile, to continue the small-world phenomenon, another company mentioned above is LogSavvy, which was founded by the same Mark Searle who co-founded Addamark. I am a technical advisor at LogSavvy. The Chief Architect at LogSavvy, Cimmaron Taylor, was a developer on Stonebraker’s Postgres research project at Berkeley, and an engineer at Addamark.

    I should mention that I am also a former graduate student of Mike Stonebraker’s on the Postgres project. And I’m a technical advisor at Greenplum.

    As you probably know, Postgres led to PostgreSQL, which has been used as a base by a number of commercial players in this space, including Greenplum, Netezza, and Aster Data.

    Finally, to complete the loop: Vertica is based on the CStore research project at MIT, which was advised by Profs. Mike Stonebraker and Sam Madden. Madden, who is a tech advisor at Vertica, did his PhD at Berkeley with Mike Franklin and me. Which makes him Stonebraker’s grandstudent.
    Did you get all that? There are yet more connections into the Hadoop world, but this comment is in danger of being longer than the post…

  • +1

    In terms of chronology, it’s true that Sybase IQ predates Addamark/SenSage, which in turn predates Vertica. Finally, no conversation would be complete without a mention of Google Sawmill/Sawzall, which is roughly contemporaneous with Addamark.

    Many people credit Addamark with drawing attention to logs as an interesting dataset to store and manage (not just analyze and discard), and many companies were partially inspired, likely including LogSavvy, LogLogic, ArcSight Logger and Splunk. Unlike academia, for legal reasons, companies rarely admit to their inspirations. It’s also worth noting that inspiration is only that; these companies have different architectures and sell to different markets where they add many key components and features besides the core data management, e.g. SenSage has an entire application and reporting stack designed for each of its markets.

    Going the other way, it’s helpful to remember that “imitation is the sincerest form of flattery.” After Addamark/SenSage, I helped invent Google Gadgets which spawned a huge number of projects, some which bear the name (e.g. Gadget Ads, Gmail Gadgets) and some of which don’t (e.g. OpenSocial, Yahoo Application Platform, Mapplets, etc.). Many musicians have called their songs “children” and talk about how they live beyond the initial act of writing them. I see software the same way, and I’m personally thrilled to see all the interest.

    best wishes,

  • bob

    I think “White Cross” (unsure) was a UK-based company selling a column-oriented, hardware-based solution before Sybase IQ

  • Anonymous

    See CloudBase:

    It is a data warehouse system built on top of Hadoop’s MapReduce architecture that allows one to query terabytes and petabytes of data using ANSI SQL. It comes with a JDBC driver, so one can use third-party BI tools and reporting frameworks to connect directly to CloudBase.

    CloudBase creates a database system directly on flat files and converts input ANSI SQL expressions into map-reduce programs for processing flat files. It has an optimized algorithm to handle joins and plans to support table indexing in its next release.

  • srikanth

    I would like to study the pros and cons of MapReduce and learn what alternative technologies are available.