Joseph Hellerstein
Joseph M. Hellerstein is a Professor of Computer Science at the University of California, Berkeley. Hellerstein's research focuses on data management and networking, including database systems, sensor networks, declarative networking, peer-to-peer and distributed systems.
In addition to his role in academia, Hellerstein has been a leader in the technology industry. From 2003 to 2005 he was Director of Intel Research, Berkeley, where he led research in networking and query processing for the Internet and for sensor networks. Hellerstein was a co-founder of Cohera Corporation (now part of Oracle), where he served as Chief Scientist from 1998 to 2001. Key ideas from his research have been incorporated into commercial and open-source database systems including IBM's DB2 and Informix, Oracle's PeopleSoft Catalog Management, and the open-source PostgreSQL system. He has also led a number of open-source systems projects at Berkeley, including TelegraphCQ, TinyDB, PIER and P2.
Hellerstein is a jazz enthusiast and part-time trumpeter.
Wed, Nov 19, 2008
The Commoditization of Massive Data Analysis
by Joseph Hellerstein
Big Data is a major theme on the O'Reilly Radar, so we're delighted to welcome guest blogger Joe Hellerstein, a Professor of Computer Science at UC Berkeley whose research focuses on databases and distributed systems. Joe has written a whitepaper with more detail on this topic.
There is a debate brewing among data systems cognoscenti as to the best way to do data analysis at massive scale. The old guard in the Enterprise IT camp tends to favor relational databases and the SQL language, while the web upstarts have rallied around the MapReduce programming model popularized at Google, and cloned in open source as Apache Hadoop. Hadoop is in wide use at companies like Yahoo! and Facebook, and gets a lot of attention in tech blogs as the next big open source project. But if you mention Hadoop in a corporate IT shop you are often met with blank stares -- SQL is ubiquitous in those environments. There is still a surprising disconnect between these developer communities, but I expect that to change over the next year or two.
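To make the contrast between the two camps concrete, here is a minimal sketch of one aggregation written both ways. It is a toy that runs in a single process, not a picture of how Hadoop or any particular database executes a query. The log records and the names `LOG`, `totals_sql`, `map_fn`, `reduce_fn`, and `totals_mapreduce` are all made up for illustration; the SQL side uses the SQLite engine from Python's standard library.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data: (page, bytes_served) log records.
LOG = [("/home", 120), ("/about", 80), ("/home", 200),
       ("/docs", 50), ("/about", 20)]


def totals_sql(records):
    """SQL style: declare the result and let the engine decide how to get it."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE log (page TEXT, bytes INTEGER)")
    conn.executemany("INSERT INTO log VALUES (?, ?)", records)
    rows = conn.execute(
        "SELECT page, SUM(bytes) FROM log GROUP BY page"
    ).fetchall()
    conn.close()
    return dict(rows)


def map_fn(record):
    """MapReduce style, step 1: emit (key, value) pairs for one record."""
    page, nbytes = record
    yield page, nbytes


def reduce_fn(key, values):
    """MapReduce style, step 2: combine all the values seen for one key."""
    return key, sum(values)


def totals_mapreduce(records):
    """Toy single-process driver: map, group by key ("shuffle"), reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return dict(reduce_fn(key, values) for key, values in groups.items())
```

Both functions return the same per-page totals. The difference is in who writes the plan: the SQL version hands a declarative query to the engine, while the MapReduce version has the programmer supply the map and reduce functions and leaves grouping and distribution to the framework.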
We are at the beginning of what I call The Industrial Revolution of Data. We're not quite there yet, since most of the digital information available today is still individually "handmade": prose on web pages, data entered into forms, videos and music edited and uploaded to servers. But we are starting to see the rise of automatic data generation "factories" such as software logs, UPC scanners, RFID, GPS transceivers, video and audio feeds. These automated processes can stamp out data at volumes that will quickly dwarf the collective productivity of content authors worldwide. Meanwhile, disk capacities are growing exponentially, so the cost of archiving this data remains modest. And there are plenty of reasons to believe that this data has value in a wide variety of settings. The last step of the revolution is the commoditization of data analysis software, to serve a broad class of users.
To get a glimpse at what that software might look like, consider today's high-end deployments. There are a few different solutions, but they typically share the core technique of dataflow parallelism. Legions of disk drives are set spinning at once, pumping data through high-speed network interconnects to racks of CPUs, which crunch the text and numbers as they flow by. High-end relational database systems like Teradata have been using this approach for decades, and in the last few years companies like Google and Yahoo! have cranked up new tools to bring this process to a scale never seen before.
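The shape of dataflow parallelism can be sketched in a few lines: split the data into partitions, let each worker scan its own partition and produce a partial result, then merge the partials. What follows is an illustration under loose assumptions, not a description of Teradata's or Google's systems. The function names are hypothetical, the list slices stand in for separate disks, and Python threads are used only to keep the example short (in CPython they will not actually run this CPU-bound work on several cores at once).

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def partition(records, n):
    """Deal records round-robin into n partitions, standing in for n disks."""
    return [records[i::n] for i in range(n)]


def scan_and_count(part):
    """One worker's pipeline: scan its partition and emit partial word counts."""
    counts = Counter()
    for line in part:
        counts.update(line.split())
    return counts


def parallel_word_count(records, workers=4):
    """Run one scan per partition, then merge the partial counts."""
    parts = partition(records, workers)
    total = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(scan_and_count, parts):
            total.update(partial)
    return total
```

Calling `parallel_word_count(lines, workers=8)` on a list of text lines returns a single `Counter` of word frequencies. Because each partition is processed independently until the final merge, the same structure scales by adding partitions and workers, which is the property the high-end systems exploit.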
tags: big data
















