The growing need to manage and make sense of Big Data has led to a surge in demand for analytic databases, a demand many companies are attempting to fill (Teradata, Netezza, Vertica, DATAllegro, Greenplum, Aster Data, Infobright, Kognitio, Kickfire, Dataupia, ParAccel, Exasol, …). As an alternative to current shared-nothing analytic databases, HadoopDB is a hybrid that combines parallel databases with the scalable, fault-tolerant Hadoop/MapReduce framework.
HadoopDB consists of Postgres on each node (the database layer), Hadoop/MapReduce as a communication layer that coordinates the multiple nodes each running Postgres, and Hive as the translation layer. The result is a shared-nothing parallel database that business analysts can interact with using a SQL-like language. [Technical details can be found in the following paper.]
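The core pattern behind this architecture is that as much query work as possible is pushed into the node-local databases, with the MapReduce layer combining partial results. The following is a minimal, illustrative Python sketch of that partition-then-combine pattern; it is not HadoopDB's actual API, and it uses SQLite as a stand-in for the per-node Postgres instances:

```python
# Illustrative sketch (not HadoopDB's actual code): each "node" runs a
# SQL aggregate against its local partition, and a combine step merges
# the partial results, mirroring how work is pushed into Postgres and
# stitched together by MapReduce.
import sqlite3  # stand-in for a node-local Postgres instance

def node_local_counts(rows):
    """The 'map side': run a GROUP BY against one node's local data."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE visits (url TEXT)")
    db.executemany("INSERT INTO visits VALUES (?)", [(r,) for r in rows])
    return dict(db.execute("SELECT url, COUNT(*) FROM visits GROUP BY url"))

def combine(partials):
    """The 'reduce side': merge per-node partial aggregates."""
    totals = {}
    for partial in partials:
        for url, n in partial.items():
            totals[url] = totals.get(url, 0) + n
    return totals

# Two hash-partitioned "nodes", each holding a slice of the table.
partitions = [["/a", "/b", "/a"], ["/b", "/b", "/c"]]
result = combine(node_local_counts(p) for p in partitions)
# result == {"/a": 2, "/b": 3, "/c": 1}
```

The key property this illustrates is that the expensive per-row work happens inside the database layer on each node, so only small partial aggregates cross the network.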
We recently spent an hour discussing Big Data and HadoopDB with Yale CS Professor (and HadoopDB co-creator) Daniel Abadi. One of the main motivations for building HadoopDB was the desire to make available an open source parallel database. While some analytic database vendors have built parallel systems using open source databases (e.g. Aster Data and Greenplum use Postgres), the resulting products aren’t open source.
By taking advantage of Hadoop (particularly HDFS, scheduling, and job tracking), HadoopDB distinguishes itself from many current parallel databases by dynamically monitoring and adjusting for slow nodes and node failures, optimizing performance in heterogeneous clusters. Fault tolerance and the ability to perform in heterogeneous environments are critical, especially in cloud computing environments, where the performance and availability of individual nodes can fluctuate wildly. Given that the performance of current parallel databases scales near-linearly as more nodes are added, vendors strive to develop systems that can be easily deployed on large clusters. Yet current parallel databases have mostly been deployed on systems with fewer than a hundred nodes. The use of Hadoop technology, on the other hand, allows HadoopDB to easily scale to hundreds, if not thousands, of nodes.
Generally speaking, Professor Abadi places HadoopDB somewhere between Hadoop and parallel databases when it comes to the trade-off between load time (data loads are slower than in Hadoop, but faster than in parallel databases) and runtime (on structured data, HadoopDB is faster than Hadoop but slower than parallel databases). Below are some graphs from a series of tests conducted by the HadoopDB team:
Performance on Data Loads
Performance on Analytic Tasks
In our report on Big Data Management Technologies, we highlighted that (given the lack of upfront relational data modeling) Hadoop and other simple key-value databases encouraged experimentation that could lead to quick insights. But as query patterns emerge, "… more refined data structures, data transformation, and data access processes can be built (including interfaces to relational RDBMSs) that make subsequent inquiries easy to repeat." In practice this means throwing data into Hadoop, observing how users interact with it, then building relational data marts accordingly. The vision of the HadoopDB development team fits this workflow perfectly: they envision a system that initially loads all data into HDFS, then takes advantage of observed query patterns to dynamically load the right data slices into relational data structures.
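The "observe query patterns, then promote hot slices" loop described above can be sketched in a few lines. This is a hypothetical illustration, not HadoopDB code: the names, the threshold, and the promotion rule are all assumptions made for the example.

```python
# Illustrative sketch of the workflow described above (all names and the
# threshold are hypothetical): log the access pattern of each ad-hoc query
# against the raw HDFS data, and once a (table, columns) pattern recurs
# often enough, flag that slice for loading into a relational data mart.
from collections import Counter

MATERIALIZE_THRESHOLD = 3  # assumed cutoff for "a query pattern has emerged"

query_log = Counter()

def observe(table, columns):
    """Record one ad-hoc query's access pattern."""
    query_log[(table, tuple(sorted(columns)))] += 1

def slices_to_materialize():
    """Return the (table, columns) slices seen often enough to promote."""
    return [pattern for pattern, n in query_log.items()
            if n >= MATERIALIZE_THRESHOLD]

for _ in range(3):
    observe("jobs", ["title", "posted_date"])
observe("jobs", ["salary"])

print(slices_to_materialize())  # [('jobs', ('posted_date', 'title'))]
```

The point of the sketch is only the shape of the feedback loop: queries run against the raw store first, and the relational structures are built after the fact, from evidence about what users actually ask.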
Admittedly, the HadoopDB team still needs to release tools that make their system easier to use and deploy. The development team consists entirely of Yale CS department members, though Professor Abadi hopes that open source developers will start contributing to the project. And if a paid gig is what you're after, the good news is that they're in search of a Chief Hacker.
We were among the first users of Greenplum. In partnership with SimplyHired and Greenplum, we actively maintain a data warehouse that contains most U.S. online job postings dating back to mid-2005.