Although next-gen big data tools such as Hadoop, Spark, and MongoDB are finding more and more uses, most organizations need to maintain data in traditional relational stores as well. Deriving the benefits of both key/value stores and relational databases takes a lot of juggling. Three basic strategies are currently in use.
- Double up on your data storage. Log everything in your fast key/value repository and duplicate part of it (or perform some reductions and store the results) in your relational data warehouse.
- Store data primarily in a relational data warehouse, and use extract, transform, and load (ETL) tools to make it available for analytics. These tools run a fine-toothed comb through the data to perform string manipulation, remove outlier values, and so on, producing a data set in the format required by downstream processing tools.
- Put each type of data into the repository best suited to it (relational, Hadoop, etc.), but run queries between the repositories and return results from one repository to another for post-processing.
The appeal of the first strategy is its simplicity: it runs well-understood systems in parallel. The second offers business users the familiarity of relational databases. This article focuses on the third solution, which has advantages over the others: it avoids the redundancy of the first and is much easier to design and maintain than the second. I'll describe how it is accomplished by Teradata, through its appliances and cloud solutions, but the building blocks are standard, open source tools such as Hive and HCatalog, so this strategy can be implemented by anyone.
The key to yoking a relational database together with Hadoop in this architecture is to use Hive to impose more structure on the data than one typically uses for MapReduce jobs in Hadoop. Hive, which uses a query syntax modeled on SQL, maintains extra metadata on the Hadoop data that mimics the columns and data types familiar to SQL.
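To make this concrete, here is a minimal sketch of the kind of Hive table definition involved. The table name, columns, and HDFS path are all hypothetical; the point is that Hive overlays a SQL-style schema, with named, typed columns, on raw files already sitting in HDFS:

```sql
-- Hypothetical example: impose a columnar schema on raw
-- tab-delimited log files already stored in HDFS.
CREATE EXTERNAL TABLE web_logs (
  ip_address   STRING,
  request_time TIMESTAMP,
  url          STRING,
  status_code  INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/logs/web';

-- Hive then accepts SQL-like queries against that schema,
-- compiling them to jobs that run over the underlying files.
SELECT status_code, COUNT(*) AS hits
FROM web_logs
GROUP BY status_code;
```

Because the table is declared EXTERNAL, Hive records only the metadata; the files stay where they are in HDFS, and it is this metadata, published through HCatalog, that other tools can consume.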
Let’s say now that you want to run data in Hadoop through SQL queries on your relational data warehouse. There are compelling reasons to do this: powerful relational databases have features not provided by Hive, and offer faster performance on many queries.
According to Ariff Kassam, Vice President of Product Management at Teradata, the best way to make Hadoop data available to relational queries is to create a view in the relational database to contain the desired data. Do not materialize the view, but specify the desired Hive columns. When a query is run against the view in the relational database, Teradata uses HCatalog to load the view with the Hadoop data. In addition, data written to the view is written back to HDFS.
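In outline, such a view might look like the hedged sketch below. The exact mechanism for reaching Hadoop from the warehouse varies by product and version, so the foreign table `web_logs_foreign` (standing in for a Hive table exposed through HCatalog), along with the view and column names, is purely illustrative:

```sql
-- Hypothetical sketch: a non-materialized view in the relational
-- warehouse that selects only the desired Hive columns. Querying
-- the view triggers a fetch of the Hadoop data via HCatalog.
CREATE VIEW recent_errors AS
SELECT ip_address, request_time, url
FROM web_logs_foreign        -- foreign table backed by the Hive table
WHERE status_code >= 500;

-- Business users then query it like any other relational table.
SELECT url, COUNT(*) AS error_count
FROM recent_errors
GROUP BY url;
```

Because the view is not materialized, no Hadoop data is copied into the warehouse ahead of time; it is pulled in only when a query against the view actually runs.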
Software is needed to manage this data flow: Teradata supports such queries through software called QueryGrid both on hardware appliances and in the cloud. QueryGrid also allows you to feed data from the relational or Hadoop stores to analytical platforms. In Teradata’s suite, that role is played by Aster, which offers graphs and other advanced analytical tools. Aster has its own big data solution for executing queries, called SQL-MapReduce. QueryGrid has connectors to a number of other databases as well, such as Oracle and MongoDB.
Further, metadata management and data lineage play a key role in data governance. Teradata Loom scans all data entering Hadoop and adds metadata, including its lineage (when the data was generated and by whom). The metadata from Loom not only helps users find the data they need, but helps them judge its reliability and the purpose for which it was collected. You can also structure Hadoop or HDFS data so that it is directly accessible to SQL.
Integrating a relational database with Hadoop in this way allows almost seamless sharing between the database and Hadoop, and you can run all the analytics supported by either side without duplicating data.
This post is part of a collaboration between O’Reilly and Teradata. See our statement of editorial independence.