Cloudera Impala: Bringing the SQL and Hadoop Worlds Together

By John Russell

When I came to work on the Cloudera Impala project, I found many things that were familiar from my previous experience with relational databases, UNIX systems, and the open source world. Other aspects were entirely new to me. I know from documenting both enterprise software and open source projects that it’s a special challenge when those two worlds converge. Many new users arrive with 95% of the information they need, but they don’t know which 5% is missing or outdated. One mistaken assumption or unfamiliar buzzword can make someone feel like a complete beginner. That’s why I was happy to have the opportunity to write this overview article, with room to explore how users from all kinds of backgrounds can understand and start using the Cloudera Impala product.

For database users, the Apache Hadoop ecosystem can feel like a new world:

Sysadmins don’t bat an eye when you say you want to work on terabytes or petabytes of data.
A networked cluster of machines isn’t a complicated or scary proposition. Instead, it’s the standard environment you ask an intern to set up on their first day as a training exercise.
The related open source projects aren’t an either-or proposition. You work with a dozen components that all interoperate, stringing them together like a UNIX toolchain.
Instead of working to a rigid predefined schema, you can bring in raw data and define the columns and data types afterwards, without copying or reorganizing the data. (Hadoop users call this notion “schema on read.”)
You can query or produce the same data files through SQL and other Hadoop components and frameworks. Being able to access the same files in flexible ways helps to shorten the ETL window by avoiding copying and conversion steps.
There’s a clearer divide between data warehouse-style operations (such as partitioning, large block sizes, columnar storage, and so on) and traditional OLTP operations (such as single-row queries or partial updates).
Some parts of the system show their roots with interfaces that feel Java-like, Ruby-like, and so on. (Impala hides those implementation details to give a familiar feel to command-line interaction and API-driven operations.)
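
To make the “schema on read” idea concrete, here is a minimal Python sketch. It is only an illustration of the concept, not how Impala or Hadoop implement it: the sample data, the column names, and the read_with_schema helper are all hypothetical. The point is that the raw text is never copied or rewritten; the column names and types are applied only at the moment the data is read, so two different schemas can interpret the same bytes.

```python
import csv
import io

# Hypothetical raw file contents. On a real cluster these bytes would
# sit untouched in a data file; here a string stands in for that file.
RAW = "1,widget,9.99\n2,gadget,24.50\n"

# Two schemas (column name, type converter) for the same raw data.
SCHEMA_TYPED = [("id", int), ("name", str), ("price", float)]
SCHEMA_STRINGS = [("id", str), ("name", str), ("price", str)]


def read_with_schema(raw_text, schema):
    """Interpret each raw line using the given (column, type) pairs.

    The raw text itself is not modified; the schema is applied only
    while reading.
    """
    rows = []
    for fields in csv.reader(io.StringIO(raw_text)):
        row = {name: cast(value) for (name, cast), value in zip(schema, fields)}
        rows.append(row)
    return rows


typed_rows = read_with_schema(RAW, SCHEMA_TYPED)
string_rows = read_with_schema(RAW, SCHEMA_STRINGS)

# With the typed schema, numeric operations work directly.
total_price = sum(row["price"] for row in typed_rows)
print(typed_rows[0])
print(string_rows[0])
```

In SQL terms, the rough counterpart is declaring a table on top of data files that already exist, rather than loading the data into a table defined up front.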

For experienced Hadoop users, SQL can feel like a new world too. A shop that already uses Hadoop might have a standardized workflow, file formats, cluster configuration, and so on. They’ve already set the low-level tuning knobs that control performance. They might already be running SQL queries through the Apache Hive component, just with extra latency that makes that technique unsuitable when you need interactive response times. It’s important to know how Impala fits cleanly into this world, leverages the existing infrastructure and file formats, and plays nicely with all the standard Hadoop components.

I wanted to give an overview that didn’t require the reader to already be an expert in Hadoop, Hive, Java, or some particular database system. With Impala, a little SQL and UNIX experience is all you really need. The patterns are familiar, even if the terminology is a little different. End users don’t need to concern themselves with the underlying plumbing. But depending on where they’re coming from, they might have definite ideas about which logical or physical aspects are important.

I also wanted to take a different approach from the official Impala documentation, which, like all product manuals, has to be comprehensive for all features and use cases. That requirement leads naturally to a bottom-up, just-the-facts style. For a new user, repetition and analogies are important to help establish a good mental model. The early adopters of these technologies typically have enough expertise that they don’t need this kind of background information. They’re the ones who have already pushed past the boundaries of other technologies and have very specific requirements about how the new paradigm should work. The early adopters are already taking up and using Impala. This article is geared toward the next wave, the broad audience of SQL and database users.

Download Cloudera Impala for free.
