Big data, interactive access: How Apache Drill makes it easy

True SQL queries? Yes. Parquet and other complex data structures? Yes. Drill 1.1 is full of surprises.


Register for the free webcast “Easy, real-time access to data with Apache Drill,” which will be held Thursday, July 30, 2015, at 10 a.m. PT. This panel discussion will explore the major role SQL-on-Hadoop technologies play in organizations.

Big data techniques are becoming mainstream in an increasing number of businesses, but how do people get self-service, interactive access to their big data? And how do they do this without having to train their SQL-literate employees to be advanced developers?

One solution is to take advantage of the rapidly maturing open source, open community software tool known as Apache Drill. Drill is not the first SQL-on-Hadoop tool. It is, however, a new, sophisticated, and highly scalable SQL query engine that has been built from the ground up to be appropriate for use even in production settings. Drill extends query capabilities to a variety of new data sources and formats without the IT intervention that a SQL query engine might be expected to require. In short, Drill allows self-service exploration of data by providing flexibility along with performance.

As capabilities in the big data world have progressed, our understanding of what is needed for high-performance, enterprise-grade architectures has also increased. A need for a SQL solution for the Hadoop and NoSQL space was recognized fairly early, and it’s not surprising that, to meet an urgent need, some of the first tools approached the problem with SQL-like syntax and made compromises that led to limitations in the data sources and formats they could handle well.

For those of you who are early Hadoop users, you’ve built up experience with SQL-like queries on data stored in Hadoop-based platforms, but most likely, you’ve also got a mental “wish list” of the things you’d like to see improved or added in a SQL-on-Hadoop query engine. The good news for you is that Apache Drill addresses many items likely to be on your wish list.

And for those of you new to the big data world of Hadoop and NoSQL, the good news is that Drill just made it much easier to step into this space and take advantage of new data sources and cost-effective platforms for handling data at scale. How? Drill does this in part by letting you build on your years of experience with SQL and familiar BI tools. Drill is SQL, not “SQL-like.” But Drill is more than just another SQL query system: Drill also extends SQL and handles complex data cleanly without a requirement for a pre-defined schema.

Having been built to meet the needs that were projected for a maturing big data arena, Drill is a tool that should prove valuable in current use cases.

Look at these examples of what Drill can do:

  • Drill uses standard ANSI SQL syntax for queries to meet the need for familiar tools and approaches, especially for business analysts.
  • Drill views address the need for more granular security for big data.
  • Drill was designed for performance on data stored with the new columnar data formats such as Apache Parquet even when the data stored in this format uses complex structures (nested data).

How useful is Drill’s performance with Parquet data? Very. Increasingly, people want to take advantage of Parquet’s compression and efficient structure. Netflix and Twitter are two well-known organizations using Parquet at large scale. Drill also handles nested data stored as JSON, another data format being widely used, in part thanks to widespread use of JavaScript in Web applications. Drill can even analyze the heavily nested data contained in Drill’s own performance logs (or even Impala’s logs). Other query tools, especially those based on SQL, have a very hard time dealing with such complex data.

This is an exciting time to get involved with Drill. Version 1.1 was released less than a month ago, with some excellent additions, including automatic partitioning of Parquet files, new windowing functions, FLATTEN support for very large complex objects, improved data access via a better JDBC driver, and improvements for the MongoDB plug-in.
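To make the nested-data point a little more concrete, here is a rough Python sketch of what a FLATTEN-style query does to records that contain an array. It is not Drill code and does not talk to a Drill cluster; it only imitates the row expansion on a few in-memory records. The sample records, the field names (user, orders, sku, qty), the file path in the query string, and the helper function flatten are all made up for illustration.

```python
import json

# Hypothetical nested records, shaped like JSON a web application might emit.
RAW = """
{"user": "ana", "orders": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}]}
{"user": "raj", "orders": [{"sku": "A1", "qty": 5}]}
{"user": "lee", "orders": []}
"""

# Roughly the kind of Drill query being imitated. The path and field names
# are assumptions for this sketch; the query is shown only for comparison
# and is never executed here.
DRILL_SQL = """
SELECT t.`user`, FLATTEN(t.orders) AS ord
FROM dfs.`/tmp/orders.json` t
"""


def flatten(records, array_field, alias):
    """Return one row per element of each record's array_field.

    The other fields of the record are repeated on every row. In this
    sketch, a record whose array is empty or missing yields no rows.
    """
    rows = []
    for rec in records:
        for item in rec.get(array_field, []):
            row = {k: v for k, v in rec.items() if k != array_field}
            row[alias] = item
            rows.append(row)
    return rows


# No schema is declared anywhere: each line is parsed as it is read.
records = [json.loads(line) for line in RAW.splitlines() if line.strip()]
rows = flatten(records, "orders", "ord")

for row in rows:
    print(row["user"], row["ord"]["sku"], row["ord"]["qty"])
```

The three input records become three output rows (two for ana, one for raj, none for lee), which is the shape a business analyst would then filter or aggregate with ordinary SQL.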

Year of the Drill user

One of the most exciting aspects of the Apache Drill project is the new round of innovation underway as the community of users grows. Users are putting Drill to work as it was originally designed to be used: to help them make better use of cost-effective, scalable Hadoop and NoSQL technologies, to get better time-to-value in their projects, and to discover new insights from their data.

Users are also beginning to use Drill in ways we haven’t predicted; Drill users are now the innovators.

To explore what Drill can do and how you might use it, come join a Drill discussion in the free O’Reilly Community webcast sponsored by MapR Technologies on Thursday, July 30, 2015, at 10:00 a.m. PT. You’ll hear a conversation that includes viewpoints from someone who helped build Drill and someone who uses it.

The panel includes:

  • A leading Drill architect and Apache Drill PMC chair, Jacques Nadeau
  • An early end-user of Drill and Chief Architect for Data and Information Management at Cisco, Piyush Bhargava
  • A big data expert and Research Director at 451 Research, Matt Aslett
  • A Hadoop expert and VP of Marketing at MapR, Steve Wooledge

Register for the webcast here.

This post is a collaboration between O’Reilly and MapR. See our statement of editorial independence.

Cropped public domain image via the British Library on Flickr.
