Big data, interactive access: How Apache Drill makes it easy

True SQL queries? Yes. Parquet and other complex data structures? Yes. Drill 1.1 is full of surprises.


Register for the free webcast “Easy, real-time access to data with Apache Drill,” which will be held Thursday, July 30, 2015, at 10 a.m. PT. This panel discussion will explore the major role SQL-on-Hadoop technologies play in organizations.

Big data techniques are becoming mainstream in an increasing number of businesses, but how do people get self-service, interactive access to their big data? And how do they do this without having to train their SQL-literate employees to be advanced developers?

One solution is to take advantage of the rapidly maturing open source, open community software tool known as Apache Drill. Drill is not the first SQL-on-Hadoop tool. It is, however, a new, sophisticated, and highly scalable SQL query engine, built from the ground up for use even in production settings. Drill extends query capabilities to a variety of new data sources and formats without the IT intervention that a SQL query engine might be expected to require. In short, Drill allows self-exploration of data by providing flexibility along with performance.

As capabilities in the big data world have progressed, our understanding of what is needed for high-performance, enterprise-grade architectures has also increased. The need for a SQL solution for the Hadoop and NoSQL space was recognized fairly early. Not surprisingly, to meet that urgent need, some of the first tools approached the problem with SQL-like syntax and made compromises that limited the data sources and formats they could handle well.

For those of you who are early Hadoop users, you’ve built up experience with SQL-like queries on data stored in Hadoop-based platforms, but most likely, you’ve also got a mental “wish list” of the things you’d like to see improved or added in a SQL-on-Hadoop query engine. The good news for you is that Apache Drill addresses many items likely to be on your wish list.

And for those of you new to the big data world of Hadoop and NoSQL, the good news is that Drill just made it much easier to step into this space and take advantage of new data sources and cost-effective platforms for handling data at scale. How? Drill does this in part by letting you build on your years of experience with SQL and familiar BI tools. Drill is SQL, not “SQL-like.” But Drill is more than just another SQL query system: Drill also extends SQL and handles complex data cleanly without a requirement for a pre-defined schema.

Because Drill was built to meet the needs projected for a maturing big data arena, it should prove valuable in current use cases.

Look at these examples of what Drill can do:

  • Drill uses standard ANSI SQL syntax for queries to meet the need for familiar tools and approaches, especially for business analysts.
  • Drill views address the need for more granular security for big data.
  • Drill was designed for performance on data stored in the new columnar data formats such as Apache Parquet, even when the data stored in this format uses complex structures (nested data).
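
To make the three points above concrete, here is a minimal sketch of what such queries might look like. The file path, column names, nested `shipping.city` field, and the view name `dfs.tmp.orders_public` are all hypothetical, chosen only for illustration. The helper builds the JSON body that, to my understanding, Drill's REST interface accepts for SQL queries; it does not send anything, so it runs without a Drill cluster.

```python
import json

# Hypothetical file path; `dfs` is assumed to be a configured file system plugin.
ORDERS = "dfs.`/data/orders.parquet`"

# 1. Standard SQL directly over a Parquet file, with no table defined beforehand.
top_customers = f"""
SELECT o.customer_id, SUM(o.total) AS revenue
FROM {ORDERS} o
GROUP BY o.customer_id
ORDER BY revenue DESC
LIMIT 10
""".strip()

# 2. A view exposing only selected columns, so that access could be granted
#    to the view rather than to the raw file (view name is hypothetical).
# 3. `o.shipping.city` reads a field nested inside a complex column.
safe_view = f"""
CREATE OR REPLACE VIEW dfs.tmp.orders_public AS
SELECT o.order_id, o.total, o.shipping.city AS city
FROM {ORDERS} o
""".strip()


def drill_payload(sql: str) -> str:
    """Build a JSON request body for a SQL query (assumed REST payload shape)."""
    return json.dumps({"queryType": "SQL", "query": sql})


if __name__ == "__main__":
    print(drill_payload(top_customers))
    print(drill_payload(safe_view))
```

The same SQL strings could equally be pasted into a Drill shell or sent from a BI tool over JDBC or ODBC; the Python wrapper is only a convenient way to show them together.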

How useful is Drill’s performance with Parquet data? Very. Increasingly, people want to take advantage of Parquet’s compression and efficient structure. Netflix and Twitter are two well-known organizations using Parquet at large scale. Drill also handles nested data stored as JSON, another data format being widely used, in part thanks to widespread use of JavaScript in Web applications. Drill can even analyze the heavily nested data contained in Drill’s own performance logs (or even Impala’s logs). Other query tools, especially those based on SQL, have a very hard time dealing with such complex data.

This is an exciting time to get involved with Drill. Version 1.1 was released less than a month ago, with some excellent additions, including automatic partitioning of Parquet files, new windowing functions, FLATTEN support for very large complex objects, improved data access via a better JDBC driver, and improvements for the MongoDB plug-in.
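
For readers who have not met FLATTEN before, the following sketch emulates its effect in plain Python: each element of a repeated (array) field becomes its own row, with the other columns copied alongside. This is an illustration of the semantics only, not Drill's implementation, and the `orders` records and field names are invented for the example.

```python
from typing import Any, Dict, Iterator, List

# Roughly what a Drill query such as
#   SELECT t.order_id, FLATTEN(t.items) AS items FROM dfs.`/data/orders.json` t
# would express (the path and columns here are hypothetical).


def flatten(rows: List[Dict[str, Any]], column: str) -> Iterator[Dict[str, Any]]:
    """Yield one output row per element of the array stored in `column`."""
    for row in rows:
        for element in row.get(column, []):
            out = dict(row)         # copy the other columns unchanged
            out[column] = element   # replace the array with a single element
            yield out


orders = [
    {"order_id": 1, "items": [{"sku": "A", "qty": 2}, {"sku": "B", "qty": 1}]},
    {"order_id": 2, "items": [{"sku": "C", "qty": 5}]},
]

flat = list(flatten(orders, "items"))

if __name__ == "__main__":
    for row in flat:
        print(row)
```

Once the nested items are one per row, ordinary SQL operations such as filtering, grouping, and joining apply to them directly.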

Year of the Drill user

One of the most exciting aspects of the Apache Drill project is the new round of innovation underway as the community of users grows. Users are putting Drill to work as it was originally designed to be used: to help them make better use of cost-effective, scalable Hadoop and NoSQL technologies, to get better time-to-value in their projects, and to discover new insights from their data.

Users are also beginning to use Drill in ways we haven’t predicted; Drill users are now the innovators.

To explore what Drill can do and how you might use it, join a Drill discussion in the free O’Reilly Community webcast sponsored by MapR Technologies on Thursday, July 30, 2015, at 10:00 a.m. PT. You’ll hear a conversation that includes the viewpoints of someone who helped build Drill and someone who uses it.

The panel includes:

  • A leading Drill architect and Apache Drill PMC chair, Jacques Nadeau
  • An early end-user of Drill and Chief Architect for Data and Information Management at Cisco, Piyush Bhargava
  • A big data expert and Research Director at 451 Research, Matt Aslett
  • A Hadoop expert and VP of Marketing at MapR, Steve Wooledge

Register for the webcast here.

This post is a collaboration between O’Reilly and MapR. See our statement of editorial independence.

Cropped public domain image via the British Library on Flickr.


  • Ilya Geller

    SQL-on-Hadoop technologies obsolete.
    SQL, Structured Query Language obtains and uses patterns from queries and statistics on how often they are employed; neither the queries, nor patterns, nor statistics have anything in common with data itself, they are EXTERNAL.
    I, however, discovered and patented how to structure any data without SQL, the queries – INTERNALLY: Language has its own INTERNAL parsing, indexing and statistics and can be structured INTERNALLY. (For more details please browse on my name ‘Ilya Geller’.)
    For instance, there are two sentences:
    a) ‘Sam!’
    b) ‘A loud ringing of one of the bells was followed by the appearance of a smart chambermaid in the upper sleeping gallery, who, after tapping at one of the doors, and receiving a request from within, called over the balustrades -‘Sam!’.’
    Evidently, ‘Sam’ has a different importance in each sentence, given the extra information in the second. This distinction is reflected in the weights of the phrases that contain ‘Sam’: the first has 1, the second 0.08; the greater weight signifies stronger emotional ‘acuteness’, where the weight refers to the frequency with which a phrase occurs in relation to other phrases.
    Being structured information (for instance, advertisements) searches for passive, invisible at Internet people, based on their profiles of structured data.
    SQL cannot produce the above statistics and know about my other novelties – SQL depicts data from outside, and no portrait can create an ideal representation of its object – SQL and whatever uses SQL are obsolete and out of business.