Ellen Friedman

Ellen Friedman is a consultant and commentator, currently writing mainly about big data topics. She is a committer for the Apache Mahout project and a contributor to the Apache Drill project. With a PhD in Biochemistry, she has years of experience as a research scientist and has written about a variety of technical topics including molecular biology, nontraditional inheritance, and oceanography. Ellen is also co-author of a book of magic-themed cartoons, A Rabbit Under the Hat. Ellen is on Twitter at @Ellen_Friedman.

Apache Drill: Tracking its history as an open source community

A strong, open user community needs to be fostered to reveal its potential.


A strong user community is essential to releasing the full potential of an open source project, and this influence is particularly important now for the newly developed Apache Drill project. Drill is a highly scalable SQL query engine for interactive access to a wide range of big data sources and formats. Some of the ways users have an impact are an expected part of the development process: by trying the software and reporting their experiences and use cases, users in the Drill community provide valuable feedback to developers as well as raise awareness with a larger audience of what this big data tool has to offer.

This advantage was especially important with early versions of the software; users have helped development of Drill from early days by reporting bugs and praising features that they like. And now, as Drill is reaching maturity and refinement, users likely will also provide additional innovations: experimenting with Drill in their own projects, they may find new ways to use it that had not occurred to the developers.

Drill’s flexibility and extensibility lend themselves to innovation, but there’s also a natural tendency for this type of change because the big data and Hadoop landscape also are evolving quickly. In the case of Drill, we’re seeing the “unexpectedness benefit” of openness: the community gets out ahead of the leadership in use cases and technological change.

The first big Apache Drill design meeting in September 2012 in San Jose set the tone of openness and inclusion. This was an open meeting, organized by Drill co-founder Tomer Shiran and Drill mentor Ted Dunning, and sponsored by MapR Technologies through the Bay Area Apache Drill User Group. More than 60 people attended in person, and Webex connected a larger, international audience. I recall that in addition to speaker-led presentations and discussion, long strips of paper were mounted around the room for participants to write on during breaks in order to provide ideas or offer specific ways they might want to be involved. Practical steps like this surfaced good ideas immediately, and signaled openness for future ones. Read more…

Comments: 2

Big data, interactive access: How Apache Drill makes it easy

True SQL queries? Yes. Parquet and other complex data structures? Yes. Drill 1.1 is full of surprises.


Register for the free webcast “Easy, real-time access to data with Apache Drill,” which will be held Thursday, July 30, 2015, at 10 a.m. PT. This panel discussion will explore the major role SQL-on-Hadoop technologies play in organizations.

Big data techniques are becoming mainstream in an increasing number of businesses, but how do people get self-service, interactive access to their big data? And how do they do this without having to train their SQL-literate employees to be advanced developers?

One solution is to take advantage of the rapidly maturing open source, open community software tool known as Apache Drill. Drill is not the first SQL-on-Hadoop tool. It is, however, a new and very sophisticated highly scalable SQL query engine that has been built from the ground up to be appropriate for use even in production settings. Drill extends query capabilities to a variety of new data sources and formats without the requirement for IT intervention that might be expected from a SQL query engine. In short, Drill allows self-exploration of data by providing flexibility along with performance.

As capabilities in the big data world have progressed, our understanding of what is needed for high-performance, enterprise-grade architectures have also increased. A need for a SQL solution for the Hadoop and NoSQL space was recognized fairly early, and it’s not surprising that to meet an urgent need, some of the first tools approached the problem with SQL-like syntax and made compromises that led to limitations in the data sources and formats they could handle well. Read more…

Comment: 1

New approaches to anomaly detection

A practical example of how anomaly detection makes complex data problems easier to solve.


As new tools for distributed storage and analysis of big data are becoming more stable and widely known, there is a growing need for discovering best practices for analytics at this scale. One of the areas of widespread interest that crosses many verticals is anomaly detection.

At its best, anomaly detection is used to find unusual, rarely occurring events or data for which little is known in advance. Examples include changes in sensor data reported for a variety of parameters, suspicious behavior on secure websites, or unexpected changes in web traffic. In some cases, the data patterns being examined are simple and regular and, thus, fairly easy to model.

Anomaly detection approaches start with some essential but sometimes overlooked ideas about anomalies:

  • Anomalies are defined not by their own characteristics but in contrast to what is normal.

Thus …

  • Before you can spot an anomaly, you first have to figure out what “normal” actually is.

This need to first discover what is considered “normal” may seem obvious, but it is not always obvious how to do it, especially in situations with complicated patterns of behavior. Best results are achieved when you use statistical methods to build an adaptive model of events in the system you are analyzing as a first step toward discovering anomalous behavior. Read more…

Comment: 1

An Invitation to Practical Machine Learning

PracticalMachineLearning_covDoes it make sense for me to have a car? If so, which one is the best choice for my needs: a gasoline, hybrid, or electric?  And should I buy or lease?

In order to make an effective decision, I need to understand key issues about the design, performance, and cost of cars, regardless of whether or not I actually know how to build one myself. The same is true for people deciding if machine learning is a good choice for their business goals or project.  Will the payoff be worth the effort?  What machine learning approach is most likely to produce valuable results for your particular situation? What size team with what expertise is necessary to be able to develop, deploy, and maintain your machine learning system?

Given the complex and previously esoteric nature of machine learning as a field – the sometimes daunting array of learning algorithms and the math needed to understand and employ them – many people feel the topic is one best left only to the few.

Read more…

Comments: 2