"data science" entries

Apache Drill: Tracking its history as an open source community

A strong, open user community needs to be fostered to reveal its potential.


A strong user community is essential to releasing the full potential of an open source project, and this influence is particularly important now for the newly developed Apache Drill project. Drill is a highly scalable SQL query engine for interactive access to a wide range of big data sources and formats. Some of the ways users have an impact are an expected part of the development process: by trying the software and reporting their experiences and use cases, users in the Drill community provide valuable feedback to developers as well as raise awareness with a larger audience of what this big data tool has to offer.

This advantage was especially important with early versions of the software; users have helped Drill’s development from its early days by reporting bugs and praising the features they like. And now, as Drill is reaching maturity and refinement, users likely will also provide additional innovations: experimenting with Drill in their own projects, they may find new ways to use it that had not occurred to the developers.

Drill’s flexibility and extensibility lend themselves to innovation, and this type of change also comes naturally because the big data and Hadoop landscape is evolving quickly. In the case of Drill, we’re seeing the “unexpectedness benefit” of openness: the community gets out ahead of the leadership in use cases and technological change.

The first big Apache Drill design meeting in September 2012 in San Jose set the tone of openness and inclusion. This was an open meeting, organized by Drill co-founder Tomer Shiran and Drill mentor Ted Dunning, and sponsored by MapR Technologies through the Bay Area Apache Drill User Group. More than 60 people attended in person, and Webex connected a larger, international audience. I recall that in addition to speaker-led presentations and discussion, long strips of paper were mounted around the room for participants to write on during breaks in order to provide ideas or offer specific ways they might want to be involved. Practical steps like this surfaced good ideas immediately, and signaled openness for future ones. Read more…

Comments: 2

Build better machine learning models

A beginner's guide to evaluating your machine learning models.


Everything today is being quantified, measured, and tracked — everything is generating data, and data is powerful. Businesses are using data in a variety of ways to improve customer satisfaction. For instance, data scientists are building machine learning models to generate intelligent recommendations to users so that they spend more time on a site. Analysts can use churn analysis to predict which customers are the best targets for the next promotional campaign. The possibilities are endless.

However, there are challenges in the machine learning pipeline. Typically, you build a machine learning model on top of your data. You collect more data. You build another model. But how do you know when to stop?

When is your smart model smart enough?

Evaluation is a key step when building intelligent business applications with machine learning. It is not a one-time task, but must be integrated with the whole pipeline of developing and productionizing machine learning-enabled applications.
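As a rough illustration of what one such evaluation step can look like, here is a minimal sketch in plain Python. It is not taken from the report: the function names (`train_test_split`, `accuracy`, `majority_baseline`) and the toy data are assumptions made for this example. It holds out part of the data, then scores a trivial baseline on the held-out part so that any real model has a number to beat.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of rows with a fixed seed and split off a held-out test set."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(predict, examples):
    """Fraction of (features, label) pairs for which predict(features) == label."""
    if not examples:
        raise ValueError("need at least one example")
    correct = sum(1 for features, label in examples if predict(features) == label)
    return correct / len(examples)


def majority_baseline(train):
    """Hypothetical stand-in model: always predicts the most common training label."""
    labels = [label for _, label in train]
    most_common = max(sorted(set(labels)), key=labels.count)
    return lambda features: most_common


if __name__ == "__main__":
    # Toy data, invented for illustration: (feature, label) pairs.
    data = [(i, 1) for i in range(8)] + [(i, 0) for i in range(4)]
    train, test = train_test_split(data)
    baseline = majority_baseline(train)
    print("baseline accuracy on held-out data:", accuracy(baseline, test))
```

Scoring on data the model never saw during training is the point of the split; rerunning the same measurement as new data arrives is one simple way to make evaluation a recurring part of the pipeline rather than a one-time task.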

In a new free O’Reilly report Evaluating Machine Learning Models: A Beginner’s Guide to Key Concepts and Pitfalls, we cut through the technical jargon of machine learning, and elucidate, in simple language, the processes of evaluating machine learning models. Read more…


Big data is changing the face of fashion

How the fashion industry is embracing algorithms, natural language processing, and visual search.


Download Fashioning Data: A 2015 Update, our updated free report exploring data innovations from the fashion industry.

Fashion is an industry that struggles for respect — despite its enormous size globally, it is often viewed as frivolous or unnecessary.

And it’s true — fashion can be spectacularly silly and wildly extraneous. But somewhere between the glitzy, million-dollar runway shows and the ever-shifting hemlines, a very big business can be found. One industry profile of the global textiles, apparel, and luxury goods market reported that fashion had total revenues of $3.05 trillion in 2011, and is projected to create $3.75 trillion in revenues in 2016.

Solutions for a unique business problem

The majority of clothing purchases are made not out of necessity, but out of a desire for self-expression and identity — two remarkably difficult things to quantify and define. Yet, established brands and startups throughout the industry are finding clever ways to use big data to turn fashion into “bits and bytes,” as much as threads and buttons.

In the newly updated O’Reilly report Fashioning Data: A 2015 Update, Data Innovations from the Fashion Industry, we explore applications of big data that carry lessons for industries of all types. Topics range from predictive algorithms to visual search — capturing structured data from photographs — to natural language processing, with specific examples from complex lifecycles and new startups; this report reveals how different companies are merging human input with machine learning. Read more…


Three best practices for building successful data pipelines

Reproducibility, consistency, and productionizability let data scientists focus on the science.

Building a good data pipeline can be technically tricky. As a data scientist who has worked at Foursquare and Google, I can honestly say that one of our biggest headaches was locking down our Extract, Transform, and Load (ETL) process.

At The Data Incubator, our team has trained more than 100 talented Ph.D. data science fellows who are now data scientists at a wide range of companies, including Capital One, the New York Times, AIG, and Palantir. We commonly hear from Data Incubator alumni and hiring managers that one of their biggest challenges is also implementing their own ETL pipelines.

Drawing on their experiences and my own, I’ve identified three key areas that are often overlooked in data pipelines. They are making your analysis:

  1. Reproducible
  2. Consistent
  3. Productionizable

While these areas alone cannot guarantee good data science, getting these three technical aspects of your data pipeline right helps ensure that your data and research results are both reliable and useful to an organization. Read more…
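To make the first of these areas concrete, here is a minimal sketch of one way to check reproducibility, using only the Python standard library. It is an illustration under assumptions, not the author's method: `run_pipeline`, `fingerprint`, and the toy records are hypothetical names and data invented for this example. The idea is to keep randomness behind an explicit seed and to hash the output, so two runs can be compared by digest.

```python
import hashlib
import json
import random


def run_pipeline(records, seed=42, sample_size=3):
    """Toy ETL step: clean the records, then draw a seeded sample."""
    # Sorting after cleaning removes any dependence on input order.
    cleaned = sorted(r.strip().lower() for r in records if r.strip())
    # A local, seeded RNG avoids hidden global state between runs.
    rng = random.Random(seed)
    return rng.sample(cleaned, min(sample_size, len(cleaned)))


def fingerprint(result):
    """Stable SHA-256 digest of a JSON-serializable result."""
    payload = json.dumps(result, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


if __name__ == "__main__":
    raw = [" B", "a ", "", "C", "d"]
    first = fingerprint(run_pipeline(raw))
    second = fingerprint(run_pipeline(list(reversed(raw))))
    print("reproducible:", first == second)
```

Storing the digest alongside each result gives a cheap check that a rerun on the same inputs produced the same output.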

Comments: 6

From search to distributed computing to large-scale information extraction

The O'Reilly Data Show Podcast: Mike Cafarella on the early days of Hadoop/HBase and progress in structured data extraction.

Subscribe to the O’Reilly Data Show Podcast to explore the opportunities and techniques driving big data and data science.

February 2016 marks the 10th anniversary of Hadoop, at a point in time when many IT organizations actively use Hadoop and/or one of the open source big data projects that originated after it and, in some cases, depend on it.

During the latest episode of the O’Reilly Data Show Podcast, I had an extended conversation with Mike Cafarella, assistant professor of computer science at the University of Michigan. Along with Strata + Hadoop World program chair Doug Cutting, Cafarella is the co-founder of both Hadoop and Nutch. In addition, Cafarella was the first contributor to HBase.

We talked about the origins of Nutch, Hadoop (HDFS, MapReduce), HBase, and his decision to pursue an academic career and step away from these projects. Cafarella’s pioneering contributions to open source search and distributed systems fit neatly with his work in information extraction. We discussed a new startup he recently co-founded, ClearCutAnalytics, to commercialize a highly regarded academic project for structured data extraction (full disclosure: I’m an advisor to ClearCutAnalytics). As I noted in a previous post, information extraction (from a variety of data types and sources) is an exciting area that will lead to the discovery of new features (i.e., variables) that may end up improving many existing machine learning systems. Read more…

Comment: 1

The music science trifecta

Digital content, the Internet, and data science have changed the music industry.


Download our new free report “Music Science: How Data and Digital Content are Changing Music,” by Alistair Croll, to learn more about music, data, and music science.

Today’s music industry is the product of three things: digital content, the Internet, and data science. This trifecta has altered how we find, consume, and share music. How we got here makes for an interesting history lesson, and a cautionary tale for incumbents that wait too long to embrace data.

When music labels first began releasing music on compact disc in the early 1980s, it was a windfall for them. Publishers raked in the money as music fans upgraded their entire collections to the new format. However, those companies failed to see the threat to which they were exposing themselves.

Until that point, piracy hadn’t been a concern because copies just weren’t as good as the originals. To make a mixtape using an audio cassette recorder, a fan had to hunch over the radio for hours, finger poised atop the record button — and then copy the tracks stolen from the airwaves onto a new cassette for that special someone. So, the labels didn’t think to build protection into the CD music format. Some companies, such as Sony, controlled both the devices and the music labels, giving them a false belief that they could limit the spread of content in that format.

One reason piracy seemed so far-fetched was that nobody thought of computers as music devices. Apple Computer even promised Apple Records that it would never enter the music industry — and when it finally did, it launched a protracted legal battle that even led coders in Cupertino to label one of the Mac sound effects “Sosumi” (pronounced “so sue me”) as a shot across Apple Records’ legal bow. Read more…

Comment: 1

Showcasing the real-time processing revival

Tools and learning resources for building intelligent, real-time products.


Register for Strata + Hadoop World NYC, which will take place September 29 to October 1, 2015.

A few months ago, I noted the resurgence in interest in large-scale stream-processing tools and real-time applications. Interest remains strong, and if anything, I’ve noticed growth in the number of companies wanting to understand how they can leverage the growing number of tools and learning resources to build intelligent, real-time products.

This is something we’ve observed using many metrics, including product sales, the number of submissions to our conferences, and the traffic to Radar and newsletter articles.

As we looked at putting together the program for Strata + Hadoop World NYC, we were excited to see a large number of compelling proposals on these topics. To that end, I’m pleased to highlight a strong collection of sessions on real-time processing and applications coming up at the event. Read more…


Bridging the divide: Business users and machine learning experts

The O'Reilly Data Show Podcast: Alice Zheng on feature representations, model evaluation, and machine learning models.

Subscribe to the O’Reilly Data Show Podcast to explore the opportunities and techniques driving big data and data science.

As tools for advanced analytics become more accessible, data scientists’ roles will evolve. Most media stories emphasize a need for expertise in algorithms and quantitative techniques (machine learning, statistics, probability), and yet the reality is that expertise in advanced algorithms is just one aspect of industrial data science.

During the latest episode of the O’Reilly Data Show podcast, I sat down with Alice Zheng, one of Strata + Hadoop World’s most popular speakers. She has a gift for explaining complex topics to a broad audience, through presentations and in writing. We talked about her background, techniques for evaluating machine learning models, how much math data scientists need to know, and the art of interacting with business users.

Making machine learning accessible

People who work at getting analytics adopted and deployed learn early on the importance of working with domain/business experts. As excited as I am about the growing number of tools that open up analytics to business users, the interplay between data experts (data scientists, data engineers) and domain experts remains important. In fact, human-in-the-loop systems are being used in many critical data pipelines. Zheng recounts her experience working with business analysts:

It’s not enough to tell someone, “This is done by boosted decision trees, and that’s the best classification algorithm, so just trust me, it works.” As a builder of these applications, you need to understand what the algorithm is doing in order to make it better. As a user who ultimately consumes the results, it can be really frustrating to not understand how they were produced. When we worked with analysts in Windows or in Bing, we were analyzing computer system logs. That’s very difficult for a human being to understand. We definitely had to work with the experts who understood the semantics of the logs in order to make progress. They had to understand what the machine learning algorithms were doing in order to provide useful feedback. Read more…

Comments: 4

Unsupervised learning, attention, and other mysteries

How to almost necessarily succeed: An interview with Google research scientist Ilya Sutskever.

Get notified when our free report “Future of Machine Intelligence: Perspectives from Leading Practitioners” is available for download. The following interview is one of many that will be included in the report.

Ilya Sutskever is a research scientist at Google and the author of numerous publications on neural networks and related topics. Sutskever is a co-founder of DNNresearch and was named Canada’s first Google Fellow.

Key Takeaways:

  1. Since humans can solve perception problems very quickly, despite our neurons being relatively slow, moderately deep and large neural networks have enabled machines to succeed in a similar fashion.
  2. Unsupervised learning is still a mystery, but a full understanding of that domain has the potential to fundamentally transform the field of machine learning.
  3. Attention models represent a promising direction for powerful learning algorithms that require ever less data to be successful on harder problems.

David Beyer: Let’s start with your background. What was the evolution of your interest in machine learning, and how did you zero-in on your Ph.D. work?

Ilya Sutskever: I started my Ph.D. just before deep learning became a thing. I was working on a number of different projects, mostly centered around neural networks. My understanding of the field crystallized when collaborating with James Martens on the Hessian-free optimizer. At the time, greedy layer-wise training (training one layer at a time) was extremely popular. Working on the Hessian-free optimizer helped me understand that if you just train a very large and deep neural network on a lot of data, you will almost necessarily succeed. Read more…

Comment: 1

A “bottom-up” approach to data unification

How machine learning plus expert sourcing can unify customer data at scale.


Watch the free webcast Integrating Customer Data at Scale to learn how Toyota Motor Europe was able to unify its customer data at scale.

Enterprises that are capable of gaining a unified view of their customer data can achieve added business enhancements and user opportunities. Capturing customer data, however, can be a difficult task, as most systems rely on traditional “top-down” approaches to standardizing data. In a recent O’Reilly webcast, Integrating Customer Data at Scale, Tamr field engineer Alan Wagner hosts a Q&A session with Matt Stevens, the general manager at Toyota Motor Europe, to demonstrate how a leading enterprise uses a third-generation system like Tamr to simplify the process of unifying customer data.

In the webcast, Stevens explains how Toyota Motor Europe has gained a 360-degree view of their customers through the Tamr Data Unification Platform, which takes a machine learning and expert-sourcing “human guided workflow” approach to data unification. Wagner provides a demo of the Tamr platform, applied within a Salesforce application, to demonstrate the ability to capture and unify customer data. Read more…