Lessons from next-generation data wrangling tools

Drawing inspiration from recent advances in data preparation.


One of the trends we’re following is the rise of applications that combine big data, algorithms, and efficient user interfaces. As I noted in an earlier post, our interest stems from both consumer apps and tools that democratize data analysis. It’s no surprise that one of the areas where “cognitive augmentation” is playing out is data preparation and curation. Data scientists continue to spend a lot of their time on data wrangling, and the increasing number of (public and internal) data sources paves the way for tools that can boost productivity in this critical area.

At Strata + Hadoop World in New York, two presentations from academic spinoff start-ups — by Mike Stonebraker of Tamr, and by Joe Hellerstein and Sean Kandel of Trifacta — focused on data preparation and curation. Data wrangling is just one component of a data science pipeline, and we’re still in the early days of productivity tools for data science, but some of the lessons these companies have learned extend beyond data preparation.

Scalability ~ data variety and size

Not only are enterprises faced with many data stores and spreadsheets, but data scientists also have many more (public and internal) data sources they want to incorporate. In the absence of a global data model, integrating data silos and disparate data sources requires tools for consolidating schemas.

Random samples are great for working through the initial phases, particularly while you’re still familiarizing yourself with a new data set. Trifacta lets users work with samples while they’re developing data wrangling “scripts” that can be used on full data sets.
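
To make the idea concrete, here is a minimal sketch of that sample-first workflow in Python with pandas. The file, column names, and cleaning steps are hypothetical; the point is that the same “script” developed interactively on a sample can be replayed over the full data set.

```python
# A minimal sketch of the sample-first workflow, using pandas.
# File and column names (orders.csv, "order_date", "amount") are hypothetical.
import pandas as pd

def wrangle(df: pd.DataFrame) -> pd.DataFrame:
    """The 'script': cleaning steps developed interactively on a sample."""
    df = df.dropna(subset=["order_date"])
    df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
    df["amount"] = (
        df["amount"].astype(str).str.replace(r"[$,]", "", regex=True).astype(float)
    )
    return df

# Develop and sanity-check the steps on a small random sample...
sample = pd.read_csv("orders.csv", nrows=50_000).sample(5_000, random_state=42)
print(wrangle(sample).describe())

# ...then replay the same script over the full data set in chunks.
cleaned = pd.concat(
    wrangle(chunk) for chunk in pd.read_csv("orders.csv", chunksize=250_000)
)
```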

Empower domain experts

In many instances, you need subject area experts to explain specific data sets that you’re not familiar with. These experts can place data in context and are usually critical in helping you clean and consolidate variables. Trifacta has tools that enable non-programmers to take on data wrangling tasks that used to require a fair amount of scripting.

Consider DSLs and visual interfaces

“Programs written in a [domain specific language] (DSL) also have one other important characteristic: they can often be written by non-programmers…a user immersed in a domain already knows the domain semantics. All the DSL designer needs to do is provide a notation to express that semantics.”

Paul Hudak, 1997

I’ve often used regular expressions for data wrangling, only to come back later unable to read the code I wrote (Joe Hellerstein describes regex as “meant for writing & never reading again”). Programs written in DSLs are concise, easier to maintain, and can often be written by non-programmers.
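
As a rough illustration of that gap (a hypothetical Python sketch, not Trifacta’s DSL), compare a dense regular expression with the same extraction written as small, named steps:

```python
# The same extraction, written two ways. The log format and field names
# are made up for illustration.
import re

line = "2015-02-10 14:32:07 GET /index.html 200"

# Write-once, read-never:
m = re.match(r"^(\S+ \S+) (\w+) (\S+) (\d{3})$", line)

# Closer in spirit to a wrangling DSL: each step says what it does.
def split_on_whitespace(text):
    return text.split()

def rename(fields, names):
    return dict(zip(names, fields))

record = rename(split_on_whitespace(line),
                ["date", "time", "method", "path", "status"])
# {'date': '2015-02-10', 'time': '14:32:07', 'method': 'GET', ...}
```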

Trifacta designed a “readable” DSL for data wrangling but goes one step further: their users “live in visualizations, not code.” Their elegant visual interface is designed to accomplish most data wrangling tasks, but it also lets users access and modify accompanying scripts written in their DSL (power users can also use regular expressions).

These ideas go beyond data wrangling. Combining DSLs with visual interfaces can open up other aspects of data analysis to non-programmers.

Intelligence and automation

If you’re dealing with thousands of data sources, then you’ll need tools that can automate routine steps. Tamr’s next-generation extract, transform, load (ETL) platform uses machine learning in a variety of ways, including schema consolidation and expert (crowd) sourcing.
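
As a toy illustration of what machine-assisted schema consolidation involves (this is not Tamr’s algorithm; the schemas and confidence threshold below are made up), consider scoring candidate column matches across two silos and routing low-confidence pairs to a domain expert:

```python
# A toy sketch: auto-map columns whose names look similar, and ask an
# expert about the rest. Schemas and the 0.7 threshold are hypothetical.
from difflib import SequenceMatcher

silo_a = ["cust_name", "phone_no", "zip"]
silo_b = ["customer_name", "telephone", "postal_code", "zip_code"]

def name_similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

for col_a in silo_a:
    best = max(silo_b, key=lambda col_b: name_similarity(col_a, col_b))
    score = name_similarity(col_a, best)
    if score >= 0.7:
        print(f"auto-map  {col_a} -> {best}  ({score:.2f})")
    else:
        print(f"ask expert: does {col_a} map to {best}?  ({score:.2f})")
```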

Many data analysis tasks involve a handful of data sources that require painstaking data wrangling along the way. Scripts to automate data preparation are needed for replication and maintenance. Trifacta looks at user behavior and context to produce “utterances” of its DSL, which users can then edit or modify.
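
Here is a very small sketch of that predict-from-behavior idea, under the assumption that a single example edit is enough to propose a delimiter-split transform. The function and data are hypothetical stand-ins, not Trifacta’s implementation:

```python
# From one example edit, guess a delimiter-split transform and propose it
# as an editable "utterance". Data and transform syntax are hypothetical.
def suggest_split(raw, user_result):
    """Return a candidate transform if some delimiter reproduces the edit."""
    for delim in [",", "|", ";", "\t", " "]:
        if raw.split(delim) == user_result:
            return "split(column, on={!r})".format(delim)
    return None

suggestion = suggest_split("Doe|John|NY", ["Doe", "John", "NY"])
print(suggestion)  # split(column, on='|') -- the user can accept or edit it
```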

Don’t forget about replication

If you believe the adage that data wrangling consumes a lot of time and resources, then it goes without saying that tools like Tamr and Trifacta should produce reusable scripts and track lineage. Other aspects of data science — for example, model building, deployment, and maintenance — need tools with similar capabilities.
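
A minimal sketch of what “reusable scripts with lineage” can look like in plain Python follows; the Pipeline class, step names, and data are hypothetical, but the pattern — record every step so it can be replayed and audited — is the point:

```python
# Each step is recorded so the same pipeline can be replayed on new data
# and audited later. Step names and data are hypothetical.
class Pipeline:
    def __init__(self):
        self.steps = []    # (name, function) pairs: the reusable script
        self.lineage = []  # what ran, in what order, on what source

    def add(self, name, fn):
        self.steps.append((name, fn))
        return self

    def run(self, data, source="unknown"):
        for name, fn in self.steps:
            data = fn(data)
            self.lineage.append({"step": name, "source": source,
                                 "rows": len(data)})
        return data

pipe = (Pipeline()
        .add("drop_blanks", lambda rows: [r for r in rows if r.strip()])
        .add("lowercase", lambda rows: [r.lower() for r in rows]))

clean = pipe.run(["Alice", "", "BOB"], source="names.txt")
print(clean)         # ['alice', 'bob']
print(pipe.lineage)  # a replayable, auditable record of what happened
```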

For recent developments in data preparation and data curation, come to Strata + Hadoop World in San Jose and attend presentations by Joe Hellerstein and Mike Stonebraker.

Cropped image by 白士 李 on Flickr, used under a Creative Commons license.



  • buggyfunbunny

    IFF DSLs were the answer, then all those 4GLs from the ’80s and ’90s would be ruling the world, not java and C and COBOL.

  • There is a lot more to data wrangling than Big Data. Sure, “data scientists” spend a lot of time wrangling big data in Hadoop. But what about all those “regular analysts” who don’t have a formal data warehouse maintained by an IT staff, but who need to blend and integrate data from a host of complex, heterogeneous systems?

    For example, a typical marketing analyst might need to blend and analyze data from Salesforce, Marketo, Eloqua, Hubspot, MS-Dynamics, and so forth. Or a business analyst might need to combine data from on-premise sources like Oracle, MS-SQLServer, DB2, or MySQL with data from cloud sources like Google Analytics and Salesforce, not to mention data sitting on desktops in Excel, as well as big data in MongoDB and Hadoop.

    That’s a lot of complex data blending for non-scientists! But the good news is, there is a new crop of tools (in addition to Tamr and Trifacta) for letting “typical analysts” do that integration and blending themselves. Tools like Informatica Rev, Paxata, and Progress Easyl are designed to make complex data wrangling and blending fairly easy. (Caveat, I’m the Progress Easyl product manager, but I wanted to be fair in mentioning all the competition.)

    Good luck with your wrangling!

  • DF

    And, like Rich, to be fair: I’m a VP at IRI, where CoSort has been wrangling data since long before it was called that (google “data franchising”) … just to throw another DSL (the CoSort SortCL 4GL program) into the mix. CoSort is supported in an Eclipse GUI for automated prep-job building and visual workflow, but more importantly it combines huge data transformations in the same pass with data masking, cleansing, and sub-setting (hand-offs) for BI and analytic tools, so visualizations happen many times sooner than they ordinarily would. Regardless of the data preparation tool, we’ve found that externalizing data integration is beneficial for several other reasons, including removing that burden from the BI layer, enabling the reuse of the data, and mitigating the risk of data synchronization problems.