The O’Reilly Data Show podcast: Joe Hellerstein on data wrangling, distributed systems, and metadata services.
In this episode of the O’Reilly Data Show, I spoke with one of the most popular speakers at Strata+Hadoop World: Joe Hellerstein, professor of Computer Science at UC Berkeley and co-founder/CSO of Trifacta. We talked about his past and current academic research (which spans HCI, databases, and systems), data wrangling, large-scale distributed systems, and his recent work on metadata services.
Data wrangling and preparation
The most interactive tasks that people do with data are essentially data wrangling. You’re changing the form of the data, you’re changing the content of the data, and at the same time you’re trying to evaluate the quality of the data and see if you’re making it the way you want it. … It’s really actually the most immersive interaction that people do with data and it’s very interesting.
Comprehensive metadata collection and analysis can pave the way for many interesting applications.
As I spoke with friends leading up to Strata + Hadoop World NYC 2015, one topic continued to come up: metadata. It’s a topic that data engineers and data management researchers have long thought about because it has significant effects on the systems they maintain and the services they offer. I’ve also been having more and more conversations about applications made possible by metadata collection and analysis.
At the recent Strata + Hadoop World, UC Berkeley professor and Trifacta co-founder Joe Hellerstein outlined the reasons why the broader data industry should rally to develop open and vendor-neutral metadata services. He made the case that improvements in metadata collection and sharing can lead to interesting applications and capabilities within the industry.
Below are some of the reasons why Hellerstein believes the data industry should start focusing more on metadata:
Improved data analysis: metadata-on-use
You will never know your data better than when you are wrangling and analyzing it. — Joe Hellerstein
A few years ago, I observed that context switching between multiple frameworks created a drag on productivity. Today's tools have improved to the point that someone using a single framework like Apache Spark can get many data tasks done without having to move between programming environments. But today's tools still do a poor job of capturing how people interact and work with data: in particular, the detailed actions and choices analysts make, and the rationales behind them.
Dr. Clare Bernard, former particle physicist at CERN, on solutions for discovering, organizing, and visualizing enterprise data.
During a special edition of The O’Reilly Podcast, host and O’Reilly chief data scientist Ben Lorica interviewed Dr. Clare Bernard, a former particle physicist at CERN, who worked on the ATLAS experiment at the Large Hadron Collider. Bernard is now a field engineer at Tamr, where she’s involved in a new project that aims to integrate and catalog a variety of data across an enterprise, while preserving metadata.
Key takeaways from their chat:
- Many companies run big, top-down master data management projects and put in place data-governance tools that typically don't scale very well.
- It's important to track where the data came from, what the fields mean, and what transformations have been applied over time, so that when you use the data for analytics you really understand what it means.
- Tracking metadata allows you to reproduce your data pipelines and understand the lineage and provenance of your data.
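The takeaways above can be sketched as a minimal lineage record that travels with a dataset. This is an illustrative assumption, not Tamr's (or any vendor's) actual schema; the class and field names are invented for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# A minimal, hypothetical lineage record: where the data came from, what the
# fields mean, and which transformations have been applied over time.
@dataclass
class LineageRecord:
    dataset: str
    source: str                         # where the data came from
    field_meanings: Dict[str, str]      # what each field means
    transformations: List[str] = field(default_factory=list)

    def apply(self, transform: str) -> None:
        """Record a transformation so the pipeline stays reproducible."""
        self.transformations.append(transform)

record = LineageRecord(
    dataset="customers_clean",
    source="crm_export_2015-09.csv",
    field_meanings={"cust_id": "internal customer key"},
)
record.apply("dedupe on cust_id")
record.apply("normalize country codes to ISO 3166-1")
print(record.transformations)
```

Replaying the `transformations` list against the original `source` is what makes the pipeline reproducible; the record itself is the provenance.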
Ben Lorica: Let’s start with a little bit about your background. You are a scientist by training, right?
Clare Bernard: Yes. I was a particle physicist. I worked at CERN for a couple of years on the ATLAS experiment at the Large Hadron Collider. Then I got my Ph.D. and graduated in May, and I've been working at Tamr since then, as a field engineer.
Surprising social media stats
I’ve been filtering Twitter’s firehose for tweets about “#Syria” for about the past week in order to accumulate a sizable volume of data about an important current event. On Friday, I noticed that the tally had surpassed one million tweets, so it seemed like a good time to apply some techniques from Mining the Social Web and explore the data.
While some of the findings from a preliminary analysis confirm common intuition, others are a bit surprising. The remainder of this post explores the tweets with a cursory analysis of the who, what, where, and when of the data.
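A cursory who/what/when tally over collected tweets can be done with a few counters. The sketch below assumes the filtered stream has already been saved as one tweet JSON object per line; it is not the code from Mining the Social Web, and the file path is a placeholder.

```python
import json
from collections import Counter

def summarize(path, top_n=10):
    """Cursory who/what/when tallies over line-delimited tweet JSON."""
    who, what, when = Counter(), Counter(), Counter()
    with open(path) as f:
        for line in f:
            tweet = json.loads(line)
            # Who? -- the accounts tweeting most often
            who[tweet["user"]["screen_name"]] += 1
            # What? -- co-occurring hashtags, case-folded
            what.update(h["text"].lower()
                        for h in tweet["entities"]["hashtags"])
            # When? -- bucket by the leading day-of-week/date prefix
            when[tweet["created_at"][:10]] += 1
    return {"who": who.most_common(top_n),
            "what": what.most_common(top_n),
            "when": sorted(when.items())}
```

Answering "Where?" takes more work, since only a small fraction of tweets carry geo coordinates; user-profile location strings are the usual fallback.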
Harvard offers big data for books, Cloudera's new Hadoop distribution, Splunk goes public.
In this week's data news, Harvard releases millions of library catalog records as part of its Open Metadata Policy, Cloudera unveils the fourth version of its Hadoop distribution, and big data company Splunk has its IPO.
Laura Dawson on why metadata is integral to each stage of publishing.
Publishing consultant Laura Dawson says publishers are starting to come around to the importance of metadata, but they still don't quite get it.
From HTML5 to metadata to managing rights, increasingly complex content management issues fall squarely on publishers.
It's more challenging than ever to handle all aspects of content management internally. In this podcast, Firebrand Technologies founder and president Fran Toolan addresses a myriad of content management issues.
Valla Vakili on Small Demons, contextual discovery and a very different type of metadata.
While some companies try to solve the recommendation problem, Small Demons has other ideas. "Discovery is the ultimate problem we're trying to solve and the ultimate value we're trying to create," says Small Demons founder and CEO Valla Vakili in this interview.
Sebastian Posth on the complexity of digital publishing rights.
Digital publishing has stirred up a number of issues that didn't exist in traditional publishing. In this interview, Sebastian Posth, a partner at A2 Electronic Publishing and a speaker at TOC Frankfurt, talks about the unique issues and why the waters are so muddy.
Decoding book DNA, parsing Wikipedia with WikiHadoop, and the rise of the "Data Civilization"
BookLamp and the Book Genome Project look to book DNA for smarter recommendations, sorting through Wikipedia's vast data dump gets easier thanks to WikiHadoop, and a timeline from WolframAlpha charts major milestones in data knowledge.