"metadata services" entries
The O’Reilly Data Show podcast: Joe Hellerstein on data wrangling, distributed systems, and metadata services.
In this episode of the O’Reilly Data Show, I spoke with one of the most popular speakers at Strata + Hadoop World: Joe Hellerstein, professor of computer science at UC Berkeley and co-founder/CSO of Trifacta. We talked about his past and current academic research (which spans HCI, databases, and systems), data wrangling, large-scale distributed systems, and his recent work on metadata services.
Data wrangling and preparation
The most interactive tasks that people do with data are essentially data wrangling. You’re changing the form of the data, you’re changing the content of the data, and at the same time you’re trying to evaluate the quality of the data and see if you’re making it the way you want it. … It’s really actually the most immersive interaction that people do with data and it’s very interesting.
Comprehensive metadata collection and analysis can pave the way for many interesting applications.
As I spoke with friends leading up to Strata + Hadoop World NYC 2015, one topic continued to come up: metadata. It’s a topic that data engineers and data management researchers have long thought about because it has significant effects on the systems they maintain and the services they offer. I’ve also been having more and more conversations about applications made possible by metadata collection and analysis.
At the recent Strata + Hadoop World, UC Berkeley professor and Trifacta co-founder Joe Hellerstein outlined why the broader data industry should rally to develop open, vendor-neutral metadata services. He made the case that improvements in metadata collection and sharing can lead to interesting applications and capabilities within the industry.
Below are some of the reasons why Hellerstein believes the data industry should start focusing more on metadata:
Improved data analysis: metadata-on-use
You will never know your data better than when you are wrangling and analyzing it. — Joe Hellerstein
A few years ago, I observed that context-switching, caused by using multiple frameworks, created a lag in productivity. Today’s tools have improved to the point that someone using a single framework like Apache Spark can get many of their data tasks done without having to employ other programming environments. But today’s tools still do a poor job of capturing how people interact and work with data: they do not track in detail the actions and choices analysts make, or the rationales behind them.