In March 2015, database pioneer Michael Stonebraker was awarded the 2014 ACM Turing Award “for fundamental contributions to the concepts and practices underlying modern database systems.” In this week’s Radar Podcast, O’Reilly’s Mike Hendrickson sits down with Stonebraker to talk about winning the award, the future of data science, and the importance — and difficulty — of data curation.
One size does not fit all
Stonebraker notes that since about 2000, everyone has realized they need a database system, across markets and across industries. “Now, it’s everybody who’s got a big data problem,” he says. “The business data processing solution simply doesn’t fit all of these other marketplaces.” Stonebraker talks about the future of data science — and data scientists — and the tools and skill sets that are going to be required:
It’s all going to move to data science as soon as enough data scientists get trained by our universities to do this stuff. It’s fairly clear to me that you’re probably not going to retread a business analyst to be a data scientist because you’ve got to know statistics, you’ve got to know machine learning. You’ve got to know what regression means, what Naïve Bayes means, what k-Nearest Neighbors means. It’s all statistics.
All of that stuff turns out to be defined on arrays. It’s not defined on tables. The tools of future data scientists are going to be array-based tools. Those may live on top of relational database systems. They may live on top of an array database system, or perhaps something else. It’s completely open.
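To make the "defined on arrays" point concrete, here is a minimal sketch of k-Nearest Neighbors written directly against a two-dimensional array of numbers, where each row is an observation and each column a feature. The function name `knn_predict`, the toy data, and the choice of Euclidean distance are illustrative assumptions, not something Stonebraker describes; a real array-based tool would use an optimized array library rather than plain Python lists.

```python
import math
from collections import Counter

def knn_predict(X, y, query, k=3):
    """Classify `query` by majority vote among its k nearest rows of X."""
    # Euclidean distance from the query to every row of the feature array.
    distances = [math.dist(row, query) for row in X]
    # Row indices ordered by distance, truncated to the k closest.
    nearest = sorted(range(len(X)), key=distances.__getitem__)[:k]
    # Majority label among those neighbors.
    votes = Counter(y[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: rows are observations, columns are features.
X = [[1.0, 1.1], [1.2, 0.9], [0.8, 1.0], [5.0, 5.2], [5.1, 4.9], [4.8, 5.0]]
y = ["low", "low", "low", "high", "high", "high"]

print(knn_predict(X, y, [1.0, 1.0]))
print(knn_predict(X, y, [5.0, 5.0]))
```

Nothing in the computation refers to named columns, keys, or joins. It is rows, columns, and distances, which is the shape of problem an array-oriented tool is built for.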
Getting meaning out of unstructured data
Gathering, processing, and analyzing unstructured data presents unique challenges. Stonebraker says the real problem lies with semi-structured data, and that “relational database systems are doing just fine on that”:
When you say unstructured data, you mean one of two things. You either mean text or you mean semi-structured data. Mostly, the NoSQL guys are talking about semi-structured data. When you say unstructured data, I think text. … Everybody who’s trying to get meaning out of text has an application-specific parser because they’re not interested in general natural language processing. They’re interested in specific kinds of things. They’re all turning that into semi-structured data. The real problem is on semi-structured data. Text is converted to semi-structured data. … I think relational database systems are doing just fine on that. … Most any database system is happy to ingest that stuff. I don’t see that being a hard problem.
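An application-specific parser of the kind Stonebraker describes can be sketched in a few lines. The example below assumes a hypothetical application that only cares about drug and dose mentions in free text; the pattern `DOSE_PATTERN`, the function `parse_doses`, and the sample note are all invented for illustration. It makes no attempt at general natural language processing and simply emits semi-structured records.

```python
import json
import re

# Hypothetical pattern for one narrow kind of mention: "<drug> <amount> <unit>".
# A real parser would be tuned to whatever the application cares about.
DOSE_PATTERN = re.compile(
    r"(?P<drug>[A-Za-z]+)\s+(?P<amount>\d+(?:\.\d+)?)\s*(?P<unit>mg|g|ml)\b",
    re.IGNORECASE,
)

def parse_doses(text):
    """Return one semi-structured record per drug/dose mention in free text."""
    return [
        {
            "drug": m["drug"].lower(),
            "amount": float(m["amount"]),
            "unit": m["unit"].lower(),
        }
        for m in DOSE_PATTERN.finditer(text)
    ]

note = "Started aspirin 81 mg daily; switched to ibuprofen 200mg after two weeks."
records = parse_doses(note)
print(json.dumps(records, indent=2))
```

The output is a list of small JSON-style records, which is the form that most database systems can ingest without difficulty.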
Data curation at scale
Data curation, on the other hand, is “the 800-pound gorilla in the corner,” says Stonebraker. “You can solve your volume problem with money. You can solve your velocity problem with money. Curation is just plain hard.” The traditional solution of extract, transform, and load (ETL) works for 10, 20, or 30 data sources, he says, but it doesn’t work for 500. To curate data at scale, you need automation and a human domain expert. Stonebraker explains:
If you want to do it at scale — 100s, to 1000s, to 10,000s — you cannot do it by manually sending a programmer out to look. You’ve got to pick the low-hanging fruit automatically, otherwise you’ll never get there; it’s just too expensive. Any product that wants to do it at scale has got to apply machine learning and statistics to make the easy decisions automatically.
The second thing it has to do is, go back to ETL. You send a programmer out to understand the data source. In the case of Novartis, some of the data they have is genomic data. Your programmer sees an ICU 50 and an ICE 50, those are genetic terms. He has no clue whether they’re the same thing or different things. You’re asking him to clean data where he has no clue what the data means. The cleaning has to be done by what we could call the business owner, somebody who understands the data, and not by an IT guy. … You need domain knowledge to do the cleaning — pick the low-hanging fruit automatically and when you can’t do that, ask a domain expert, who invariably is not a programmer. Ask a human domain expert. Those are the two things you’ve got to be able to do to get stuff done at scale.
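The two-part approach in the quote, deciding the easy cases automatically and sending the rest to a domain expert, can be sketched as a simple triage loop. This is an illustration under stated assumptions, not a description of any product: the string similarity from Python's `difflib` stands in for the machine learning and statistics Stonebraker mentions, and the thresholds `AUTO_MATCH` and `AUTO_DISTINCT`, the function names, and the sample pairs are hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical thresholds; in practice they would be tuned on labeled examples.
AUTO_MATCH = 0.90
AUTO_DISTINCT = 0.50

def similarity(a, b):
    """Crude string similarity in [0, 1]; a stand-in for a trained model's score."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def triage(pairs):
    """Split candidate pairs into automatic decisions and an expert review queue."""
    decided, review_queue = [], []
    for a, b in pairs:
        score = similarity(a, b)
        if score >= AUTO_MATCH:
            decided.append((a, b, "same"))        # easy case: clearly the same
        elif score <= AUTO_DISTINCT:
            decided.append((a, b, "different"))   # easy case: clearly distinct
        else:
            # Ambiguous: only someone who understands the data can decide.
            review_queue.append((a, b, round(score, 2)))
    return decided, review_queue

pairs = [
    ("Novartis AG", "novartis ag"),
    ("ICU 50", "ICE 50"),
    ("dosage", "zip"),
]
decided, review_queue = triage(pairs)
print(decided)
print(review_queue)
```

With these sample pairs, the two terms from the Novartis example land in the review queue: the strings look alike, but similarity alone cannot say whether they mean the same thing, so the question goes to a human who knows the domain.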
Stonebraker discusses the problem of curating data at scale in more detail in his contributed chapter in a new free ebook, Getting Data Right.
This Radar Podcast was a collaboration between O’Reilly, Tamr, and VoltDB. See our statement of editorial independence.
Cropped public domain image on article and category pages via The British Library on Flickr.