"Data Pipelines" entries

Integrate, catalog, and preserve metadata

Dr. Clare Bernard, former particle physicist at CERN, on solutions for discovering, organizing, and visualizing enterprise data.

by Shannon Cutt | @ShannonCutt | September 24, 2015

During a special edition of The O’Reilly Podcast, host and O’Reilly chief data scientist Ben Lorica interviewed Dr. Clare Bernard, a former particle physicist at CERN, who worked on the ATLAS experiment at the Large Hadron Collider. Bernard is now a field engineer at Tamr, where she’s involved in a new project that aims to integrate and catalog a variety of data across an enterprise, while preserving metadata.

Key takeaways from their chat:

A lot of companies have big top-down master data management projects, and they put in place a lot of data-governance tools, which typically don’t scale very well.
It’s really important to track where the data came from, what the fields mean, and what transformations have been applied to it over time, so that you really understand the data when you use it for your analytics.
Tracking metadata allows you to reproduce your data pipelines and understand the lineage and provenance of your data.
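
As a rough illustration of the last two takeaways, here is a minimal Python sketch of a dataset that carries its own metadata: where it came from, what its fields mean, and which transformations have been applied. It is not taken from the interview and does not represent Tamr's product; the `TrackedDataset` class, the `crm_export.csv` source name, and the sample fields are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class TrackedDataset:
    """Rows plus the metadata needed to explain where they came from."""
    rows: List[dict]
    source: str                          # where the data came from
    field_descriptions: Dict[str, str]   # what each field means
    lineage: List[str] = field(default_factory=list)  # transformations, in order

    def apply(self, name: str, fn: Callable[[dict], dict]) -> "TrackedDataset":
        """Return a new dataset with fn applied to each row and the step recorded."""
        return TrackedDataset(
            rows=[fn(dict(row)) for row in self.rows],  # copy rows; keep the input intact
            source=self.source,
            field_descriptions=dict(self.field_descriptions),
            lineage=self.lineage + [name],
        )


def strip_name(row: dict) -> dict:
    row["name"] = row["name"].strip()
    return row


def revenue_to_int(row: dict) -> dict:
    row["revenue_usd"] = int(row["revenue_usd"])
    return row


# Hypothetical sample data, made up for this sketch.
raw = TrackedDataset(
    rows=[{"name": " Ada ", "revenue_usd": "1200"}],
    source="crm_export.csv",
    field_descriptions={
        "name": "customer name",
        "revenue_usd": "annual revenue in US dollars",
    },
)

clean = raw.apply("strip_name", strip_name).apply("revenue_to_int", revenue_to_int)
print(clean.source, clean.lineage, clean.rows)
```

Because each step is recorded by name and the input is never modified, the pipeline can be replayed from `raw`, and `clean.lineage` shows how the final rows were derived.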

Ben Lorica: Let’s start with a little bit about your background. You are a scientist by training, right?

Clare Bernard: Yes. I was a particle physicist. I worked at CERN for a couple years and worked on the ATLAS experiment at the Large Hadron Collider. Then I got my Ph.D. and graduated in May. I’ve been working at Tamr since then, as a field engineer. Read more…

Three best practices for building successful data pipelines

Reproducibility, consistency, and productionizability let data scientists focus on the science.

by Michael Li | @tianhuil | September 15, 2015

Building a good data pipeline can be technically tricky. As a data scientist who has worked at Foursquare and Google, I can honestly say that one of our biggest headaches was locking down our Extract, Transform, and Load (ETL) process.

At The Data Incubator, our team has trained more than 100 talented Ph.D. data science fellows who are now data scientists at a wide range of companies, including Capital One, the New York Times, AIG, and Palantir. We commonly hear from Data Incubator alumni and hiring managers that implementing their own ETL pipelines is also one of their biggest challenges.

Drawing on their experiences and my own, I’ve identified three key areas that are often overlooked in data pipelines. These are making your analysis:

Reproducible
Consistent
Productionizable

While these areas alone cannot guarantee good data science, getting these three technical aspects of your data pipeline right helps ensure that your data and research results are both reliable and useful to an organization. Read more…