Integrate, catalog, and preserve metadata

During a special edition of The O’Reilly Podcast, host and O’Reilly chief data scientist Ben Lorica interviewed Dr. Clare Bernard, a former particle physicist at CERN, who worked on the ATLAS experiment at the Large Hadron Collider. Bernard is now a field engineer at Tamr, where she’s involved in a new project that aims to integrate and catalog a variety of data across an enterprise, while preserving metadata.

Key takeaways from their chat:

A lot of companies have big top-down master data management projects, and they put in place a lot of data-governance tools, which typically don’t scale very well.
It’s really important to track where the data came from, what the fields mean, and what transformations have been applied to that data over time, so that you can then use it for your analytics and you really understand what it means.
Tracking metadata allows you to reproduce your data pipelines and understand the lineage and provenance of your data.

Ben Lorica: Let’s start with a little bit about your background. You are a scientist by training, right?

Clare Bernard: Yes. I was a particle physicist. I worked at CERN for a couple years and worked on the ATLAS experiment at the Large Hadron Collider. Then I got my Ph.D. and graduated in May. I’ve been working at Tamr since then, as a field engineer.

BL: When you were a physicist, were you more on the computational side of physics?

CB: Yes, I was in a data-scientist-type role. I did help a little bit with collecting the data, but mostly I spent my time analyzing the data, measuring Standard Model physics processes, and searching for new types of particles that we might be able to discover.

BL: What kind of tools did you use in the academic world, for that type of work?

CB: Particle physicists typically don’t buy the type of software that is used in the private world; we write a lot of it ourselves. Particle physicists use a framework called ROOT, which is a statistical analysis package.

BL: Is this like a distributed framework, like the type used in big data?

CB: Yes, you can create jobs that you send off to really large computing clusters. CERN has a huge system of distributed computing, so there’s tier ones and tier twos, and you send data all around the world to do your analysis.

BL: What made you decide not to go into an academic profession?

CB: I really enjoyed my work in grad school and living at CERN, but I was very interested in figuring out what else I could do with my skills. Programming and data analysis translate well to the private world of big data.

While I was at CERN, we discovered the Higgs boson, which was really exciting to be a part of. Some of the other experiments proposed for the future are more related to precision measurements than to looking for new particles, and looking for new particles was what I was particularly interested in.

BL: Today, you’ve worked with several Fortune 500 companies as a Tamr field engineer. What are your thoughts on the state of enterprise data in most large organizations?

CB: Most large organizations are capturing huge amounts of data, and there is a tremendous amount of value in that data, but these enterprises are having a really difficult time getting at that value. They have very messy data that’s in many different systems. Then, on the other side of the enterprise, they’ve got analysts and data scientists who want to derive value from the data, but who have trouble figuring out what data is there, and figuring out how to get that data and bring it together. Understanding how to deal with data quality issues, and getting data into a usable format, have been big hurdles for many companies.

A major challenge is to embrace some of the messiness of enterprise data, and some of the variety of that data, so that we can really unlock all of the potential.

BL: What tools and systems are companies using for organizing and visualizing their data assets?

CB: There are a lot of patterns you see across enterprises. A lot of companies have big top-down master data management projects, and they put in place a lot of data-governance tools, which typically don’t scale very well.

Another challenge is too many systems, and then even within one system you can have duplicate records about something like a customer. Figuring out where all of the data is located, when you have many different systems, is a very challenging problem.

BL: How effective have data-governance tools been, in terms of managing different data sources, of variable quality?

CB: One big initiative that a lot of companies have started on is centralizing all of their data into a central repository (the data lake).

Getting the data into one place makes it a little bit easier to figure out what’s there, but if it’s not curated, it becomes even worse because now you’ve taken the data away from the places where it was collected, and you’ve lost a lot of the information about how the data was collected.

BL: Actually, this is a topic that I’ve started to pay more attention to: metadata.

CB: Yes, exactly. A lot of companies are realizing that this is an issue, and they’re trying to track all of the metadata about the data they’re putting into HDFS, into their data lake.

BL: How are they doing that?

CB: A lot of companies are trying to build metadata catalogs. There are a few metadata catalogs, and they’re specifically made for HDFS. Very few of these that are in the market have been successful, so far.

One big issue with this is that you don’t really want just the data that’s in your data lake and HDFS. You really want to be able to connect to all of the systems. You want it to be an easy adoption.

BL: Let’s set the stage for why metadata, and good metadata management, is important.

CB: Absolutely. Let’s say we have data about customers in a data lake, with fields called “Given Name,” “Last name,” and then there’s a field called, “Name.” We don’t really know where to start there. Which of all these name fields is actually the name of the customer?

In reality, what’s happened is, someone has created an ETL process that has taken this data from its original source system, has brought it into the data lake, and they’ve done some transformations on this name field, but now we’ve lost all of the information about what transformations were done and how that data has changed, and where the data was initially collected. Now, as the data scientist, it’s a little difficult to trust that data, and to know how to use it and how to get value out of it.

It’s really important to track where the data came from, what the fields mean, and what transformations have been applied to that data over time, so that you can then use it for your analytics and you really understand what it means.
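To make that concrete, here is a minimal Python sketch of recording where a field came from, what it means, and which transformations were applied to it. It is only an illustration of the idea and not how Tamr or any particular catalog works; the `FieldLineage` class, the `apply_step` helper, and the source names `crm_export` and `cust_full_nm` are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class FieldLineage:
    """Metadata for one field: what it means, where it came from, what was done to it."""
    name: str
    description: str
    source_system: str   # hypothetical identifier of the original system
    source_field: str    # hypothetical name of the field in that system
    transformations: List[str] = field(default_factory=list)


def apply_step(records: List[Dict[str, str]], lineage: FieldLineage,
               step_name: str, fn: Callable[[str], str]) -> List[Dict[str, str]]:
    """Apply fn to the tracked field in every record and log the step in the lineage."""
    out = [{**r, lineage.name: fn(r[lineage.name])} for r in records]
    lineage.transformations.append(step_name)
    return out


# Hypothetical customer rows as they might land in a data lake.
rows = [{"Name": "  ada LOVELACE "}, {"Name": "grace hopper"}]

name_lineage = FieldLineage(
    name="Name",
    description="Customer full name",
    source_system="crm_export",    # assumed name, for illustration only
    source_field="cust_full_nm",   # assumed name, for illustration only
)

# Each transformation is applied and recorded in the same call.
rows = apply_step(rows, name_lineage, "strip whitespace", str.strip)
rows = apply_step(rows, name_lineage, "title-case", str.title)

print(rows)
print(name_lineage)
```

Because every transformation passes through `apply_step`, the ordered list in `name_lineage.transformations` stays in sync with the data, so a later reader can see what the "Name" field means and how it was derived.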

BL: This allows you to reproduce your data pipelines and understand the lineage and provenance of your data.

CB: Yes. You really want to be able to leverage the other work that other people in the enterprise have been doing. You may not have just one data scientist, you may have hundreds of data scientists. If one data scientist is working on a project to clean up one data set and apply some useful transformations to it, then you really want all of your other data scientists to be able to take advantage of that work.

BL: What you’re saying is, not only am I exposing it to the other data scientists in my company, I’m also exposing it in a way that they can understand what I’ve done to the data.

CB: Exactly, because not all of your data scientists are doing the exact same project. They don’t necessarily want the exact same transformations. They just want to be able to see what has been done by everyone else and leverage that work if it applies to their particular project.

BL: What are the tools for making that happen? Generally, what would the tools look like that explain to my colleagues what I have done to the data?

CB: I think there are a lot of tools that address parts of this issue. This catalog, as I’m describing it, doesn’t really exist. One of the reasons we’ve gotten interested in this problem is because this is something I see repeatedly when I go to customer sites — it takes them a really long time to find the data to get started.

Catalogs that go across systems and that capture all of the metadata, including comments and collaborative features — I don’t think that something like that exists in enterprise today.

BL: There’s an open source academic project right now that’s just starting, by Joe Hellerstein, one of the co-founders of Trifacta, that attempts to do this; it’s something you can share across different frameworks.

Obviously, you folks at Tamr are thinking about this issue, and other folks are thinking about it, so it must be because companies and enterprises are struggling with this or asking for this.

CB: Yes, definitely. One of the things that happens a lot in meetings is that we’ll show one of the initial screens of the Tamr Connect product, which just has a list of the sources that you’ve connected to the system, and sometimes we get a very emotional reaction, where people say something like: “Oh, this is what I want. I just want a list of the sources that I have.”

My initial reaction was, “Wait, really?” This is not supposed to be the screen that you get excited about. This is just a screen that says, “This is the data that we’re working with.”

That’s a problem that people are increasingly focused on, especially as more and more people have initiatives where they’re trying to bring data into the data lake, and they’re realizing that as they do that, they’re losing a lot of context.

BL: What is Tamr’s free Catalog software, and how does it solve problems for discovering, organizing, and visualizing data?

CB: Customers often ask us: how do we integrate our data sets and make them into the type of data set that a data scientist would want to use? The Catalog product is a very lightweight, easy-to-use web application that is focused on the problem of discovering, organizing, and visualizing the data in the enterprise.

The organization can share Catalog as a centralized repository of metadata, and then associate human knowledge with that data. It’s focused on connecting to as many systems as possible, and then capturing collaborative insights.

BL: Is machine learning part of the catalog software?

CB: The Catalog software is more focused on profiling, and right now doesn’t have a whole lot of machine learning. Right now, Catalog is about connecting to a lot of systems, giving you a list of sources, and then helping you get value out of that list of sources, so you can figure out which attributes can be associated with a particular entity.
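As a rough illustration of what "profiling" a list of sources can mean, the Python sketch below builds a tiny in-memory catalog from two CSV sources and records the row count and per-attribute statistics for each. It is not the Catalog product's implementation or API; the `profile_source` function, the file names, and the sample data are assumptions made up for this example.

```python
import csv
import io
from typing import Dict


def profile_source(name: str, csv_text: str) -> Dict[str, object]:
    """Return basic profile metadata for one CSV source: rows and per-column stats."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    columns = reader.fieldnames or []
    attributes = {}
    for col in columns:
        # Count only non-empty values so that sparse columns stand out.
        values = [r[col] for r in rows if r[col]]
        attributes[col] = {"non_empty": len(values), "distinct": len(set(values))}
    return {"source": name, "row_count": len(rows), "attributes": attributes}


# Hypothetical sources; in practice these would be files or database tables.
sources = {
    "crm.csv": "Given Name,Last name\nAda,Lovelace\nGrace,Hopper\n",
    "billing.csv": "Name,Balance\nAda Lovelace,10\nAda Lovelace,\n",
}

catalog = [profile_source(name, text) for name, text in sources.items()]

for entry in catalog:
    print(entry["source"], entry["row_count"], sorted(entry["attributes"]))
```

Even a listing this simple shows which sources carry name-like attributes, which is the first step toward deciding which attributes belong to the same entity.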

The vision going forward is that it will be more connected with Tamr’s Connect product — there will be good integration between the two, and there will certainly be machine learning components in both.

BL: Have you gone to some of your customers and talked to them about Catalog? If so, what kind of response are you seeing?

CB: Yes, we actually already have 550 registrants and over 300 users from a wide variety of organizations. We’ve been working really closely with some of these Alpha customers to make sure that what we’re building is solving their problems. We’ve had a lot of really good feedback about being able to increase data awareness and the level of organization of data in the enterprise.

BL: What do companies have in existence right now to solve the problems we’ve been discussing?

CB: A lot of the tools that are being used to solve these problems are bigger, top-down, rule-based data-governance approaches that are maintained by a small number of administrators and start to break down at scale. The core Tamr difference is that Catalog can connect to any system. It’s free, so we’re really interested in driving adoption across the enterprise.

Catalog is really only part of the solution. We start with Catalog, and then once you see where all the data is across your enterprise, and you see how this data relates to the projects that you want to do, then you can use the Tamr Connect product to connect those data sources and produce those data sets that are useful for your analytics.

BL: Data sources could run the gamut of different types of systems, right?

CB: Yes, you can download Tamr Catalog on your laptop to try it out, and connect to any XML, CSV, and Excel files that are on your laptop. You can connect to various databases, as well. The vision of the product is to connect to everything. We want to integrate with everything across the enterprise and really embrace that variety.

You can download Tamr’s Catalog product for free here, and you can listen to the complete podcast episode in the player below.

This post is a collaboration between O’Reilly and Tamr. See our statement of editorial independence.

Cropped image on article and category pages via Nevit Dilmen on Wikimedia Commons.
