Marshal your data with entity resolution

Analytics can make combining or comparing data faster and less painful.

Entity resolution refers to processes that businesses and other organizations perform all the time in order to produce full reports on people, organizations, or events. Entity resolution can be used, for instance, to:

  • Combine your customer data with a list purchased from a data broker. Identical data may be in columns with different names, such as “last” and “surname.” Connecting columns from different databases is a common extract, transform, and load (ETL) task.
  • Extract values from one database and match them against one or more columns in another. For instance, if you get a party list, you might want to find your clients among the attendees. A police detective might want to extract the names of people involved in a crime report and see whether any suspects are among them.
  • Find a match in dirty data, such as a person whose name is spelled differently in different rows.
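The first two tasks can be sketched in a few lines of Python. This is only an illustration, not how any particular product works: the alias table, the field names, and the sample records are all made up for the example.

```python
# Hypothetical alias table: maps column names seen in different sources
# to one shared name. A real mapping would come from inspecting the data.
COLUMN_ALIASES = {
    "last": "surname",
    "lastname": "surname",
    "first": "given_name",
    "firstname": "given_name",
}

def unify_columns(record):
    """Rename known alias columns so records from different sources line up."""
    return {COLUMN_ALIASES.get(k.lower(), k.lower()): v for k, v in record.items()}

def find_known_people(attendees, clients, keys=("surname", "given_name")):
    """Return attendees whose key fields equal a client's, ignoring case and outer spaces."""
    def key_of(record):
        unified = unify_columns(record)
        return tuple(str(unified.get(k, "")).strip().lower() for k in keys)

    client_keys = {key_of(c) for c in clients}
    return [a for a in attendees if key_of(a) in client_keys]

# Made-up sample data: a client list and a party list with different column names.
clients = [{"Last": "Marquez", "First": "Marcia"}, {"Last": "Wong", "First": "Lee"}]
attendees = [
    {"surname": "marquez ", "given_name": "Marcia"},
    {"surname": "Silva", "given_name": "Ana"},
]
matches = find_known_people(attendees, clients)
```

Exact matching on normalized keys like this only handles the easy cases. It does nothing for the third task, misspelled or otherwise dirty values, which is where fuzzier comparison comes in.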

Dirty, inconsistent, or unstructured data is the chief challenge in entity resolution. Jenn Reed, director of product management for Novetta Entity Analytics, points out that it’s easy for two numbers to get switched, such as a person’s driver’s license and social security numbers. Over time, sophisticated rules have been created to compare data, and it often requires the comparison of several fields to make sure a match is correct. (For instance, health information exchanges use up to 17 different types of data to make sure the Marcia Marquez who just got admitted to the ER is the same Marcia Marquez who visited her doctor last week.)
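To make the idea of comparing several fields concrete, here is a minimal sketch that counts how many fields agree and requires a minimum number before calling two records the same entity. The four fields, the threshold, and the records are invented for illustration; real systems use many more fields and more sophisticated rules.

```python
def agreeing_fields(a, b, fields):
    """List the fields on which both records carry the same non-empty value."""
    return [f for f in fields if a.get(f) and a.get(f) == b.get(f)]

def same_entity(a, b, fields, min_agree):
    """Treat two records as one entity only if enough fields agree."""
    return len(agreeing_fields(a, b, fields)) >= min_agree

# Hypothetical fields and made-up records.
FIELDS = ["name", "birth_date", "phone", "postal_code"]
er_visit = {"name": "marcia marquez", "birth_date": "1980-02-14",
            "phone": "5550100", "postal_code": "95814"}
clinic_visit = {"name": "marcia marquez", "birth_date": "1980-02-14",
                "phone": "", "postal_code": "95814"}
namesake = {"name": "marcia marquez", "birth_date": "1971-09-30",
            "phone": "5550199", "postal_code": "94110"}

is_same_patient = same_entity(er_visit, clinic_visit, FIELDS, min_agree=3)
is_namesake_match = same_entity(er_visit, namesake, FIELDS, min_agree=3)
```

A missing phone number does not count as agreement here, so the clinic record matches on three fields while the namesake matches on the name alone and is rejected.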

Most data users don’t know all these rules for wrangling the data, and would have trouble coding them up. However, there are tools that can apply rules to a wide range of data, and let the user adjust the way the rules are applied. The first step might just be to ask the user what kind of data is in a field. Does the “person” column of a database contain names, arbitrary IDs, or email addresses (common on websites that require people to log in with an email address)? Each type of data is cleaned in a different way.
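One simple way to picture type-specific cleaning is a table of cleaning functions keyed by the field type the user selects. The three types and the rules below are assumptions chosen for the sketch (lowercasing a whole email address, for example, is a common simplification rather than a universal rule).

```python
import re

def clean_name(value):
    """Drop stray punctuation, collapse whitespace, and fold case."""
    without_punct = re.sub(r"[^\w\s'-]", " ", value)
    return " ".join(without_punct.split()).casefold()

def clean_email(value):
    """Trim and lowercase (a simplification; assumed good enough for matching)."""
    return value.strip().lower()

def clean_id(value):
    """Keep only letters and digits so '12-34' and '12 34' compare equal."""
    return re.sub(r"[^0-9A-Za-z]", "", value).upper()

# Hypothetical field types a user might pick for a column.
CLEANERS = {"name": clean_name, "email": clean_email, "id": clean_id}

def clean(value, field_type):
    """Apply the cleaner registered for the declared field type."""
    return CLEANERS[field_type](value)

cleaned_name = clean("  Marcia   MARQUEZ ", "name")
cleaned_email = clean(" M.Marquez@Example.COM ", "email")
cleaned_id = clean("ab-123 456", "id")
```

Keeping the rules in a lookup table means a new field type only needs a new function and one more entry.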

Rules for cleaning names, addresses, etc., are not iron-clad. With an entity analytics tool, a sliding scale can be used to tune the amount of fuzziness users wish to allow. This allows the user to determine, for example, whether they’d prefer more false positives or more false negatives, or whether it’s necessary to adjust the formulas for data that is more reliable or less reliable. It can also be helpful in allowing a user to indicate how common a value is. For instance, in some data sets, someone living in Sacramento, California, might be unusual enough to enable the software to identify two people in different data sets as the same person. However, if there are a huge number of people in Northern California in a data set, Sacramento might be such a common location that relying on it to make a match is not safe.
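The sliding scale can be pictured as a similarity threshold. The sketch below uses `SequenceMatcher` from Python’s standard `difflib` module as a stand-in similarity measure; commercial tools are likely to use other measures, and the names and thresholds are made up.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Similarity between 0.0 and 1.0, ignoring case and outer spaces."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()

def fuzzy_matches(name, candidates, threshold):
    """Return the candidates whose similarity to name is at least threshold."""
    return [c for c in candidates if similarity(name, c) >= threshold]

# Made-up candidate names from a second data set.
candidates = ["Marcia Marques", "Marsha Marquez", "Lee Wong"]

# A looser setting accepts more near-misses (more false positives);
# a stricter one rejects them (more false negatives).
loose = fuzzy_matches("Marcia Marquez", candidates, threshold=0.8)
strict = fuzzy_matches("Marcia Marquez", candidates, threshold=0.95)
```

Moving the threshold up or down is the whole tuning knob in this sketch: the same comparison code produces different trade-offs between false positives and false negatives.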

Analytics engines can also learn from the data they’re analyzing and refine the rules to apply to a particular data set — Bayesian inference is one famous type of refinement. Analytics can figure out what values are common, what types of errors are common (such as the switched letters in the misspelled word “teh”), and how the data should be distributed.
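As a rough illustration of how value frequencies learned from the data can feed a Bayesian update, the sketch below applies Bayes’ rule to a single agreeing field. It is a simplified textbook-style calculation, not a description of any vendor’s engine. The prior, the chance that a field agrees for a true match, and the city counts are assumptions, and the chance of agreeing by coincidence is approximated by the value’s share of the data.

```python
from collections import Counter

def value_frequencies(values):
    """Learn each value's share of the data set."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}

def posterior_match_probability(prior, agree_if_match, value_freq):
    """Bayes' rule for one field on which two records agree.

    prior          -- P(same entity) before looking at the field
    agree_if_match -- P(field agrees | same entity), an assumed constant
    value_freq     -- P(field agrees | different entities), approximated
                      here by how common the shared value is
    """
    numerator = agree_if_match * prior
    return numerator / (numerator + value_freq * (1 - prior))

# Made-up data set in which Sacramento is very common and Eureka is rare.
cities = ["Sacramento"] * 8 + ["Eureka"] * 2
freq = value_frequencies(cities)

common_city = posterior_match_probability(0.1, 0.95, freq["Sacramento"])
rare_city = posterior_match_probability(0.1, 0.95, freq["Eureka"])
```

Agreeing on the rare city raises the match probability much more than agreeing on the common one, which is the Sacramento effect described earlier expressed as arithmetic.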

Unstructured data from social media and websites provides an additional challenge. One solution is to use a tool that searches the text for entities (people, companies, locations, etc.) and enters them into HCatalog and Hadoop as structured data. The unstructured text is first pre-processed to put it into a structured form, so that it can be correlated with structured data, such as master data management (MDM) or customer relationship management (CRM) data. This pre-processing locates the attributes that form entities within the text, using annotation dictionaries and algorithms (such as proximity) to extract an entity or series of entities. You can then create an HCatalog table to represent the source of the data, with the entities inserted as records into the table. Samples of the combined data can then usually show instantly whether fields have been combined incorrectly: a social security number with a driver’s license number, for instance.
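A toy version of the pre-processing step might look like the following. It assumes a tiny hand-written annotation dictionary and a crude proximity rule (a person and a location mentioned within a fixed number of characters of each other); both are invented for the example, and the resulting rows are plain Python dictionaries rather than anything written to HCatalog.

```python
import re

# Hypothetical annotation dictionary: entity type -> known terms.
ANNOTATIONS = {
    "person": ["Marcia Marquez"],
    "location": ["Sacramento", "Eureka"],
}

def extract_entities(text, annotations):
    """Find dictionary terms in the text and record where each one starts."""
    found = []
    for entity_type, terms in annotations.items():
        for term in terms:
            pattern = r"\b" + re.escape(term) + r"\b"
            for m in re.finditer(pattern, text, flags=re.IGNORECASE):
                found.append({"type": entity_type, "value": term, "start": m.start()})
    return sorted(found, key=lambda e: e["start"])

def person_location_rows(entities, max_gap):
    """Pair each person with locations mentioned within max_gap characters."""
    people = [e for e in entities if e["type"] == "person"]
    places = [e for e in entities if e["type"] == "location"]
    return [
        {"person": person["value"], "location": place["value"]}
        for person in people
        for place in places
        if abs(person["start"] - place["start"]) <= max_gap
    ]

# Made-up snippet of unstructured text.
text = ("Marcia Marquez moved to Sacramento last year. "
        "Years earlier, after several other moves, she had lived in Eureka.")
entities = extract_entities(text, ANNOTATIONS)
rows = person_location_rows(entities, max_gap=40)
```

Each resulting row has the same shape as a record in a structured table, which is what makes it possible to correlate the extracted entities with MDM or CRM data afterward.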

For more on this topic, check out the upcoming free O’Reilly webcast, Entity Resolution on Hadoop: The Pitfalls of Building It Yourself, presented by Dave Moore, a solutions architect for commercial markets at Novetta Solutions, and Jenn Reed, director of product management at Novetta Entity Analytics.

This post was a collaboration between O’Reilly and Novetta Solutions. See our statement of editorial independence.
