Data Transformation

By Joe Hellerstein and Jeff Heer

As data processing has become more sophisticated, there has been little progress on improving the most time-consuming and tedious parts of the pipeline: Data Transformation tasks including discovery, structuring, and content cleaning . In standard practice, this kind of “data wrangling” requires writing idiosyncratic scripts in programming languages such as Python or R, or extensive manual editing using interactive tools such as Microsoft Excel. The result has two significantly negative outcomes. First, people with highly specialized skills (e.g., statistics, molecular biology, micro-economics) spend far more time in tedious data wrangling tasks than they do in exercising their specialty. Second, less technical users are often unable to wrangle their own data. The result in both cases is that significant data is often left unused due to the hurdle of transforming it into shape. Sadly, when it comes to standard practice in modern data analysis, “the tedium is the message.” In our upcoming tutorial at Strata, we will survey both sources and solutions to the problems of Data Transformation.

Analysts must regularly transform data to make it palatable to databases, statistics packages, and visualization tools. Data sets also regularly contain missing, extreme, duplicate or erroneous values that can undermine the results of analysis. These anomalies come from various sources, including human data entry error, inconsistencies between integrated data sets, and sensor interference. Our own interviews with data analysts have found that these types of transforms constitute the most tedious component of their analytic process. Flawed analyses due to dirty data are estimated to cost billions of dollars each year. Discovering and correcting data quality issues can also be costly: some estimate cleaning dirty data to account for 80 percent of the cost of data warehousing projects.

Part of the problem is that reformatting and validating data requires transforms that can be difficult to specify and evaluate. For instance, analysts often split data into meaningful records and attributes — or validate fields such as dates and addresses — using complex regular expressions that are error-prone and tedious to interpret. Converting coded values, such as mapping FIPS codes to U.S. state names, requires integrating data from one or more external tables. As data sets grow in size and variety, discovering data quality issues may be as difficult as correcting them.

Misformatted data is rampant: data exists in a variety of encodings, yet common data tools often expect rigid adherence to specific formats or protocols. Fortunately, many recurring formatting challenges can be addressed with a few complementary transformation actions. We will discuss how a small set of operations can be used to transform data in big ways. We will also review techniques for interactive transformation, including human-guided inference techniques for learning data formats on the fly. With just a few well-chosen examples, automated systems can make data transformation far more efficient, accurate, and satisfying.

Other common issues include extreme, erroneous, or inconsistent data values. Extreme values such as outliers can undermine robust analysis. Outliers may reside along one or multiple dimensions, or may be specific to a particular data type or domain. Erroneous data can arise because of error during data entry, measurement, or distillation. Obviously, analysis of incorrect data can lead to incorrect conclusions. Inconsistent data refers to variable encodings of the same value. Examples include variations in spelling or formatting, measurement units, or coding schemes (e.g., names vs. abbreviations). We will review both automated and interactive approaches for detecting — and where possible, correcting — these forms of anomalous data.

Despite years of computer science research on these problems, dealing with dirty data has resisted automation. Accordingly, we will give equal attention to visualization and interaction techniques that help informed users better understand and transform their data. Our tutorial will cover methods for assessing and transforming dirty data, and we will demonstrate how these ideas come together in the Trifacta product. We’ll examine how a novel balance of visual interfaces, algorithmic support and scalable processing can radically accelerate data transformation. Participants should expect to learn new insights and approaches for making their data more useful and actionable.

O’Reilly Strata Conference — Strata brings together the leading minds in data science and big data — decision makers and practitioners driving the future of their businesses and technologies. Get the skills, tools, and strategies you need to make data work.

Strata in Santa Clara: February 11-13 | Santa Clara, CA

Data Transformation

Skills of the Agile Data Wrangler

Get the O’Reilly Data Newsletter