The problem of managing schemas

When a team first starts to consider using Hadoop for data storage and processing, one of the first questions that comes up is: which file format should we use?

This is a reasonable question. HDFS, Hadoop’s data storage, is different from relational databases in that it does not impose any data format or schema. You can write any type of file to HDFS, and it’s up to you to process it later.

The usual first choice of file formats is either comma delimited text files, since these are easy to dump from many databases, or JSON format, often used for event data or data arriving from a REST API.

There are many benefits to this approach — text files are readable by humans and therefore easy to debug and troubleshoot. In addition, it is very easy to generate them from existing data sources and all applications in the Hadoop ecosystem will be able to process them.

But there are also significant drawbacks to this approach, and often these drawbacks only become apparent over time, when it can be challenging to modify the file formats across the entire system.

Part of the problem is performance — text formats have to be parsed every time they are processed. Data is typically written once but processed many times; text formats add a significant overhead to every data query or analysis.

But the worst problem by far is the fact that with CSV and JSON data, the data has a schema, but the schema isn’t stored with the data. For example, CSV files have columns, and those columns have meaning. They represent IDs, names, phone numbers, etc. Each of these columns also has a data type: they can represent integers, strings, or dates. There are also some constraints involved — you can dictate that some of those columns contain unique values or that others will never contain nulls. All this information exists in the head of the people managing the data, but it doesn’t exist in the data itself.

The people who work with the data don’t just know about the schema; they need to use this knowledge when processing and analyzing the data. So the schema we never admitted to having is now coded in Python and Pig, Java and R, and every other application or script written to access the data.

And eventually, the schema changes. Someone refactors the code generating the JSON and moves fields around, perhaps renaming few fields. The DBA added new columns to a MySQL table and this reflects in the CSVs dumped from the table. Now all those applications and scripts must be modified to handle both file formats. And since schema changes happen frequently, and often without warning, this results in both ugly and unmaintainable code, and in grumpy developers who are tired of having to modify their scripts again and again.

There is a better way of doing things.

Apache Avro is a data serialization project that provides schemas with rich data structures, compressible file formats, and simple integration with many programming languages. The integration even supports code generation — using the schema to automatically generate classes that can read and write Avro data.

Schema changes happen frequently, and often without warning.Since the schema is stored in the file, programs don’t need to know about the schema in order to process the data. Humans who encounter the file can also easily extract the schema and better understand the data they have.

When the schema inevitably changes, Avro uses schema evolution rules to make it easy to interact with files written using both older and newer versions of the schema — default values get substituted for missing fields, unexpected fields are ignored until they are needed, and data processing can proceed uninterrupted through upgrades. When starting a data analysis project, most developers don’t think about how they’ll be able to handle gradual application upgrades through a large organization. The ability to independently upgrade the applications that are writing the data and the applications reading the data makes development and deployment significantly easier.

The problem of managing schemas across diverse teams in a large organization was mostly solved when a single relational database contained all the data and enforced the schema on all users. These days, data is not nearly as unified — it moves between many different data stores, structured, unstructured or semi-structured. Avro is a very versatile and convenient way of bringing order to chaos. Avro formatted data can be stored in files, in unstructured stores like HBase or Cassandra, and can be sent through messaging systems like Kafka. All the while, applications can use the same schemas to read the data, process it, and analyze it — regardless of where and how it is stored.

Decisions made early in the project can come back to bite later. Hadoop offers a rich ecosystem of tools and solutions to choose from, making the decision process more challenging than it was back when data was always stored and processed in relational databases. File formats are no exception — there are probably 10 different file types that are supported through the Hadoop ecosystem. Some of the formats are easy to use by beginners, some offer special performance optimizations for specific use-cases. But for general-purpose data storage and processing, I always tell my customers: just use Avro.

Gwen Shapira will talk more about architectural considerations for Hadoop applications at Strata + Hadoop World Barcelona. For more information and to register, visit the Strata + Hadoop World website.

Cropped image on article and category pages by foam on Flickr, used under a Creative Commons license.

This post is part of our on-going investigation into the evolving, maturing marketplace of big data components.

Related:

Hadoop Application Architectures — early release book by authors Mark Grover, Ted Malaska, Jonathan Seidman, and Gwen Shapira.

The problem of managing schemas

Schemas inevitably will change — Apache Avro offers an elegant solution.

Get the O’Reilly Data Newsletter