The problem of managing schemas

Schemas will inevitably change — Apache Avro offers an elegant solution.


When a team first starts to consider using Hadoop for data storage and processing, one of the first questions that comes up is: which file format should we use?

This is a reasonable question. HDFS, Hadoop’s data storage, is different from relational databases in that it does not impose any data format or schema. You can write any type of file to HDFS, and it’s up to you to process it later.

The usual first choice of file formats is either comma delimited text files, since these are easy to dump from many databases, or JSON format, often used for event data or data arriving from a REST API.

There are many benefits to this approach — text files are readable by humans and therefore easy to debug and troubleshoot. In addition, it is very easy to generate them from existing data sources and all applications in the Hadoop ecosystem will be able to process them.

But there are also significant drawbacks to this approach, and often these drawbacks only become apparent over time, when it can be challenging to modify the file formats across the entire system.

Part of the problem is performance — text formats have to be parsed every time they are processed. Data is typically written once but processed many times; text formats add a significant overhead to every data query or analysis.

But the worst problem by far is the fact that with CSV and JSON data, the data has a schema, but the schema isn’t stored with the data. For example, CSV files have columns, and those columns have meaning. They represent IDs, names, phone numbers, etc. Each of these columns also has a data type: they can represent integers, strings, or dates. There are also some constraints involved — you can dictate that some of those columns contain unique values or that others will never contain nulls. All this information exists in the heads of the people managing the data, but it doesn’t exist in the data itself.

The people who work with the data don’t just know about the schema; they need to use this knowledge when processing and analyzing the data. So the schema we never admitted to having is now coded in Python and Pig, Java and R, and every other application or script written to access the data.

And eventually, the schema changes. Someone refactors the code generating the JSON and moves fields around, perhaps renaming a few fields. A DBA adds new columns to a MySQL table, and the change shows up in the CSVs dumped from that table. Now all those applications and scripts must be modified to handle both file formats. And since schema changes happen frequently, and often without warning, the result is both ugly, unmaintainable code and grumpy developers who are tired of modifying their scripts again and again.
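To make the failure mode concrete, here is a minimal sketch (the column names and dumps are hypothetical, not from any real system) of a script that hard-codes column positions, exactly the kind of implicit schema described above. When a new column is added to the dump, the script keeps running but silently returns the wrong field:

```python
import csv
import io

# Hypothetical "users" dump: the script assumes id, name, signup_date
# in fixed positions -- the schema lives only in this code.
OLD_DUMP = "1,Alice,2014-01-05\n2,Bob,2014-02-11\n"

# After a DBA adds an email column, the same positions change meaning.
NEW_DUMP = "1,Alice,alice@example.com,2014-01-05\n"

def signup_dates(dump):
    # Position 2 is "the signup date" -- until the schema changes.
    return [row[2] for row in csv.reader(io.StringIO(dump))]

print(signup_dates(OLD_DUMP))  # ['2014-01-05', '2014-02-11']
print(signup_dates(NEW_DUMP))  # ['alice@example.com'] -- silently wrong
```

Note that nothing crashes: the second call simply produces bad data, which is why these breakages are often discovered long after the schema change.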

There is a better way of doing things.

Apache Avro is a data serialization project that provides schemas with rich data structures, compressible file formats, and simple integration with many programming languages. The integration even supports code generation — using the schema to automatically generate classes that can read and write Avro data.
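For readers who haven’t seen one, an Avro schema is itself a JSON document. A sketch of a schema for the hypothetical user records discussed above might look like this (field names and namespace are illustrative only):

```json
{
  "type": "record",
  "name": "User",
  "namespace": "com.example",
  "fields": [
    {"name": "id",          "type": "long"},
    {"name": "name",        "type": "string"},
    {"name": "signup_date", "type": "string"},
    {"name": "email",       "type": ["null", "string"], "default": null}
  ]
}
```

The `["null", "string"]` union marks `email` as optional, and the `default` is what makes it safe to add the field later without breaking readers of older files.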

Since the schema is stored in the file, programs don’t need to know about the schema in order to process the data. Humans who encounter the file can also easily extract the schema and better understand the data they have.

When the schema inevitably changes, Avro uses schema evolution rules to make it easy to interact with files written using both older and newer versions of the schema — default values get substituted for missing fields, unexpected fields are ignored until they are needed, and data processing can proceed uninterrupted through upgrades. When starting a data analysis project, most developers don’t think about how they’ll be able to handle gradual application upgrades through a large organization. The ability to independently upgrade the applications that are writing the data and the applications reading the data makes development and deployment significantly easier.
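The resolution rules described above can be illustrated with a toy sketch. This is not the real Avro decoder, just a simplified model of the two rules that matter here: missing fields take the reader schema’s default, and fields the reader doesn’t know about are ignored (names and defaults are hypothetical):

```python
# Toy model of Avro-style schema resolution: the reader's schema
# decides what comes out, regardless of which writer produced the record.
READER_SCHEMA = {
    # field name -> default value (None here means "required, no default")
    "id": None,
    "name": None,
    "email": "unknown@example.com",  # field added later, with a default
}

def resolve(record, reader_schema):
    out = {}
    for field, default in reader_schema.items():
        if field in record:
            out[field] = record[field]
        elif default is not None:
            out[field] = default  # missing in the data: substitute default
        else:
            raise ValueError(f"required field {field!r} missing")
    # Any extra fields in `record` (e.g. written by a newer producer)
    # are simply never copied, i.e. ignored.
    return out

old_record = {"id": 1, "name": "Alice"}                   # old writer
new_record = {"id": 2, "name": "Bob", "phone": "555-01"}  # newer writer
print(resolve(old_record, READER_SCHEMA))
print(resolve(new_record, READER_SCHEMA))
```

Both records resolve cleanly against the same reader schema, which is what lets writers and readers be upgraded independently.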

The problem of managing schemas across diverse teams in a large organization was mostly solved when a single relational database contained all the data and enforced the schema on all users. These days, data is not nearly as unified — it moves between many different data stores, structured, unstructured or semi-structured. Avro is a very versatile and convenient way of bringing order to chaos. Avro formatted data can be stored in files, in unstructured stores like HBase or Cassandra, and can be sent through messaging systems like Kafka. All the while, applications can use the same schemas to read the data, process it, and analyze it — regardless of where and how it is stored.

Decisions made early in the project can come back to bite later. Hadoop offers a rich ecosystem of tools and solutions to choose from, making the decision process more challenging than it was back when data was always stored and processed in relational databases. File formats are no exception — there are probably 10 different file types supported across the Hadoop ecosystem. Some of the formats are easy for beginners to use; some offer special performance optimizations for specific use cases. But for general-purpose data storage and processing, I always tell my customers: just use Avro.

Gwen Shapira will talk more about architectural considerations for Hadoop applications at Strata + Hadoop World Barcelona. For more information and to register, visit the Strata + Hadoop World website.

Cropped image on article and category pages by foam on Flickr, used under a Creative Commons license.

This post is part of our ongoing investigation into the evolving, maturing marketplace of big data components.



  • BMGM

    Why not store the metadata with the data — a self-describing data file? Sure, saving the metadata adds a little to the file size, but the trade-off is worth it for safety. For applications where misinterpreting a field variable can be catastrophic, some communities have come together to share data standards and open-source tools, e.g. NetCDF (common data format).

    • That is basically how Avro works.

  • TatuSaloranta

    Minor nitpick: the statement “text formats have to be parsed every time they are processed” makes no sense, since Avro, like any other data format, also has to be “parsed” (that is, decoded) — it does not exist in process memory space, and data needs to be extracted from it, not unlike what needs to be done with textual formats. It may be more efficient to do that, but the step itself is not missing.

    • Avro has to be de-serialized, but not parsed. It’s much faster to count bytes than to process text, especially when the text represents numbers and dates. This is so true that a benchmark was considered invalid when it compared Hive 0.10 + text with Hive 0.12 + RCFile (not Avro, but the same concept) – it’s obvious that text will be slower, so it proves nothing about Hive.

      • TatuSaloranta

        I disagree on the choice of terminology here, but the main point is just that both have to be processed to extract data. The article seemed to indicate that a whole step was somehow missing — this is simply not true, regardless of what term is used. “Parsing,” as a term, is vague and often misused, so whether it should be applied to textual formats like JSON is questionable as well (it is 90% tokenization, and only the matching of open/close markers is really parsing).

  • geek42

    1. I don’t think people could know nothing at all about a schema change while processing the data.

    2. For many small files, it doesn’t look like a good solution, since storing the schema in every file would cost too much storage space. And that’s not the only problem: when you want to change the schema, you need to change every data file.

    • Many small files are a problem in general, but Avro doesn’t make it worse.

      You don’t need to change the schema in the files at all! That’s the Avro magic I’m recommending.
      You can use a new schema to read existing files that have an older one, and this will work since Avro can evolve schemas for you.

      • geek42

        What I mean is the case where the old and new schemas have the same field name but different data types.

        Of course, you could use the last-insert time to detect whether to use the old or the new schema, but I don’t think that is a good solution.

        Also, I agree that plain text files, which need to be parsed every time, are a nightmare.

  • Jeremy

    How does Avro differ from Protocol Buffers or Thrift for this sort of application?