"file formats" entries

The problem of managing schemas

Schemas will inevitably change, and Apache Avro offers an elegant solution.
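Avro addresses schema change by defining schemas in JSON and storing them alongside the data. A minimal sketch of such a schema (the record and field names here are illustrative, not from the article): the `default` on a newly added field is what allows readers using the new schema to still process data written under the old one.

```json
{
  "type": "record",
  "name": "User",
  "namespace": "com.example",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "email", "type": "string"},
    {"name": "signup_source", "type": ["null", "string"], "default": null}
  ]
}
```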


When a team first starts to consider using Hadoop for data storage and processing, one of the first questions that comes up is: which file format should we use?

This is a reasonable question. HDFS, Hadoop’s data storage, is different from relational databases in that it does not impose any data format or schema. You can write any type of file to HDFS, and it’s up to you to process it later.

The usual first choice of file format is either comma-delimited text files, since these are easy to dump from many databases, or JSON, often used for event data or data arriving from a REST API.

There are many benefits to this approach: text files are human-readable and therefore easy to debug and troubleshoot. In addition, they are very easy to generate from existing data sources, and every application in the Hadoop ecosystem can process them.