Strata Week: How Facebook moved 30 petabytes of Hadoop data

Here are a few of the data stories that caught my attention this week.

Moving an elephant: How Facebook moved to a new datacenter

Migrating data to a new system is always a hassle. But when you’re Facebook, you’re dealing with data on a petabyte scale, and downtime is effectively impossible, it’s more than a hassle. It’s a huge engineering challenge. That’s what Facebook recently undertook when it migrated its Hadoop deployment to a new datacenter.

Physically moving the machines to the new datacenter wasn’t an option. Facebook’s Paul Yang described the process the team developed to replicate its 30-petabyte Hadoop cluster to the new location, noting the challenges of copying a live file system at that scale.

Once the required systems were developed, the replication approach was executed in two steps. First, a bulk copy transferred most of the data from the source cluster to the destination. Yang wrote:

Most of the directories were copied via DistCp — an application shipped with Hadoop that uses a MapReduce job to copy files in parallel. Our Hadoop engineers made code and configuration changes to handle special cases with Facebook’s dataset, including the ability for multiple mappers to copy a single large file, and for the proper handling of directories with many small files.

After the bulk copy was done, file changes after the start of the bulk copy were copied over to the destination cluster through the new replication system. File changes were detected through a custom Hive plug-in that recorded the changes to an audit log. The replication system continuously polled the audit log and copied modified files so that the destination would never be more than a couple of hours behind. The plug-in recorded Hive metadata changes as well, so that metadata modifications such as the last accessed time of Hive tables and partitions were propagated. Both the plug-in and the replication system were developed in-house by members of the Hive team.
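To make the "multiple mappers to copy a single large file" idea concrete, here is a minimal Python sketch of the general technique: split a file into byte ranges and let several workers copy one range each. It is not Facebook's DistCp change. It uses local files and a thread pool as stand-ins for HDFS and MapReduce, and the names `split_ranges`, `copy_range`, and `parallel_copy` are hypothetical.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def split_ranges(size, chunk_size):
    """Return (offset, length) pairs that together cover `size` bytes."""
    return [(off, min(chunk_size, size - off))
            for off in range(0, size, chunk_size)]


def copy_range(src, dst, offset, length):
    """Copy one byte range; each worker opens its own file handles."""
    with open(src, "rb") as fin, open(dst, "r+b") as fout:
        fin.seek(offset)
        fout.seek(offset)
        fout.write(fin.read(length))


def parallel_copy(src, dst, chunk_size=64 * 1024 * 1024, workers=4):
    """Copy `src` to `dst` with several workers, one byte range per task."""
    size = os.path.getsize(src)
    # Pre-size the destination so every worker can write at its own offset.
    with open(dst, "wb") as fout:
        fout.truncate(size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(copy_range, src, dst, off, length)
                   for off, length in split_ranges(size, chunk_size)]
        for future in futures:
            future.result()  # re-raise any error from a worker
```

The chunk size here is an arbitrary illustrative default. In a real cluster the split would more likely follow the file system's block boundaries.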

Yang said the speed of the replication system was key: it kept downtime to a minimum and let the team identify and repair corrupt files without falling behind schedule. He also noted that the system Facebook devised points to a potential Hive-based disaster-recovery system: “With replication deployed, operations could be switched over to the replica cluster with relatively little work in case of a disaster. The replication system could increase the appeal of using Hadoop for high-reliability enterprise applications.”
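The second step, polling an audit log and copying whatever changed, can be sketched in a few lines of Python. This is an illustration of the pattern only, not Facebook's replication system. It assumes a hypothetical log format of one JSON object per line with a `path` field, and it uses local directories in place of two Hadoop clusters.

```python
import json
import shutil
from pathlib import Path


def read_new_entries(audit_log, offset):
    """Return (entries, new_offset) for complete lines appended after `offset`."""
    with open(audit_log, "rb") as f:
        f.seek(offset)
        data = f.read()
    # Consume only complete lines; a partially written last line waits
    # for the next poll.
    end = data.rfind(b"\n") + 1
    entries = [json.loads(line) for line in data[:end].splitlines()
               if line.strip()]
    return entries, offset + end


def replicate_once(audit_log, src_root, dst_root, offset=0):
    """Copy every file named in new log entries; return (new_offset, copied)."""
    entries, new_offset = read_new_entries(audit_log, offset)
    # A file changed several times since the last poll is copied once.
    changed = list(dict.fromkeys(entry["path"] for entry in entries))
    copied = []
    for rel in changed:
        src = Path(src_root) / rel
        dst = Path(dst_root) / rel
        if src.exists():
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dst)  # copy2 also preserves timestamps
            copied.append(rel)
        elif dst.exists():
            dst.unlink()  # the source file was deleted, so mirror that
    return new_offset, copied
```

A caller would run `replicate_once` in a loop, sleeping between polls and passing each returned offset into the next call. How far the destination lags then depends on the polling interval and on how long each batch of copies takes.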

Ex-NASA CTO launches Nebula

At OSCON this week, former NASA CTO Chris Kemp announced his new company, Nebula, which will sell an OpenStack-based appliance that will enable any company to implement cloud computing. Nebula builds upon Open Compute, the infrastructure project that Facebook open-sourced earlier this year. Nebula shares a name with the computing project that NASA open-sourced last year as part of the initial OpenStack initiative, and the new startup aims to offer a turnkey solution to help companies implement OpenStack.

As Kemp told O’Reilly Radar’s Alex Howard in an interview:

As people face this industrial revolution of big data, they can’t use Oracle anymore. It doesn’t scale. We want to be the platform that enables that. We really believe that, if all of this stuff will achieve its potential, in being open, it will reshape the core of computing. We really think there’s this new paradigm of computing where people are building on top of infrastructure services instead of infrastructure.

In announcing the startup at OSCON, Kemp spoke of the democratizing power of Nebula, putting this big data computing power in the hands of everyone, not just large companies with massive infrastructure.

Liking and linking library data

GlueJar’s Eric Hellman continues his blog series on libraries, data, and search engines with a post on “Liking Library Data.” He offers thoughts on how to implement Facebook’s Open Graph Protocol on library sites, not just so that visitors can “like” the library’s website, but so that books can be tied to individual social graphs.

There’s a caveat, of course:

Once you put the like button javascript on a web page, Facebook can track all the users that visit that page. This goes against the traditional privacy expectations that users have of libraries. In some jurisdictions, it may even be against the law for a public library to allow a third party to track users in this way.

But Hellman argues that it’s important that library resources become more fully integrated with social networks — it’s about “connections, not just collections,” he says.
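For readers curious what Open Graph markup on a catalog page might involve, here is a minimal Python sketch that renders the protocol's four required properties (`og:type`, `og:title`, `og:url`, `og:image`) plus an optional `book:isbn` for a single book record. The function name, the record fields, and the example URLs are hypothetical, and Hellman's post may recommend a different set of properties.

```python
from html import escape


def open_graph_tags(title, url, image, isbn=None):
    """Render Open Graph <meta> tags for one book page."""
    props = [
        ("og:type", "book"),
        ("og:title", title),
        ("og:url", url),
        ("og:image", image),
    ]
    if isbn:
        props.append(("book:isbn", isbn))
    # Escape values so titles containing quotes or ampersands stay valid HTML.
    return "\n".join(
        f'<meta property="{escape(name)}" content="{escape(value)}" />'
        for name, value in props
    )


# Hypothetical catalog record, for illustration only.
print(open_graph_tags(
    title="Moby-Dick",
    url="https://library.example.org/catalog/moby-dick",
    image="https://library.example.org/covers/moby-dick.jpg",
))
```

These tags are static markup in the page's head. The tracking concern quoted above comes from the like button's JavaScript, which a library would add separately.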

Got data news?

Feel free to email me.
