Strata Week: Harvard Library releases big data for its books

Here’s what caught my attention in the data space this week.

Harvard Library’s metadata

Harvard University announced this week that it would make more than 12 million catalog records from its 73 libraries publicly available. These records contain bibliographic information about books, manuscripts, maps, videos, and audio recordings. The Harvard Library is making these records available under a Creative Commons 0 license, in accordance with its Open Metadata Policy.

The records will be available for download from Harvard and via an API from the Digital Public Library of America (DPLA), an initiative that aims to build an online national public library. The records released by Harvard are in the MARC21 format and include information that describes the various works: author, title, publisher, date, and subject headings.
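For developers curious what working with these records might look like, here is a minimal sketch in Python. It assumes the download uses the standard binary MARC21 exchange layout (a 24-byte leader, a directory of 12-byte entries, then the field data), which the announcement does not spell out. The helper names `build_record` and `parse_record` are hypothetical, and the sample record is made up for illustration rather than taken from Harvard's data. The tags used (100 for author, 245 for title, 260 for publisher and date, 650 for subject) are the usual MARC21 bibliographic tags for the descriptive fields mentioned above.

```python
FIELD_END = b"\x1e"   # field terminator
RECORD_END = b"\x1d"  # record terminator
SUBFIELD = b"\x1f"    # subfield delimiter


def build_record(fields):
    """Build a toy MARC21 record from (tag, indicators, [(code, value), ...]) tuples."""
    directory = b""
    body = b""
    for tag, indicators, subfields in fields:
        data = indicators.encode("ascii")
        for code, value in subfields:
            data += SUBFIELD + code.encode("ascii") + value.encode("utf-8")
        data += FIELD_END
        # Directory entry: 3-char tag, 4-digit field length, 5-digit start offset.
        directory += f"{tag}{len(data):04d}{len(body):05d}".encode("ascii")
        body += data
    base = 24 + len(directory) + 1      # where field data begins
    total = base + len(body) + 1        # full record length, including terminator
    leader = f"{total:05d}nam a22{base:05d} a 4500".encode("ascii")
    return leader + directory + FIELD_END + body + RECORD_END


def parse_record(raw):
    """Return {tag: [ {subfield_code: value, ...}, ... ]} for one record."""
    leader = raw[:24].decode("ascii")
    base = int(leader[12:17])           # base address of data, from the leader
    directory = raw[24:base - 1]        # skip the directory's own terminator
    fields = {}
    for i in range(0, len(directory), 12):
        entry = directory[i:i + 12].decode("ascii")
        tag, length, start = entry[:3], int(entry[3:7]), int(entry[7:12])
        data = raw[base + start:base + start + length - 1]  # drop field terminator
        subfields = {}
        # The first chunk holds the two indicator characters; the rest are subfields.
        for chunk in data.split(SUBFIELD)[1:]:
            subfields.setdefault(chr(chunk[0]), chunk[1:].decode("utf-8"))
        fields.setdefault(tag, []).append(subfields)
    return fields


# Illustrative sample only, not an actual Harvard catalog record.
sample = build_record([
    ("100", "1 ", [("a", "Melville, Herman,")]),
    ("245", "10", [("a", "Moby Dick /")]),
    ("260", "  ", [("b", "Harper,"), ("c", "1851.")]),
    ("650", " 0", [("a", "Whaling")]),
])
record = parse_record(sample)
print(record["100"][0]["a"], "|", record["245"][0]["a"], "|", record["260"][0]["c"])
```

In practice, an existing MARC library would be a better choice than a hand-rolled parser; the point here is only to show how a record's attributes are laid out and addressed by tag and subfield code.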

“This is Big Data for books,” David Weinberger, co-director of Harvard’s Library Lab, told The New York Times’ Quentin Hardy. “There might be 100 different attributes for a single object.”

The hope is that by making the metadata openly available, other libraries will follow suit and developers will be able to build new applications. “By instituting a policy of open metadata, the Harvard Library has expressed its appreciation for the great potential that library metadata has for innovative uses,” said Stuart Shieber, Library Board member and professor of computer science at Harvard, in the press release.

CDH4

Cloudera has released the latest beta version of its Hadoop distribution: CDH4. It offers upgrades to Flume, Sqoop, Hue, Oozie and Whirr, and support for new versions of Red Hat, CentOS, SUSE, Ubuntu and Debian.

Cloudera says CDH4 has a great many enhancements over CDH3, including better availability, utilization, extensibility and security. The new version also contains a “significantly redesigned MapReduce.” However, Cloudera says it plans to support both generations of MapReduce for the life of CDH4.

A big data IPO

The “operational intelligence” company Splunk had its IPO this past week. As Forbes writer Josh Bersin noted, the initial offering was hot, coming in with “a valuation at 28X revenue ($3.2 billion). This valuation trumps the hot companies in social networking: Jive trades at 20X revenue, Google trades at 5X revenue, and Facebook, well we’ll see.” Bersin argues that “big data” is “big news” and “big business,” and he draws several lessons from the IPO and the market’s response for HR and talent management, including the observation that “most businesses today have plenty of data with which to make decisions.”

Got data news?

Feel free to email me.

Photo: Harvard College Library bookplate with withdrawal stamp by kladcat, on Flickr
