Strata Gems: Where to find data

Starting points for data markets and open data

We’re publishing a new Strata Gem each day all the way through to December 24. Yesterday’s Gem: Quick starts for charts.

With the growth of both the open data movement and data marketplaces, there’s now a wealth of public data – some free, some for sale – that you can use in your analyses and applications. It’s not just about data dumps: increasingly you can get data through APIs, or even run your own code on servers provided by the data host.

Freebase

An icon of the open data movement, Freebase is a graph database full of “people, places and things”. The data is contributed and edited by community volunteers. Freebase was recently acquired by Google.

Freebase both names real-world entities and stores structured data about their attributes and the relations between them. For example, see the page for the movie Harry Potter and the Deathly Hallows: Part I. It looks a bit like a Wikipedia page, but you can edit and retrieve the structured data for every page.

Developers have access to a variety of Freebase services, including dumps of the entire database and API access to the data. Of particular interest is “Acre”, a hosted platform that lets you implement an application on Freebase servers, close to the data you need.

Screenshot from Freebase, showing activity in the most popular data sets

Amazon Public Data Sets

As a public service, Amazon Web Services host a variety of Public Data Sets available to users writing applications on their cloud services. By putting the data on servers next to their cloud computing platform EC2, Amazon helps avoid the difficulty of locating, cleaning and downloading data. The data never needs to travel: only your code. This is especially valuable when data sets get particularly large, or are updated frequently.

Amazon’s public data sets include annotated human genome data, a variety of data sets from the US Census Bureau, and dumps from services such as Freebase and Wikipedia.

Windows Azure Data Market

Launched publicly by Microsoft this year, the Azure Data Market offers a variety of data sets and sources, accessible by the OData protocol. OData offers uniform access to data, along with a standardized query interface. By using data from the market, a user can reduce the friction of parsing and importing data. Unsurprisingly, Microsoft’s own tools such as Excel allow importing of data directly from the marketplace’s OData endpoints.
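To make the “standardized query interface” a little more concrete, here is a minimal sketch of how an OData query URL can be put together. The `$filter`, `$top` and `$format` query options come from the OData protocol itself; the service root, the `Cities` entity set and the `Population` property are hypothetical placeholders, not real Data Market endpoints, and the `odata_url` helper is just an illustration.

```python
from urllib.parse import quote, urlencode


def odata_url(service_root, entity_set, filter_expr=None, top=None, fmt=None):
    """Build an OData query URL from a few common system query options."""
    params = {}
    if filter_expr is not None:
        params["$filter"] = filter_expr
    if top is not None:
        params["$top"] = str(top)
    if fmt is not None:
        params["$format"] = fmt
    url = service_root.rstrip("/") + "/" + entity_set
    if params:
        # safe="$" keeps the option names readable; quote() encodes spaces as %20.
        url += "?" + urlencode(params, safe="$", quote_via=quote)
    return url


# Hypothetical service root, entity set and property, for illustration only.
print(odata_url("https://example.com/odata", "Cities",
                filter_expr="Population gt 1000000", top=5, fmt="json"))
```

Because every OData source understands the same query options, the same helper would work unchanged against any other feed: only the service root and entity set differ.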

Azure Data Market contains both free and for-pay data sets, offering a route to monetization for data publishers. Free data sets include government and international agency data. As an example of for-pay data, sports data provider Stats.com offers MLB game-by-game statistics through the marketplace.

The emergence of data marketplaces offers developers a legitimate route to data previously only obtainable at high cost, or through illicit web scraping.

Yahoo! Query Language (YQL)

YQL is a technology that presents web services in a way that lets you run familiar SQL-like queries against them. SELECT, INSERT and DELETE operations can be performed against services such as Flickr.

In essence, YQL offers a technology similar to OData, providing an adapter layer that gives data consumers a uniform interface to data. Data providers can expose their data as an Open Data Table, or third parties can contribute adapter definitions, such as those for Foursquare, GitHub, and Google. The most limiting aspect of YQL is that queries must run through Yahoo’s own servers.
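As a rough illustration of the SQL-like style, the sketch below wraps a YQL statement in a request URL. Treat the specifics as assumptions: the endpoint address, the `flickr.photos.search` table name and the `text` column are written from memory rather than taken from Yahoo’s documentation, and `yql_url` is a made-up helper. Nothing is sent over the network.

```python
from urllib.parse import urlencode

# Assumed public endpoint; check Yahoo's documentation before relying on it.
YQL_ENDPOINT = "https://query.yahooapis.com/v1/public/yql"


def yql_url(statement, fmt="json"):
    """Wrap a YQL statement in a GET request URL."""
    return YQL_ENDPOINT + "?" + urlencode({"q": statement, "format": fmt})


# Assumed table and column names, for illustration only.
statement = "SELECT * FROM flickr.photos.search WHERE text = 'strata'"
print(yql_url(statement))
```

The point to notice is that the statement names a web service as if it were a database table, and that the request goes to Yahoo rather than to Flickr directly.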

Infochimps

Infochimps is another data marketplace and commons, founded by Strata speaker Flip Kromer.

Infochimps makes its data available either as downloadable data sets or via an API. For an example of commercial data available on Infochimps, check out the Twitter Census Conversation Metrics, which counts the occurrences of URLs, hashtags and smileys used on Twitter over a year.

DBpedia

A previous Strata Gem covered the use of Wikipedia as training data, but there’s more than just free text content inside Wikipedia: many articles contain structured information. DBpedia is a community-led project to extract this structured information and make it available on the web.

DBpedia offers a variety of data sets, covering entities such as cities, countries, politicians, films and books. The data is available as dumps, queryable online or available as crawlable linked data in RDF format.
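For the “queryable online” option, here is a small sketch of what a query against DBpedia’s SPARQL endpoint might look like. The `query` parameter and the JSON results layout follow the W3C SPARQL specifications; the endpoint address, the `format` parameter and the choice of `dbo:Film` as the class to ask for are assumptions, and the helper names `sparql_url` and `rows` are invented for this example.

```python
from urllib.parse import urlencode

# Assumed address of the public DBpedia SPARQL endpoint.
DBPEDIA_SPARQL = "https://dbpedia.org/sparql"

# Ask for five films and their English titles.
QUERY = """\
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?film ?title WHERE {
  ?film a dbo:Film ;
        rdfs:label ?title .
  FILTER (lang(?title) = "en")
}
LIMIT 5
"""


def sparql_url(query, endpoint=DBPEDIA_SPARQL):
    """Build a GET URL for a SPARQL query."""
    # "query" is the standard SPARQL protocol parameter; "format" is an
    # assumption about what this particular endpoint accepts.
    return endpoint + "?" + urlencode(
        {"query": query, "format": "application/sparql-results+json"})


def rows(results):
    """Flatten a SPARQL JSON results document into plain dicts."""
    return [{name: cell["value"] for name, cell in binding.items()}
            for binding in results["results"]["bindings"]]


print(sparql_url(QUERY))
```

To run the query for real, you could fetch the URL with `urllib.request.urlopen`, parse the response with `json.load`, and pass the result to `rows`.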
