Strata Gems: Where to find data

Starting points for data markets and open data

We’re publishing a new Strata Gem each day all the way through to December 24. Yesterday’s Gem: Quick starts for charts.

With the growth of both the open data movement and data marketplaces, there’s now a wealth of public data – some free, some for sale – that you can use in your analyses and applications. It’s not just about data dumps: increasingly you can get data through APIs, or even run your own code against it on servers provided by the data host.


Freebase

An icon of the open data movement, Freebase is a graph database full of “people, places and things”. The data is contributed and edited by community volunteers. Freebase was recently acquired by Google.

Freebase both names real world entities, and stores structured data about the attributes of and relations between those things. For example, see the page for the movie Harry Potter and the Deathly Hallows: Part I. It looks a bit like a Wikipedia page, but you can edit and retrieve the structured data for every page.

Developers have access to a variety of Freebase services, including dumps of the entire database, and API access to the data. Of particular interest is “Acre”, a hosted platform that lets you implement an application on Freebase servers, close to the data you need.
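To give a feel for API access, here is a minimal sketch of a Freebase read query. Freebase queries are written in MQL, a JSON "query by example" format: you supply the values you know and leave blanks for the ones you want filled in. The type and property names (/film/film, directed_by, initial_release_date) and the endpoint URL are assumptions for illustration and should be checked against the Freebase documentation. The code only builds the request; it makes no network call.

```python
import json
from urllib.parse import urlencode

# Assumed read endpoint; verify against the Freebase API documentation.
MQLREAD_ENDPOINT = "http://api.freebase.com/api/service/mqlread"


def build_mql_query(film_name):
    """Return an MQL query-by-example for a film, as a Python structure."""
    return [{
        "type": "/film/film",          # assumed type identifier
        "name": film_name,             # the value we already know
        "directed_by": [],             # empty list: "fill in all values"
        "initial_release_date": None,  # None (JSON null): "fill in one value"
    }]


def build_mqlread_url(query, endpoint=MQLREAD_ENDPOINT):
    """Wrap the query in an envelope and encode it as a GET request URL."""
    envelope = {"query": query}
    return endpoint + "?" + urlencode({"query": json.dumps(envelope)})


if __name__ == "__main__":
    q = build_mql_query("Harry Potter and the Deathly Hallows: Part I")
    print(json.dumps(q, indent=2))
    print(build_mqlread_url(q))
```

The response would be expected to mirror the shape of the query, with the blanks filled in.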

Screenshot from Freebase, showing activity in the most popular data sets

Amazon Public Data Sets

As a public service, Amazon Web Services host a variety of Public Data Sets available to users writing applications on their cloud services. By putting the data on servers next to their cloud computing platform EC2, Amazon helps avoid the difficulty of locating, cleaning and downloading data. The data never needs to travel: only your code. This is obviously valuable when data sets get particularly large, or are updated frequently.

Amazon’s public data sets include annotated human genome data, a variety of data sets from the US Census Bureau, and dumps from services such as Freebase and Wikipedia.

Windows Azure Data Market

Launched publicly by Microsoft this year, the Azure Data Market offers a variety of data sets and sources, accessible by the OData protocol. OData offers uniform access to data, along with a standardized query interface. By using data from the market, a user can reduce the friction of parsing and importing data. Unsurprisingly, Microsoft’s own tools such as Excel allow importing of data directly from the marketplace’s OData endpoints.
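As a rough illustration of that standardized query interface, the sketch below builds an OData request URL using the protocol's system query options ($filter, $select, $top, $format). The service root, the entity set name (Games) and the field names are hypothetical and do not refer to a real Data Market feed. The code only constructs the URL; it makes no network call.

```python
from urllib.parse import urlencode, quote

# Hypothetical service root, used only for illustration.
SERVICE_ROOT = "https://example.com/odata/SportsStats"


def build_odata_url(entity_set, filter_expr=None, select=None, top=None):
    """Build an OData query URL from standard system query options."""
    options = {}
    if filter_expr:
        options["$filter"] = filter_expr
    if select:
        options["$select"] = ",".join(select)
    if top is not None:
        options["$top"] = str(top)
    options["$format"] = "json"
    # Keep "$" and "," readable; percent-encode spaces and quotes.
    query_string = urlencode(options, quote_via=quote, safe="$,")
    return f"{SERVICE_ROOT}/{entity_set}?{query_string}"


if __name__ == "__main__":
    print(build_odata_url(
        "Games",
        filter_expr="Season eq 2010",
        select=["Team", "Runs"],
        top=5,
    ))
```

Because every OData source accepts the same query options, the same helper would work against any feed by changing only the service root and entity set.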

Azure Data Market contains both free and for-pay data sets, offering a route to monetization for data publishers. Free data sets include government and international agency data. As an example of for-pay data, a sports data provider offers MLB game-by-game statistics through the marketplace.

The emergence of data marketplaces offers developers a legitimate route to data previously only obtainable at high cost, or through illicit web scraping.

Yahoo! Query Language (YQL)

YQL is a technology that presents web services in a way that lets familiar SQL-like queries be executed against them. SELECT, INSERT and DELETE operations can be performed against services such as Flickr.

In essence, YQL offers a technology similar to OData, providing an adapter layer that gives data consumers a uniform interface to data. Data providers must provide their data as an Open Data Table, or third parties can contribute adapter definitions, such as those for Foursquare, GitHub, and Google. The most limiting aspect of YQL is that queries must run through Yahoo’s own servers.
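A small sketch of what a YQL request might look like: an SQL-like statement is passed as the q parameter to Yahoo's query endpoint. The endpoint URL, the flickr.photos.search table name and the search term are assumptions for illustration and should be checked against the YQL documentation. The code only builds the URL; it makes no network call.

```python
from urllib.parse import urlencode

# Assumed public YQL endpoint; verify against the YQL documentation.
YQL_ENDPOINT = "http://query.yahooapis.com/v1/public/yql"


def build_yql_url(statement, fmt="json"):
    """Encode a YQL statement as a GET request URL."""
    return YQL_ENDPOINT + "?" + urlencode({"q": statement, "format": fmt})


if __name__ == "__main__":
    # Assumed table name and search text, for illustration only.
    statement = 'SELECT * FROM flickr.photos.search WHERE text = "strata" LIMIT 5'
    print(build_yql_url(statement))
```

Note that the request goes to Yahoo rather than to Flickr itself, which is the limitation mentioned above.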


Infochimps

Infochimps is another data marketplace and commons, founded by Strata speaker Flip Kromer.

Infochimps makes its data available either as downloadable data sets, or accessible via an API. For an example of commercial data available on Infochimps, check out the Twitter Census Conversation Metrics, which counts the occurrence of URLs, hashtags and smileys used over a year on Twitter.


DBpedia

A previous Strata Gem covered the use of Wikipedia as training data, but there’s more than just free text content inside Wikipedia: many articles contain structured information. DBpedia is a community-led project to extract this structured information and make it available on the web.

DBpedia offers a variety of data sets, covering entities such as cities, countries, politicians, films and books. The data is available as dumps, queryable online or available as crawlable linked data in RDF format.
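To illustrate the "queryable online" option, here is a minimal sketch that builds a SPARQL request for a handful of films and their English labels. It assumes the public endpoint at http://dbpedia.org/sparql and the ontology class http://dbpedia.org/ontology/Film. The query parameter comes from the SPARQL protocol, while the format parameter is an assumption about the endpoint's server software. The code only builds the URL; it makes no network call.

```python
from urllib.parse import urlencode

# Assumed public SPARQL endpoint for DBpedia.
SPARQL_ENDPOINT = "http://dbpedia.org/sparql"


def build_film_query(limit=10):
    """Return a SPARQL query selecting films and their English labels."""
    return f"""
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?film ?label WHERE {{
  ?film a <http://dbpedia.org/ontology/Film> ;
        rdfs:label ?label .
  FILTER (lang(?label) = "en")
}}
LIMIT {int(limit)}
""".strip()


def build_sparql_url(query, endpoint=SPARQL_ENDPOINT):
    """Encode a SPARQL query as a GET request URL."""
    params = {
        "query": query,
        # Assumed, server-specific way to request JSON results.
        "format": "application/sparql-results+json",
    }
    return endpoint + "?" + urlencode(params)


if __name__ == "__main__":
    q = build_film_query(limit=5)
    print(q)
    print(build_sparql_url(q))
```

Swapping the class URI for another entity type, such as cities or books, would follow the same pattern.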


  • Erik Hooder

    Get a load of how Government 2.0 is working:

  • Also check out Factual: we’re an open platform that hosts hundreds of thousands of datasets in dozens of categories. We recently launched local data for more than 14 million US businesses, and for millions of businesses in the UK, Japan, Italy, Indonesia, and Australia. We also have premium datasets in a variety of other categories (health, education, entertainment, government). Feel free to check out and play around with our data, which you can do either through our JavaScript or server-side APIs.

  • A good set of references provided here, thank you.

    (One likely typo, however: I don’t think “analyses” is a real word, or if so, is misused in your first paragraph. Might update this to “analysis” instead ;-)


  • The following blog post attempts to categorize the various Data-as-a-Service offerings. It adds a few more sources and interesting initiatives in the space for those that are interested:

    Data-as-a-Service: Market Definitions

  • There’s also CKAN, which powers and over 20 data catalogues around the world. It pulls in information about open government data from numerous sources, and is currently being used as the basis for work on a pan-European open data catalogue.


  • The Sunlight Foundation hosts the National Data Catalog, a comprehensive resource for government data across all levels: federal, state, and municipal.

  • No mention of – a community maintained catalogue with over 1500 datasets listed.

  • Structured Web

    We are using Freebase’s data dumps to showcase our technology of making information search extremely easy. Have a look: