Data as a service

A look at how services and widgets are democratizing data and visualization.

The last two months have seen some important developments in the way data is made available. First, Infochimps created a web API for publishing data. The number of datasets is still small: five are available now, four of which deal with Twitter data, while the fifth maps IP addresses to census data (and that one appears not to be available yet). Their site lets you request new datasets, or vote on existing requests. Pricing is reasonable: you can do significant experimentation, or even run a useful low-volume application, without running up any charges.

“Data as a service” is not a new term, by any means. There have been any number of data services over the years. But this is something different from the many services that have sold data — or even the more recent services that have sold data via the Internet. Data as a service is another part of the cloud computing alphabet soup, on par with “infrastructure, software, or platform as a service” (IaaS/SaaS/PaaS). Infochimps makes possible applications where data lives in the cloud. Granted, you’re not going to access terabyte datasets over the Internet. But neither do you have to download (or have shipped) a giant dataset for the few kilobytes or megabytes that interest you. Infochimps is pushing a bit beyond simple data access. Their Twitter APIs aren’t raw data, but implement trust metrics, influence metrics, and more. So perhaps it’s better to call this “algorithm as a service” (AaaS), not unlike the Prediction API (machine learning using Google’s algorithms) that was announced at Google I/O.
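Under the hood, a query against such a service is nothing more exotic than an authenticated HTTP request returning JSON. Here’s a minimal sketch in Python — the endpoint, parameter names, and response fields are invented for illustration, not Infochimps’ actual API:

```python
import json
from urllib.parse import urlencode

def build_query_url(base, dataset, api_key, **params):
    """Assemble a REST query URL for a hypothetical data-as-a-service API."""
    params.update({"dataset": dataset, "apikey": api_key})
    return base + "?" + urlencode(sorted(params.items()))

url = build_query_url("http://api.example.com/v1/query",
                      "twitter-trust", "MY_KEY", screen_name="oreillymedia")

# In a real application you'd fetch `url` with urllib or a similar library;
# here we just decode a sample JSON body as such a service might return it.
sample_response = '{"screen_name": "oreillymedia", "trust": 0.87, "influence": 0.92}'
metrics = json.loads(sample_response)
```

The point is that the dataset lives behind a URL: swap the `dataset` parameter and the same client code queries a different collection.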

The second new data service that has impressed me is Google’s new Public Data Explorer. I assume that everyone reading this article has seen the latest spectacular data visualizations in the New York Times, on Nathan Yau’s FlowingData blog, and elsewhere. Here’s one example from GE (created by Ben Fry’s Fathom Information Design). Public Data Explorer lets you create your own visualizations based on Google’s data.

Here’s one of their examples (nicer than anything I came up with on the fly). It’s an animation of per-capita income in California counties that shows how the individual counties have fared from 1969 to 2007. I’ve highlighted a few interesting counties — let’s see how they perform:

Not surprisingly, the difference between the richest and poorest counties has drastically increased. Google provides many datasets, and gives you interesting ways to arrange and animate the data. I’ve displayed a fairly simple bar graph animation, but you can also do bubbles on a map and several kinds of Cartesian plots. You can slice and dice regions in many different ways, frequently down to the county level. They’ve got data from the European Community, from Australia, the World Bank, and other sources. None of this is exactly new: the data has been around for years. What Public Data Explorer does is enable you to explore the data yourself and paste the results into your own sites and blogs.



Wolfram Alpha’s Widgets provide yet another way to interact with data — potentially the most flexible yet. (You’ll have to create an account and sign in.) Widgets are web components for interacting with Wolfram Alpha’s data back-end. You can do pure Mathematica queries, but you can also interact with the extensive data Wolfram has been collecting. There’s no programming required (unless you want to submit Mathematica queries). There’s a web-based widget builder where you start with an Alpha request, like “US Unemployment,” parameterize it, specify the layout you want and how you want to embed it (lightbox, popup, and iframe styles are supported), and finally test it. At the end, you get a link or a clump of JavaScript that you can paste into a website or a blog, and you can include the widget in a public gallery. You can also post directly to Facebook, Twitter, and most other social sites.

Here’s an example: a simple widget to compare two stocks. You can select your own stocks or just use my defaults (Apple and Google):

It took me about five minutes to whip up this widget, starting with the simple Alpha query “AAPL GOOG.” But there are many ways to look up stock prices and histories. What about something more esoteric? Alpha knows an incredible amount. The other day my wife and I couldn’t remember what a half-diminished seventh chord was. Alpha knows, and can show you a piano keyboard, guitar fingerings, and even play the chord. Here’s the result:

You’ll confuse it if you try really odd chords; don’t try fancy jazz ninths and thirteenths, and remember to specify “triad” if you want your basic three-note chord.

Alpha’s weak point is that you frequently end up playing “guess what Alpha wants.” I suppose that’s the tradeoff for flexibility, but it was surprisingly difficult to build an interest calculator. Building the widget was simple enough; coming up with the initial query was hard. Alpha would assume I was paying off a loan, or doing a present-value calculation, or something else, until I juggled the terms into the right order, which happened to be “10% interest $100 initial value 7 years.” Not illogical, but neither were any of the other attempts.
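For what it’s worth, the arithmetic behind that query is ordinary compound interest (assuming annual compounding, which is what my phrasing implied):

```python
# Future value under annual compounding: final = principal * (1 + rate) ** years
principal, rate, years = 100.0, 0.10, 7
final_value = principal * (1 + rate) ** years
print(round(final_value, 2))  # 194.87
```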

That’s a minor problem, though. Widgets make it fun to explore data and computation, and trivial to share the results. With “data as a service” APIs like Infochimps, and embeddable data components like Google Public Data Explorer and Wolfram Alpha Widgets, we’re seeing the democratization of data and data visualization: new ways to access data, new ways to play with data, and new ways to communicate the results to others.


  • Eli Rosenberg

You may also want to take a look at socrata.com – there is an open source API, data from the White House, Medicare (data.medicare.gov), Seattle (data.seattle.gov), Chicago (data.cityofchicago.org), as well as the EPA. Several federal, state and local government organizations are also coming online soon.

  • J Milne

    See also scraperwiki.com

  • Juergen Brendel

If you are interested in providing a web API for the publishing of data, you might want to look at RESTx, a very quick and easy way to create RESTful web services.

    The idea is that developers can provide components that implement access/analysis/integration of data (from databases, cloud or proprietary systems) and that knowledge workers (or the public even) can easily create new RESTful web services by providing sets of parameters to those components. Each parameter set gets its own URI, thus allowing the very quick creation of new services.

    All data is available as HTML or JSON, depending on the client request.

  • Atul Kedar

    There are three big issues with data.gov usability

1. Lack of versioning of datasets. Data owners may at any time update data or change its format. Once a dataset is placed on data.gov it should be locked, and any updates to the dataset must be available as a link accompanied by an update reason. This will allow data consumers to access data from data.gov directly instead of downloading it and manually comparing datasets – which is a complex task. This suggestion has performance implications for data.gov, but having an absolute URI to a dataset is important.

2. Availability of high-level APIs. RESTful services for consuming data are better than framework- and language-dependent web services (as Juergen pointed out in his comments). REST-based services make it easy for other applications to consume data and to create mashups.

3. Better searching amongst the datasets. Each dataset must be tagged with enough metadata to make searching easier, e.g. data collection period, data format, disambiguated column headings, semantic tagging. Using keywords to search for potential datasets, or clicking through 100 pages looking for a dataset, is not practical. This hardship is evident in the lack of enthusiasm among serious data consumers for data.gov. Setting up automatic alerts, e.g. Google alerts, is a nice way of staying informed of new datasets.

    Atul Kedar
    Architect
    Sapient

  • Dan Wilson

    We’re tracking over a million statistics over at timetric.com - all available for free and with a RESTful API.

  • Pete Soderling

Great article on data-as-a-service, and from a market perspective I think the mention of Infochimps is significant, since they’re creating a data marketplace that, if executed properly, gets premium data the ‘final mile’ and into the hands of actual, paying customers.

    At Stratus, we’ve also been thinking a lot about DaaS and premium data, and have published some thoughts on the various pricing model options for premium data distributed via API in case it’s of interest – http://blog.programmableweb.com/2010/08/26/data-as-a-service-pricing-models-for-the-future-of-data/

    Nice work here.

    -pete