As governments and businesses — and increasingly, all of us who are Internet-connected — release data out in the open, we come closer to resolving the tiresomely famous and perplexing quote from Stewart Brand: “Information wants to be free. Information also wants to be expensive.” Open data brings home to us how much free information is available and how productive it is in its free state, but one subterranean thread I found in Joel Gurin’s book Open Data Now highlights an important point: information is very expensive.
In this article, I’ll explore a few themes that piqued my interest in Gurin’s book: the value of open data, the expense it entails, the questions of how much we can use and trust it, and the role the general public and the private sector play in bringing us data’s benefits. This is not meant to be a summary or a review of Gurin’s book; it is an exploration of themes that interest me, inspired by my reading of Gurin.
Open, trustworthy, and useful
“Open data” occupies hierarchies of usefulness. One way of describing its usefulness is by the structure of its presentation, as Gurin and others such as Tim Berners-Lee have pointed out. Much data is still fairly unstructured, like the reviews and social media status postings that people generate by the millions and that are funneled into eager consumption by marketing analysts. Some data is more structured, existing as tables. And finally, a tiny fraction can be reached through RESTful APIs, which are supported by libraries in every modern programming language.
Putting government or business data into a RESTful API takes effort, unless one builds one’s entire IT organization around these APIs, as Jeff Bezos famously did at Amazon.com. Although an API makes it easy to extract particular data items and even filter the data, the value of an API may be overrated. Some people in the open data movement have told me, “Don’t ask organizations to wait until they have an API. Please just get the data out there: Excel, CSV, whatever they have. We’ll slurp it up and do the analysis.”
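To give a sense of how little machinery "slurping up" a plain file requires, here is a minimal sketch in Python using only the standard library. The data set, its column names, and the helper functions `load_rows` and `filter_rows` are all hypothetical, invented for illustration; a real release would need its own cleaning and type handling.

```python
import csv
import io

# Hypothetical sample standing in for a file an agency might publish as CSV.
RAW_CSV = """city,year,permits_issued
Springfield,2012,140
Springfield,2013,165
Shelbyville,2012,90
Shelbyville,2013,72
"""

def load_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))

def filter_rows(rows, **criteria):
    """Keep rows whose fields equal every given criterion (compared as strings)."""
    return [r for r in rows if all(r[k] == v for k, v in criteria.items())]

rows = load_rows(RAW_CSV)
springfield = filter_rows(rows, city="Springfield")

# CSV values arrive as strings, so convert before doing arithmetic.
total = sum(int(r["permits_issued"]) for r in springfield)
print(total)
```

The filtering that an API would offer as a query parameter happens here on the analyst's side, which is roughly the trade the open data advocates quoted above are willing to make.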
More important to data’s value is what’s collected and how it’s categorized. Some data can be trusted more because it is backed by official bodies or released in compliance with regulations.
User experience with XBRL is instructive. The standard was designed to store corporate reports in a way that supported sophisticated searches and comparisons in Semantic Web fashion, and was adopted as a requirement by many regulatory bodies, including the SEC in the US (where the requirement is now under question from Congress). But businesses and nonprofits trying to extract useful wisdom from the XBRL filings say that it is too complicated (a disease of many XML standards) and lacks some fields they need.
With this handicap, the expense of collecting free data may not pay off as much as we want. Another example is the expensive collection of patient data in electronic records, which are often not detailed or structured enough to support the kinds of analytics that could improve care.
In addition to structure, one must consider accuracy and trustworthiness. One of Gurin’s strengths is the clear-sighted view he brings to these issues. He points out that few successful examples of choice engines exist. These services employ algorithms that try to leverage publicly available data to help customers decide what insurance policy, car, college, or other product is right for them. Gurin attributes their business problems mostly to the difficulty of extracting payment for the services, but suggests two other potential show-stoppers. First, different services generate wildly different results, suggesting that the programmers don’t really understand the data and have chosen unsuccessful algorithms. Second, the services may be biased, either to please sponsors or just because they can’t get equally complete data on all the products they are supposed to cover.
Further warnings about building businesses on data analysis come in a new study claiming that a recent round of innovative lenders, who try to base credit decisions on “Internet searches, social media, and mobile apps,” rely on data that is actually “riddled with inaccuracies,” sometimes caused by combining data from different people. The algorithms used must be incredibly complex: “Instead of evaluating potential borrowers based on a FICO score, which uses 10-15 variables to arrive at its score, ZestFinance renders a credit decision after analyzing thousands of variables” which are run through “ten separate models.”
In my take, the problems of data quality multiply along with the number of sources. Most analysts try to pull in data of different types, often collected in different ways by different organizations. But each data set reveals its secrets in different ways. As a simple example, most social scientists know that the median tends to be a better way than the mean to summarize a typical value. The mean gives unwarranted weight to outliers, whereas the median shows the trends that affect most people more accurately. Analysts using even more sophisticated schemes for filtering, classification, and ranking need to make a dozen such decisions for every data set. When one tries to make sense of thousands of data sets at once, how does one even know how accurate the results are?
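The mean-versus-median point can be seen in a few lines of Python. This is only an illustrative sketch: the income figures are made up, with one deliberately extreme value to play the role of the outlier.

```python
from statistics import mean, median

# Hypothetical annual incomes, in thousands of dollars.
# The last value is a deliberate outlier.
incomes = [32, 35, 38, 41, 44, 47, 900]

avg = mean(incomes)    # pulled far upward by the single outlier
mid = median(incomes)  # the middle value of the sorted list

print(f"mean:   {avg:.1f}")
print(f"median: {mid}")
```

Six of the seven people in this made-up sample earn less than 50, yet the mean lands above 160. The median stays at 41, much closer to what most of the sample actually experiences.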
Open data takes money and effort
Although Gurin does not address the cost of providing data explicitly, many of his examples clearly required a big investment. Everybody’s favorite open data project, the US government’s GPS system, is staggeringly costly. How much does it take to launch and maintain 32 satellites? How much do telecom companies invest to blanket the landscape with cell phone towers?
Even things we consider data exhaust (by-products of other activities) may be expensive to collect, such as sales tax payments that are a useful indicator of the ups and downs of various businesses. Someone had to ring up all those receipts and report them to the government.
The more structure one wants to impose, such as by using XBRL or offering an API, the higher the cost of the data.
Cost can perhaps be ignored when individuals donate their time to collect and record data, as they do for OpenStreetMap, Wikipedia, social network status messages, and customer ratings. But this sort of data is also the least trustworthy, as Gurin points out. Usually one tries to solve the trust problem with redundancy: asking many different people to rate the same restaurant or classify the same photo, asking fans to check Wikipedia pages, and so on. Somebody is still taking time and effort to ensure the quality of data.
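One simple way redundancy can be turned into a trust signal is a majority vote with an agreement threshold. The sketch below is an assumption about how such a check might look, not a description of how any particular project does it; the `consensus` function, the 60 percent threshold, and the sample labels are all hypothetical.

```python
from collections import Counter

def consensus(labels, min_agreement=0.6):
    """Return the most common label if enough contributors agree, else None."""
    if not labels:
        return None
    label, count = Counter(labels).most_common(1)[0]
    return label if count / len(labels) >= min_agreement else None

# Hypothetical classifications of two photos by five volunteers each.
photo_labels = ["cat", "cat", "dog", "cat", "cat"]
disputed = ["cat", "dog", "bird", "dog", "cat"]

print(consensus(photo_labels))  # four of five agree, so a label is accepted
print(consensus(disputed))      # no label reaches the threshold
```

Items that fail the threshold still need a person to look at them, which is where the time and effort mentioned above come back in.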
The cost of data is a great argument for open data. Reliable studies, cited by Gurin, suggest that clever uses of open data add tens of billions of dollars to the US and European Union economies. By opening data, we allow more and more people to create new value and thus justify the cost of collection.
Starting with the stakeholders
One of Gurin’s most interesting themes, which is worth further investigation, is the importance of getting input from the expected users of data concerning what data to release and how. Now, one of the great virtues of openness is that one doesn’t know who might use the material released, and wonderful unanticipated benefits may accrue. But planning ahead can still be important.
A famous example of an unanticipated bad outcome is the abuse of land records in the Karnataka state of India. Here, open data ended up privileging an incomplete view of land ownership, and hence privileging corrupt rich people who could stomp all over the rights of the poor who lacked official records.
The whole incident played out as a typical misdirected development project, thought up in boardrooms far from the people it was meant to help. I doubt that the choice of data to put up was worked out in consultation with the poor residents who needed a firmer hold on their land. If the World Bank and the Karnataka government had allowed the residents to drive the release of data, they would have pursued the data that legitimized the just claims of the people working the land.
The principle of working with users applies to everybody on whom open data can have an impact, but the stakeholders Gurin speaks about are mostly businesses that can use data to create new services or improve their decision-making. This fuzzy shuttling between public interest and business interests characterizes the book, which is billed as a business book but really offers a lot for other readers as well. Working as I do for a publisher, I can imagine a marketing discussion a couple of years ago at McGraw-Hill Education: so, Joel Gurin wants to write a book about the release of data by governments and its effects on society. “We don’t know how to sell that, Joel, but try aiming it at business managers.”
And the book may indeed prompt some managers to retool their strategies, but nearly all of it is more broadly applicable, too. In this case, what’s good for General Motors is good for personal freedom, for democratic discourse, and for the effective use of public resources. How can governments, businesses, and public interest advocates further exploit this overlap in interests?
Open data could perhaps turn into a partnership, building with conscious intent on Tim O’Reilly’s government-as-platform idea. Data starts out being useful to the governments that collect it, then in its open form contributes further improvements to our lives through innovative business uses, but may reach its greatest potential if businesses partner with government.
The header on Challenge.gov reads, “A partnership between the public and the government to solve important challenges.” And these kinds of challenges do demonstrate a productive way for governments to say, “This is what we need. Can you bring it to life?” The FOIA requests described by Gurin also represent an engagement between business and government, though in a rather confrontational form.
The open source software movement — which doesn’t make many appearances in Open Data Now — was an early example of collaboration around open data and still provides impressively sophisticated models for a partnership of equals. Successful free software projects, each in its own way, manage to combine what every participant has to offer as coder, committer, tester, documenter, or advocate.
Some are highly structured, with only a few points where a go/no-go decision can be made, while others take a thousand-flowers-bloom attitude. Some are arranged around one or two developers, whereas others — such as OpenStack — spawn so many sub-projects run by different organizations that hardly anyone can remember them all. Some adhere to strict timelines, whereas others wait until everything seems to gel of its own accord.
So, we may have only glimpsed the start of organizing models for open data. What other modes of collaboration can we find?