Linking open data to augmented intelligence and the economy

Nigel Shadbolt on AI, ODI, and how personal, open data could empower consumers in the 21st century.

After years of steady growth, open data is now entering public discourse, particularly in the public sector. If President Barack Obama decides to put the White House’s long-awaited new open data mandate before the nation this spring, it will finally enter the mainstream.

As more governments, businesses, media organizations and institutions adopt open data initiatives, interest in the evidence behind data releases and the outcomes from them is increasing as well. High hopes abound in many sectors, from development to energy to health to safety to transportation.

“Today, the digital revolution fueled by open data is starting to do for the modern world of agriculture what the industrial revolution did for agricultural productivity over the past century,” said Secretary of Agriculture Tom Vilsack, speaking at the G-8 Open Data for Agriculture Conference.

As other countries consider releasing their public sector information as open data in machine-readable formats on the Internet, they’ll need to consider and learn from years of effort behind data.gov.uk in the United Kingdom, data.gov in the United States, and Kenya’s open data initiative in Africa.

One of the crucial sources of analysis of the success or failure of open data efforts will necessarily be research institutions and academics. That’s precisely why research from the Open Data Institute and Professor Nigel Shadbolt (@Nigel_Shadbolt) will matter in the months and years ahead.

In the following interview, Professor Shadbolt and I discuss what lies ahead. His responses were lightly edited for content and clarity.

How does your research on artificial intelligence (AI) relate to open data?

AI has always fascinated me. The quest to understand what makes us smart and how we can make computers smart has always engaged me. While we’ve been trying to understand the principles of human intelligence and build a “brain in a box,” smarter robots or better speech processing algorithms, the world has gone and done a different kind of AI: augmented intelligence. The web, with billions of human brains, has a new kind of collective and distributed capability that we couldn’t even see coming in AI. A number of us have coined the phrase “Web science” for understanding the Web at a systems level, much as we do when we think about human biology. We talk about “systems biology” because there are just so many elements: technical, organizational, cultural.

The Web really captured my attention ten years ago as a really new manifestation of collective problem-solving. If you think about the link to earlier work I’d done in what was called “knowledge engineering,” or knowledge-based systems, the problem there was that all of the knowledge resided on systems on people’s desks. What the web has done is give us something that looks a lot like a supremely distributed database. Now, that distributed knowledge base is one version of the Semantic Web. The way I got into open data was the notion of using linked data and Semantic Web technologies to integrate data at scale across the web — and one really high-value source of data is open government data.
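
For readers less familiar with the linked data idea mentioned above, here is a minimal sketch, in Python with the rdflib library, of what that looks like in practice: an open dataset described as RDF triples, linked to an external resource by a shared URI, and queried with SPARQL. The dataset URI, title and spatial link are invented for the example.

```python
# Minimal linked-data sketch using rdflib (pip install rdflib).
# The dataset URI and values below are illustrative, not real identifiers.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCTERMS, RDF

DCAT = Namespace("http://www.w3.org/ns/dcat#")

g = Graph()

# Describe a hypothetical open government dataset as RDF triples.
dataset = URIRef("http://example.gov/dataset/local-spending-2013")
g.add((dataset, RDF.type, DCAT.Dataset))
g.add((dataset, DCTERMS.title, Literal("Local authority spending, 2013")))

# Link it to an external, independently published resource. Shared URIs are
# what let data published by different organizations be integrated at scale.
g.add((dataset, DCTERMS.spatial, URIRef("http://dbpedia.org/resource/London")))

# Query the combined graph with SPARQL, much as you would query a database.
results = g.query("""
    PREFIX dcat: <http://www.w3.org/ns/dcat#>
    PREFIX dcterms: <http://purl.org/dc/terms/>
    SELECT ?title ?place WHERE {
        ?d a dcat:Dataset ;
           dcterms:title ?title ;
           dcterms:spatial ?place .
    }
""")
for title, place in results:
    print(title, place)
```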

What was the reason behind the founding and funding of the Open Data Institute (ODI)?

The open government data piece originated in work I did in 2003 and 2004. We were looking at this whole idea of putting new data-linking standards on the Web. I had a project in the United Kingdom that was working with government to show the opportunities to use these techniques to link data. As in all of these things, that work was reported to Parliament. There was real interest in it, but not really top-level, heavy “political cover” interest. Tim Berners-Lee’s engagement with the previous prime minister led to Gordon Brown appointing Tim and me to look at setting up data.gov.uk and getting data released, and then to the current coalition government taking that forward.

Throughout this time, Tim and I have been arguing that we could really do with a central focus, an institute whose principal motivation was working out how we could find real value in this data. The ODI does exactly that. It’s got about $16 million of public money over five years to incubate companies, build capacity, train people, and ensure that the public sector is supplying high-quality data that can be consumed. The fundamental idea is that you ensure high-quality supply by generating a strong demand side. The demand side isn’t just the public sector; it’s also the private sector.

What have we learned so far about what works and what doesn’t? What are the strategies or approaches that have some evidence behind them?

I think there are some clear learnings. One that I’ve been banging on about recently has been that yes, it really does matter to turn the dial so that governments have a presumption to publish non-personal public data. If you would publish it anyway under a Freedom of Information request, or whatever your local legislative equivalent is, why aren’t you publishing it as open data? That, as a behavioral change, is a big one for many administrations where either the existing workflow or culture is, “Okay, we collect it. We sit on it. We do some analysis on it, and we might give it away piecemeal if people ask for it.” We should construct the publication process from the outset with a presumption to publish openly. That’s still something we are two or three years away from; we’re working hard with the public sector to work out how to do it and how to do it properly.

We’ve also learned that in many jurisdictions, the amount of [open data] expertise within administrations and within departments is slight. There just isn’t really the skillset, in many cases, for people to know what it is to publish using technology platforms. So there’s a capability-building piece, too.

One of the most important things is that it’s not enough to just put lots and lots of datasets out there. It would be great if the “presumption to publish” meant they were all out there anyway — but when you haven’t got any datasets out there and you’re thinking about where to start, the tough question is, “How can I publish data that matters to people?”

The data that matters is revealed if we look at the download stats on these various UK, US and other [open data] sites. There’s a very, very distinctive power curve. Some datasets are very, very heavily utilized; you suspect they have high utility to many, many people. Many of the others, if they can be found at all, aren’t being used particularly much. That’s not to say that, in that long tail, there aren’t large amounts of use. A particularly arcane open dataset may have exquisite use to a small number of people.

The real truth is that it’s easy to republish your national statistics. It’s much harder to do a serious job on publishing your spending data in detail, publishing police and crime data, publishing educational data, publishing actual overall health performance indicators. These are tough datasets to release. As people are fond of saying, it holds politicians’ feet to the fire. It’s easy to build a site that’s full of stuff — but does the stuff actually matter? And does it have any economic utility?

Page views and traffic aren’t ideal metrics for measuring the success of an open data platform. What should people measure, in terms of actual outcomes in citizens’ lives? Improved services or money saved? Performance improved or corrupt politicians held accountable? Companies started or new markets created?

You’ve enumerated some of them. It’s certainly true that one of the challenges is to instrument the effect or the impact. Actually, it’s the last thing that governments, nation states, regions or cities enthused about open data get around to doing. It’s quite hard.

Datasets, once downloaded, may then be virally reproduced all over the place, so you can’t observe that reuse from a government site. Most of the open licenses that are so essential to this effort require attribution. Those licenses should be embedded in the machine-readable datasets themselves. Not enough attention is paid to that piece of the process: actually noticing, when you’re looking at other applications and other data and publishing efforts, that the attribution is there. We should be smarter about making better sense of the attribution data.
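
As a concrete, purely illustrative sketch of that idea, the snippet below (Python; the field names, publisher URI and records are invented, and the license URL is the UK Open Government Licence used as an example) bundles license and attribution metadata directly into a published dataset file, so a downstream application or crawler could detect the required attribution programmatically.

```python
import json

# Hypothetical example: publish data records together with machine-readable
# license and attribution metadata, rather than publishing the records alone.
package = {
    "metadata": {
        "title": "Local authority spending, 2013",
        "license": "http://www.nationalarchives.gov.uk/doc/open-government-licence/",
        "attribution": "Contains public sector information licensed under the Open Government Licence.",
        "publisher": "http://example.gov/authority/anytown",  # illustrative URI
    },
    "records": [
        {"supplier": "Acme Ltd", "amount_gbp": 12500.00, "date": "2013-01-15"},
    ],
}

with open("spending-2013.json", "w") as f:
    json.dump(package, f, indent=2)

# A consumer, or a crawler trying to measure reuse, can then check whether
# the license and attribution travelled with the data.
with open("spending-2013.json") as f:
    loaded = json.load(f)
print(loaded["metadata"]["license"])
print(loaded["metadata"]["attribution"])
```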

There are other sources of impact, though: how do you evidence actual internal efficiencies and government-wide benefits of open data? I had an interesting discussion recently where the IT department said, “You know, I thought this was all stick and no carrot. I thought this was all overhead, getting my data out there for other people’s benefit, but we’re now finding it so much easier to re-consume our own data and repurpose it in other contexts that it’s taken a huge amount of friction out of our own publication efforts.”

Quantified measures would really help, if we had standard methods to notice those kinds of impacts. Our economists, people whose work is about understanding where value is created, really haven’t embraced open markets, particularly open data markets, in a very substantial way. I think we need a good number of capable economists piling into this, trying to understand new forms of value and what kinds of value are created.

I think a lot of the traditional models don’t stand up here. Bizarrely, it’s much easier to measure impact when information scarcity exists and you have something that I don’t, and I have to pay you a certain fee for that stuff. I can measure that value. When you’ve taken that asymmetry out, when you’ve made open data available more widely, what are the new things that flourish? In some respects, you’ll take some value out of the market, but you’re going to replace it with wider, more distributed, more capable services. This is a key issue.

The ODI will certainly be commissioning and is undertaking work in this area. We published a piece of work jointly with Deloitte in London, looking at evidence-linked methodology.

You mentioned the demand-side of open data. What are you learning in that area — and what’s being done?

There’s an interesting tension here. If we turn the dial in the governmental mindset to the “presumption to publish” — and in the UK, our public data principles actually embrace that as government policy — you are meant to publish unless there’s a personal-information or national-security reason why you would not. In a sense, you say, “Well, we’ll just publish everything out there. That’s what we’ll do. Some of it will have utility, and some of it won’t.”

When the Web took off and you offered pages as a business or an individual, you didn’t foresee the link-making that would occur. You didn’t foresee that PageRank would ultimately give you a measure of your importance and relevance in the world, and could even be monetized after the fact. You didn’t foresee that those pages have their own essential network effect: the more pages there are that interconnect, the more value is created, and so there’s a strong argument [for publishing them].

So, you know, just publish. In truth, the demand side is an absolutely great and essential test of whether [publishing data] actually does matter.

Again, to take the Web as an analogy, large amounts of the Web are unattended, neglected, and rotting. It’s just stuff nobody cares about, actually. What we’re seeing in the open data effort in the UK is that some data is clearly very privileged. It sits at the center of lots of other datasets.

In particular, [data about] location, about what occurred and when it occurred, and stable ways of identifying the things that are occurring. Then, of course, there’s the data space that relates to companies: their identifiers, the contracts they enter into, and the spending they engage in. That is the meat and drink of business intelligence apps all across the planet. If you started to turn off the ability of business intelligence to access legal identifiers or business identifiers, all sorts of oversight would fall apart, apart from anything else.

The demand side [of open data] can be characterized. It’s not just economic. It will have to do with transparency, accountability and regulatory action. The economic side of open data gives you huge room for maneuver and substantial credibility when you can say, “Look, this dataset of spending data in the UK, published by local authorities, is the subject of detailed analytics from companies who look at all the data about how local authorities and governments are spending their money. They sell procurement analysis and insights back to businesses, third parties and other parts of the business world, saying ‘This is the shape of how UK plc is buying.’”

What are some of the lessons we can learn from how the World Wide Web grew and the value that it’s delivered around the world?

That’s always a worry: that, in some sense, the empowered get more powerful. What we do see in open data, in particular, is new sorts of players entering the game who couldn’t enter it at all before.

My favorite example is in mass transportation. In the UK, we had to fight quite hard to get some of the data from bus, rail and other forms of transportation made openly available. Until that was done, there was a pretty small number of suppliers in this market.

In London, where all of it was made available by Transport for London, there’s just been an explosion of apps and businesses giving you subtly distinct experiences as users of that data. I’ve got about eight or nine apps on my phone that give me interestingly distinctive views of moving about the city of London. I couldn’t have predicted or anticipated that many of those would exist.

I’m sure the companies who held that data could’ve spent large amounts of money and still not given me anything like the experience I now have. The flood of innovation around the data has been really significant, and there are many, many more players and stakeholders in that space.

The Web taught us that serendipitous reuse, where you can’t anticipate where the bright idea comes from, is what is so empowering. The flipside is that it also reveals that, in some cases, the data isn’t necessarily of the quality you might have thought. This effort might allow for civic improvement or, indeed, business improvement in some cases, where businesses come and improve the data the state holds.

What’s happening in the UK with the so-called “MiData Initiative,” which posits that people have a right to access and use personal data disclosed to them?

I think this is every bit as potentially disruptive and important as open government data. We’re starting to see the emergence of what we might think of as a new class of important data, “personal assets.”

People have talked about “personal information management systems” for a long time now. Frequently, it’s revolved around managing your calendar or your contact list, but it’s much deeper. Imagine that you, the consumer, or you, the citizen, had a central locus of authority around data that was relevant to you: consumer data from retail, from the banks that you deal with, from the telcos you interact with, from the utilities you get your gas, water and electricity from. Imagine if that data infosphere was something that you could access easily, with a right to reuse and redistribute it as you saw fit.

The canonical example, of course, is health data. It isn’t only data that businesses hold; it’s also data the state holds, like your health records, educational transcripts, welfare, tax, or any number of other areas.

In the UK, we’ve been working towards empowering consumers, in particular through this MiData program. We’re trying to get to a place where consumers have a right to data held about their transactions by businesses, [released] back to them in a reusable and flexible way. We’ve been working on a voluntary program in this area for the last year. We have a consultation on taking a power to require large companies to give that information back. There is a commitment in the UK, for the first time, to get health records back to patients as data they control, but I think it has to go much more widely.

Personal data is a natural complement to open data. Some of the most interesting applications I’m sure we’re going to see in this area are where you take your personal data and enrich it with open data relating to businesses, the services of government, or the actual trading environment you’re in. In the UK, we’ve got six large energy companies that compete to sell energy to you.

Why shouldn’t groups and individuals be able to get together and collectively purchase in the same way that corporations can purchase and get their discounts? Why can’t individuals be in a spot market, effectively, where it’s easy to move from one supplier to another? Along with those efficiencies in the market and improvements in service delivery, it’s about empowering consumers at the end of the day.

This post is part of our ongoing series on the open data economy.

  • http://www.oss.net RobertDavidSTEELEVivas

    Sigh. O’Reilly has been beating the open source drum on one channel for the past fifteen years. Neither open data nor open government are going anywhere. The meme is Open Source Everything as in openbts, open cloud, open data, open hardware, open software, open standards, etc. I wrote the book, THE OPEN SOURCE EVERYTHING MANIFESTO: Transparency, Truth, & Trust and have put a preliminary list of Opens at http://tinyurl.com/OSE-LIST. Bottom line: until we go “all in” on all the opens, nothing is affordable, scalable, or meaningful. IMHO.

    • Alexander Howard

      Hi Robert,

Your knowledge of the history of technology is more precise than mine: O’Reilly and other people in the tech industry have been talking about open source software since 1998. I was a college senior at the time and more focused on the biology of the natural world than any software ecosystems online.
      http://en.wikipedia.org/wiki/Open-source_software#History

      I’ve focused much more on open source, open data and open government since then and shared a fair bit of what I’ve learned along the way.
      http://cyber.law.harvard.edu/events/luncheon/2012/03/howard

      From what I can tell, open source software is affordable and scalable. That’s why a generation of engineers learned the LAMP stack and another one is growing up learning Hadoop, OpenStack and social coding on Github. The next generation of digital infrastructure looks open. As Roger said, “Think Git, D3, Storm, Node.js, Rails, Mongo, Mesos or Spark.”
      http://programming.oreilly.com/2012/07/open-source-won.html

Open source is mainstream now, adopted across enterprises and government. Open source software and hardware are a core part of the infrastructure that is democratizing the ability to make sense of petabytes of data. That means something.

      So, too, does open government. A worldwide push for access to information against corruption and towards more participatory governance is ongoing.

      If you don’t believe me, look to another masthead: at the end of 2012, the Economist found that global open government efforts are growing in “scope and clout.”
      http://radar.oreilly.com/2012/12/10-trends-from-2012.html

      I don’t think the world is confronted by an “all or nothing,” binary choice when it comes to “the opens.” The world is full of grey areas, particularly when we are confronted by the new issues stemming from technological change. There are rules, regulations, laws and ethics to consider as guides, along with the human context of each decision.

      Making some categories of personal or government data open can violate people’s privacy rights or put people in danger. Keeping other data secret can hide corruption, fraud, graft or incompetence.

      We can also open some things and not others — and still have it be “meaningful.” For example, consider weather data, GPS signals and the US Census or books in the Library of Congress and the great city libraries.

      Thank you for the comment and the link to your list.

      -Alex

      • http://www.oss.net RobertDavidSTEELEVivas

No argument at all with your thoughtful comment, just one observation from Colin Gray, Modern Strategy: time is the one strategic variable that cannot be bought or replaced. I had most of this scoped out in 1988-1992, then a bit more in 1994-1995, and although I was elected to the Silicon Valley Hackers Conference for my efforts, even making the Microtimes 100 list twice, everybody — without exception — refused to pay attention. Real heroes like Richard Stallman have been marginalized but are now coming out strong. The bottom line is that SPEED OF SCALE matters, and so does BREADTH OF SCALE. That takes an all-in approach, and also demands that data be completely independent of all software and hardware. Vendors lie like dogs about data compatibility and then over-charge for data conversion. We have a long way to go, but I would say we are at the end of the beginning. Semper Fi, Robert

      • http://twitter.com/FFreeDemocracy FFreeDemocracy

        Hi Alexander,

Sorry to intrude here. Open source software should be scalable by nature, if it really has the vocation of being free software (free as in “free speech,” not as in “free beer,” as Mr Stallman would put it). However, this does not always depend on the will of the programmer, but on the quality of the software’s architecture and design, and therefore on the programmer’s ability to foresee future needs. A small project should really have the soul of a large project, and that is something we programmers should improve at. Open source software, being open, does allow a small core to grow, though this is not always easy.

        My two, very humble cents.

        • digiphile

          Not an intrusion at all! These comments are open for a reason.

  • Laurence Webb

    Environmental data on several million businesses is another dataset that’s very lacking at the moment – something that http://www.amee.com is trying to change.

The answer as to whether this dataset has economic utility is certainly yes. In light of growing resource constraints, procurement managers are in real need of ways to make their supply chains more efficient.

    Knowing which suppliers expose them to most environmental risk is the first step in changing business practice for the better. To enable that we need transparent environmental data.

    • Alexander Howard

Thank you for the comment, Laurence. To date, how useful has the environmental data shared by the US EPA, NOAA and European government agencies been to your work?

  • http://www.facebook.com/roger.barrow Roger Barrow

    I thought this was supposed to be an article about combining open linked data with Artificial Intelligence.

    • http://twitter.com/FFreeDemocracy FFreeDemocracy

      Hi Roger,

Augmented intelligence does not always have to be artificial intelligence. There are many studies which confirm that collective intelligence is greater than the mere sum of the individuals, because it triggers ideas and behaviors. This is the basis of the somewhat dated practice of brainstorming.

Take, for instance, “Wolfbane,” the novel by Pohl and Kornbluth. What if we could link with other beings as Tropile was linked in the Snowflake? The story was published in 1959, mind you. Sound familiar?

Data is the origin of information, and information is a basic need for intelligent decision making. If we had a way of gathering big data and treating it as small data to obtain useful information, would that not be wonderful? The intelligence used in that decision-making process would really be augmented.