Before you interrogate data, you must tame it

IBM, Wolfram|Alpha, Google, Bing, groups at universities, and others are trying to develop algorithms that parse useful information from unstructured data.

The difficulty of searching unstructured data is a dull pain for many industries, but data journalists felt it sharply with the WikiLeaks releases. In a recent interview, Simon Rogers (@smfrogers), editor of the Guardian’s Datablog and Datastore, talked about the considerable differences between the first batch of WikiLeaks releases, which arrived in a structured form, and the text-filled mass of unstructured cables that came later.

There were three WikiLeaks releases. One and two, Afghanistan and Iraq, were very structured. We got a CSV sheet, which was basically the “SIGACTS” — that stands for “significant actions” — database. It’s an amazing data set, and in some ways it was really easy to work with. We could do incredibly interesting things, showing where things happened and events over time, and so on.

With the cables, it was a different kettle of fish. It was just a massive text file. We couldn’t just look for one thing and think, “oh, that’s the end of one entry and the beginning of the next.” We had a few guys working on this for two or three months, just trying to get it into a state where we could have it in a database. Once it was in a database, internally we could give it to our reporters to start interrogating and getting stories out of it.
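The kind of work Rogers describes, finding where one entry ends and the next begins and then loading the results into a database, can be illustrated with a minimal sketch. This is not the Guardian's actual method, and the dump layout below (an `ID:` line followed by a `DATE:` line) is invented purely for illustration; the real cables did not have boundaries this tidy, which is why the job took months. The names `RAW_DUMP`, `HEADER`, `split_records`, and `load` are all hypothetical.

```python
import re
import sqlite3

# Hypothetical dump: entries run together with no separator between them.
# The "ID:" / "DATE:" header layout is an assumption made for this sketch.
RAW_DUMP = """\
ID: A-001
DATE: 2009-01-05
Subject line one.
Body text continues here.
ID: A-002
DATE: 2009-02-11
Another entry, with no blank line separating it from the last.
"""

# A header is an ID line immediately followed by a DATE line.
HEADER = re.compile(r"^ID: (\S+)\nDATE: (\d{4}-\d{2}-\d{2})\n", re.MULTILINE)


def split_records(text):
    """Return (id, date, body) tuples, using header matches as boundaries."""
    matches = list(HEADER.finditer(text))
    records = []
    for i, match in enumerate(matches):
        # An entry's body runs until the next header, or to the end of the text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        body = text[match.end():end].strip()
        records.append((match.group(1), match.group(2), body))
    return records


def load(records):
    """Load the parsed entries into an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE entries (id TEXT PRIMARY KEY, date TEXT, body TEXT)"
    )
    conn.executemany("INSERT INTO entries VALUES (?, ?, ?)", records)
    conn.commit()
    return conn


records = split_records(RAW_DUMP)
conn = load(records)

# Once the text is in a database, it can be "interrogated" with a query.
rows = conn.execute(
    "SELECT id FROM entries WHERE body LIKE ? ORDER BY date", ("%entry%",)
).fetchall()
```

Here `rows` holds the IDs of entries whose body mentions "entry". The hard part in practice is the regular expression: with real-world text, header patterns are inconsistent, and each exception has to be found and handled before the database is trustworthy.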

During the same interview, Rogers said that giving readers the searchable data behind stories is a counterbalance to the public’s cynicism toward the media.

When we launched the Datablog, we thought it was just going to be developers [using it]. What it turned out to be, actually, is real people out there in the world who want to know what’s going on with a story. And I think part of that is the fact that people don’t trust journalists any more, really. They don’t trust us to be truthful and honest, so there’s a hunger to see the stories behind the stories.

For more about how Rogers’ group dealt with the WikiLeaks data and how data journalism works, check out the full interview in the following video:

