Will data be too cheap to meter?

Last week at Strata I got into an argument with a journalist over the future of CrunchBase. His position was that we were just in a “pre-commercial” world, that creating the database required a reporter’s time, and so after the current aberration had passed we’d return to the old status quo where this kind of information was only available through paid services. I wasn’t so sure.

When I explain to people why the Big Data movement is important (why it’s a real change instead of a fad), I point to price as the fundamental difference between the old and new worlds. Until a few years ago, the state of the art for doing meaningful analysis of multi-gigabyte data sets was the data warehouse. These custom systems were very capable, but could easily cost millions of dollars. Today I can hire a hundred-machine Hadoop cluster from Amazon for just $10 an hour, and process thousands of gigabytes a day.
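To make the scale of that price gap concrete, here is a rough back-of-the-envelope sketch using the figures above. Only the $10 an hour and the hundred machines come from the post; the 1,000 GB per day throughput (the low end of "thousands of gigabytes") and the $1,000,000 warehouse budget (the low end of "millions of dollars") are assumptions chosen for illustration.

```python
# Figures quoted in the post.
CLUSTER_MACHINES = 100
CLUSTER_COST_PER_HOUR = 10.00  # USD for the whole cluster

# Assumptions for illustration only (not from the post):
ASSUMED_GB_PER_DAY = 1_000            # low end of "thousands of gigabytes a day"
ASSUMED_WAREHOUSE_BUDGET = 1_000_000  # low end of "millions of dollars"


def daily_cost(cost_per_hour: float, hours: float = 24) -> float:
    """Cost of running the cluster around the clock for one day."""
    return cost_per_hour * hours


def cost_per_gb(cost_per_day: float, gb_per_day: float) -> float:
    """Processing cost per gigabyte at the assumed daily throughput."""
    return cost_per_day / gb_per_day


def years_of_cluster_time(budget: float, cost_per_hour: float) -> float:
    """How long the budget would keep the cluster running nonstop."""
    hours = budget / cost_per_hour
    return hours / (24 * 365)


per_machine_hour = CLUSTER_COST_PER_HOUR / CLUSTER_MACHINES
per_day = daily_cost(CLUSTER_COST_PER_HOUR)
per_gb = cost_per_gb(per_day, ASSUMED_GB_PER_DAY)
years = years_of_cluster_time(ASSUMED_WAREHOUSE_BUDGET, CLUSTER_COST_PER_HOUR)

print(f"Per machine-hour: ${per_machine_hour:.2f}")
print(f"Per day:          ${per_day:.2f}")
print(f"Per gigabyte:     ${per_gb:.2f} (at the assumed throughput)")
print(f"Years of cluster time for the assumed warehouse budget: {years:.1f}")
```

Under these assumptions, a single low-end warehouse budget would pay for more than a decade of continuous cluster time, which is the kind of discontinuity the argument rests on.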

This represents a massive discontinuity in price, and it’s why Big Data is so disruptive. Suddenly we can imagine a group of kids in their garage building Google-scale systems practically on pocket money. While the drop in the cost of data storage and transmission has been less dramatic, it has followed a steady downward trend over the decades. Now that processing has become cheap too, a whole universe of poverty-stricken hackers, academics, makers, reporters, and startups can do interesting things with massive data sets.

What does this have to do with CrunchBase? The reporter had some implicit assumptions about the cost of the data collection process. He argued that it required extra effort from the journalists to create the additional value captured in the database. To paraphrase him: “It’s time they’d rather spend at home playing with their kids, and so we’ll end up compensating them for their work if we want them to continue producing it.” What I felt was missing from this is that CrunchBase might actually be just a side-effect of work they’d be doing even if it wasn’t released for public consumption.

Many news organizations are taking advantage of the dropping cost of data handling by heavily automating their news-gathering and publishing workflows. This can be as simple as setting up Google Alerts or scanning large collections of RSS feeds, or it can involve scraping tools that gather public web data, and there are myriad other information-processing techniques out there. Internally there’s a need to keep track of the results of manual or automated research, and so the most advanced organizations are using some kind of structured database to capture the information for future use.
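As a minimal sketch of what such a workflow might look like, the snippet below parses an RSS feed and records each item in a small structured database. Everything in it is hypothetical: the sample feed, the company names, the `mentions` table, and the function names are invented for illustration, and I have no knowledge of how CrunchBase or any newsroom actually does this. It uses an inline feed rather than fetching one so that it runs without network access.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical sample feed; a real workflow would fetch many feeds over HTTP.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example funding news</title>
    <item>
      <title>Acme Widgets raises seed round</title>
      <link>https://example.com/acme-seed</link>
      <pubDate>Mon, 07 Feb 2011 09:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Initech acquired by Globex</title>
      <link>https://example.com/initech-globex</link>
      <pubDate>Tue, 08 Feb 2011 14:30:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""


def parse_items(feed_xml: str) -> list[tuple[str, str, str]]:
    """Extract (link, title, published) tuples from an RSS 2.0 document."""
    root = ET.fromstring(feed_xml)
    items = []
    for item in root.iter("item"):
        link = item.findtext("link", default="").strip()
        title = item.findtext("title", default="").strip()
        published = item.findtext("pubDate", default="").strip()
        items.append((link, title, published))
    return items


def store_items(conn: sqlite3.Connection, items: list[tuple[str, str, str]]) -> int:
    """Save items, skipping links already seen; return the total row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS mentions ("
        "link TEXT PRIMARY KEY, title TEXT, published TEXT)"
    )
    # INSERT OR IGNORE makes repeated scans of the same feed harmless.
    conn.executemany("INSERT OR IGNORE INTO mentions VALUES (?, ?, ?)", items)
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM mentions").fetchone()[0]


conn = sqlite3.connect(":memory:")
total = store_items(conn, parse_items(SAMPLE_FEED))
print(f"Stored {total} items")
```

The point of the sketch is that once research results land in a table like this for internal use, the table itself already exists whether or not anyone outside ever sees it.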

That means that the only extra effort required to release something like CrunchBase is publishing it to the web. Assuming that there are some benefits to doing so (that TechCrunch’s reputation as the site-of-record for technology company news is enhanced, for example) and that there are multiple companies with the data available, then the low cost of the release will mean it makes sense to give it away.

I actually don’t know if all these assumptions are true, and CrunchBase’s approach may not be sustainable, but I hope it illustrates how a truly radical change in price can upset the traditional rules. Even on a competitive, commercial, free-market playing field it sometimes makes sense to behave in ways that appear hopelessly altruistic. We’ve seen this play out with open-source software. I expect to see pricing forces do something similar to open up more and more sources of data.

I’m usually the contrarian guy in the room arguing that information wants to be paid for, so I don’t actually believe (as Lewis Strauss famously said about electricity) that all data will be too cheap to meter. Instead I’m hoping we’ll head toward a world where producers of information are paid for adding real value. Too many “premium” data sets are collated and merged from other computerized sources, and that process should be increasingly automatic, and so increasingly cheap. Give me a raw CrunchBase culled from press releases and filings for free, then charge me for your informed opinion on how likely the companies are to pay their bills if I extend them credit. Just as free, open-source software has served as the foundation for some very lucrative businesses, the new world of free public data will trigger a flood of innovations that will end up generating value in ways we can’t foresee, and that we’ll be happy to pay for.
