Will data be too cheap to meter?

Data acquisition for a site like CrunchBase may not carry the costs some assume.

Last week at Strata I got into an argument with a journalist over the future of CrunchBase. His position was that we were just in a “pre-commercial” world, that creating the database required a reporter’s time, and so after the current aberration had passed we’d return to the old status quo where this kind of information was only available through paid services. I wasn’t so sure.

When I explain to people why the Big Data movement is important (why it’s a real change instead of a fad), I point to price as the fundamental difference between the old and new worlds. Until a few years ago, the state of the art for doing meaningful analysis of multi-gigabyte data sets was the data warehouse. These custom systems were very capable, but could easily cost millions of dollars. Today I can hire a hundred-machine Hadoop cluster from Amazon for just $10 an hour, and process thousands of gigabytes a day.
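To put rough numbers on that price point, here is a back-of-the-envelope sketch. The $10 an hour and the hundred machines come from the paragraph above; the 1,000 GB per day figure (the low end of "thousands") and the assumption that the cluster runs around the clock are mine, purely for illustration.

```python
# Figures taken from the text.
CLUSTER_COST_PER_HOUR = 10.0  # dollars for the whole cluster
MACHINES = 100

# Assumptions for illustration only (not from the text).
HOURS_PER_DAY = 24            # cluster runs continuously
ASSUMED_GB_PER_DAY = 1000     # low end of "thousands of gigabytes a day"

cost_per_machine_hour = CLUSTER_COST_PER_HOUR / MACHINES
cost_per_day = CLUSTER_COST_PER_HOUR * HOURS_PER_DAY
cost_per_gb = cost_per_day / ASSUMED_GB_PER_DAY

print(f"${cost_per_machine_hour:.2f} per machine-hour")  # $0.10
print(f"${cost_per_day:.2f} per day")                    # $240.00
print(f"${cost_per_gb:.2f} per gigabyte processed")      # $0.24
```

Under those assumptions, processing costs at most a quarter of a dollar per gigabyte, and less if the cluster really handles several thousand gigabytes a day.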

This represents a massive discontinuity in price, and it’s why Big Data is so disruptive. Suddenly we can imagine a group of kids in their garage building Google-scale systems practically on pocket money. While the drop in the cost of data storage and transmission has been less dramatic, it has followed a steady downward trend over the decades. Now that processing has become cheap too, a whole universe of poverty-stricken hackers, academics, makers, reporters, and startups can do interesting things with massive data sets.

What does this have to do with CrunchBase? The reporter had some implicit assumptions about the cost of the data collection process. He argued that it required extra effort from the journalists to create the additional value captured in the database. To paraphrase him: “It’s time they’d rather spend at home playing with their kids, and so we’ll end up compensating them for their work if we want them to continue producing it.” What I felt was missing from this is that CrunchBase might actually be just a side-effect of work they’d be doing even if it weren’t released for public consumption.

Many news organizations are taking advantage of the dropping cost of data handling by heavily automating their news-gathering and publishing workflows. This can be as simple as setting up Google Alerts or scanning large collections of RSS feeds, or it can involve scraping tools that gather public web data, along with a myriad of other information-processing techniques. Internally there’s a need to keep track of the results of manual or automated research, and so the most advanced organizations are using some kind of structured database to capture the information for future use.
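As a minimal sketch of the kind of workflow described above, the following parses an RSS feed and records each story in a structured database. It is not how TechCrunch or any particular newsroom does it: the feed content, company names, table layout, and function names are all hypothetical, and a real pipeline would fetch feeds over HTTP rather than use an inline sample.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical feed content standing in for a real RSS download.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example startup news</title>
    <item>
      <title>Acme Widgets raises Series A</title>
      <link>http://example.com/acme-series-a</link>
      <pubDate>Mon, 07 Feb 2011 09:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Globex acquires Initech</title>
      <link>http://example.com/globex-initech</link>
      <pubDate>Tue, 08 Feb 2011 14:30:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""


def parse_items(rss_text):
    """Extract title, link, and publication date from each <item>."""
    root = ET.fromstring(rss_text)
    return [
        {
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "published": item.findtext("pubDate", default=""),
        }
        for item in root.iter("item")
    ]


def store_items(conn, items):
    """Save stories, using the link as a key so re-scans add no duplicates."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS stories "
        "(link TEXT PRIMARY KEY, title TEXT, published TEXT)"
    )
    conn.executemany(
        "INSERT OR IGNORE INTO stories (link, title, published) "
        "VALUES (:link, :title, :published)",
        items,
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
store_items(conn, parse_items(SAMPLE_RSS))
store_items(conn, parse_items(SAMPLE_RSS))  # second scan of the same feed

rows = conn.execute("SELECT title FROM stories ORDER BY title").fetchall()
for (title,) in rows:
    print(title)
```

The point of the sketch is that once research lands in a table like this for internal use, exposing it publicly is a small additional step.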

That means that the only extra effort required to release something like CrunchBase is publishing it to the web. Assuming that there are some benefits to doing so (that TechCrunch’s reputation as the site-of-record for technology company news is enhanced, for example) and that there are multiple companies with the data available, then the low cost of the release will mean it makes sense to give it away.

I actually don’t know if all these assumptions are true, and CrunchBase’s approach may not be sustainable, but I hope it illustrates how a truly radical change in price can upset the traditional rules. Even on a competitive, commercial, free-market playing field it sometimes makes sense to behave in ways that appear hopelessly altruistic. We’ve seen this play out with open-source software. I expect to see pricing forces do something similar to open up more and more sources of data.

I’m usually the contrarian guy in the room arguing that information wants to be paid, so I don’t actually believe (as Lewis Strauss famously said about electricity) all data will be too cheap to meter. Instead I’m hoping we’ll head toward a world where producers of information are paid for adding real value. Too many “premium” data sets are collated and merged from other computerized sources, and that process should be increasingly automatic, and so increasingly cheap. Give me a raw CrunchBase culled from press releases and filings for free, then charge me for your informed opinion on how likely the companies are to pay their bills if I extend them credit. Just as free, open-source software has served as the foundation for some very lucrative businesses, the new world of free public data will trigger a flood of innovations that will end up generating value in ways we can’t foresee, and that we’ll be happy to pay for.


  • Funny –

    My sense is that there isn’t a lot of editorial in crunchbase (except stuff that’s lifted from Techcrunch articles, which do indeed result from editorial effort.) An awful lot of the material has always seemed to me to be spidered or cut and paste from other sites (e.g. we’ve often found speaker biographies from O’Reilly conference listings as the bio of people listed in Crunchbase.)

    More substantially, I don’t think Crunchbase counts as a big data play in any case. If it’s not big enough to require algorithmic curation, it’s not big enough to matter.

    It’s actually small data that people will pay for directly (stuff that’s very targeted and high value – see for example RGEmonitor.com); big data will almost always be monetized by other means (e.g. advertising like Yelp, Groupon, Facebook, or Google, e-commerce like Amazon or Apple (iTunes/App Store), financial trading, business services like Palantir or Passur Aerospace, oil and gas exploration, etc.)

  • Ian

    Small Data–you read it here first!

  • Thanks Tim. I take your point that the scale of Crunchbase is well below what should qualify as Big Data. It’s more that the changes that have made truly large-scale data processing affordable make the technical side of publishing this size of data set practically free.

    I’m not so sold on algorithmic curation as a litmus test though. I’d argue that Wikipedia and Facebook’s data sets should qualify, and while there are bots and other helpers in those systems, they both rely heavily on human editors.