Three kinds of big data

Looking ahead at big data's role in enterprise business intelligence, civil engineering, and customer relationship optimization.

In the past couple of years, marketers and pundits have spent a lot of time labeling everything “big data.” The reasoning goes something like this:

  • Everything is on the Internet.
  • The Internet has a lot of data.
  • Therefore, everything is big data.

When you have a hammer, everything looks like a nail. When you have a Hadoop deployment, everything looks like big data. And if you’re trying to cloak your company in the mantle of a burgeoning industry, big data will do just fine. But seeing big data everywhere is a sure way to hasten the inevitable fall from the peak of inflated expectations to the trough of disillusionment.

We saw this with cloud computing. From early idealists saying everything would live in a magical, limitless, free data center to today’s pragmatism about virtualization and infrastructure, we soon took off our rose-colored glasses and put on welding goggles so we could actually build stuff.

So where will big data go to grow up?

Once we get over ourselves and start rolling up our sleeves, I think big data will fall into three major buckets: Enterprise BI, Civil Engineering, and Customer Relationship Optimization. This is where we’ll see most IT spending, most government oversight, and most early adoption in the next few years.

Enterprise BI 2.0

For decades, analysts have relied on business intelligence (BI) products like Hyperion, MicroStrategy, and Cognos to crunch large amounts of information and generate reports. Data warehouses and BI tools are great at answering the same question—such as “what were Mary’s sales this quarter?”—over and over again. But they’ve been less good at the exploratory, what-if, unpredictable questions that matter for planning and decision-making, because that kind of fast exploration of unstructured data is traditionally hard to do and therefore expensive.

Most “legacy” BI tools are constrained in two ways:

  • First, they’ve been schema-then-capture tools in which the analyst decides what to collect, then later captures that data for analysis.
  • Second, they’ve typically focused on reporting what Avinash Kaushik (channeling Donald Rumsfeld) refers to as “known unknowns”—things we know we don’t know, and therefore generate reports on.

These tools are used for reporting and operational purposes, and are usually focused on controlling costs, executing against an existing plan, and reporting on how things are going.

As my Strata co-chair Edd Dumbill pointed out when I asked for thoughts on this piece:

“The predominant functional application of big data technologies today is in ETL (Extract, Transform, and Load). I’ve heard the figure that it’s about 80% of Hadoop applications. Just the real grunt work of log file or sensor processing before loading into an analytic database like Vertica.”

The availability of cheap, fast computers and storage, along with open source tools, has made it okay to capture first and ask questions later. That changes how we use data because it lets analysts speculate beyond the initial question that triggered its collection.
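To make the "grunt work" Edd describes concrete, here is a minimal sketch of that kind of ETL step: parsing raw web-server logs (Common Log Format is assumed here for illustration) into flat CSV rows ready for a bulk load into an analytic database. The log format, field names, and sample lines are all hypothetical, not taken from any particular deployment.

```python
import csv
import re
from io import StringIO

# Hypothetical Common Log Format pattern; real pipelines would
# tailor this to their own log layout.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<bytes>\d+|-)'
)

def extract(lines):
    """Yield structured records, silently skipping lines that don't parse."""
    for line in lines:
        m = LOG_PATTERN.match(line)
        if m:
            record = m.groupdict()
            record["bytes"] = 0 if record["bytes"] == "-" else int(record["bytes"])
            yield record

def load_as_csv(records, out):
    """Flatten records into CSV, the shape a COPY-style bulk load expects."""
    writer = csv.DictWriter(
        out, fieldnames=["ip", "ts", "method", "path", "status", "bytes"]
    )
    writer.writeheader()
    for rec in records:
        writer.writerow(rec)

# Illustrative input: one well-formed line, one piece of junk.
raw = [
    '203.0.113.7 - - [10/Oct/2012:13:55:36 -0700] "GET /index.html HTTP/1.1" 200 2326',
    'garbage line that fails to parse',
]
buf = StringIO()
load_as_csv(extract(raw), buf)
print(buf.getvalue())
```

The point is the shape of the work, not the specifics: capture everything, normalize what parses, and defer the interesting questions to the analytic database downstream.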

What’s more, the speed with which we can get results—sometimes as fast as a human can ask questions—makes data easier to explore interactively. This combination of interactivity and speculation takes BI into the realm of “unknown unknowns,” the insights that can produce a competitive advantage or an out-of-the-box differentiator.

Cloud computing underwent a transition from promise to compromise. First, big public clouds wooed green-field startups. Then, a few years later, incumbent IT vendors introduced private cloud offerings. These private clouds included only a fraction of the benefits of their public cousins—but were nevertheless a sufficient blend of smoke, mirrors, and features to delay the inevitable move to public resources by a few years and appease the business. For better or worse, that’s where most IT cloud budgets are being spent today, according to IDC, Gartner, and others. Big data adoption will undergo a similar cycle.

In the next few years, then, look for acquisitions and product introductions—and not a little vaporware—as BI vendors that enterprises trust bring them “big data lite”: enough innovation and disruption to satisfy the CEO’s golf buddies, but not so much that enterprise IT’s jobs are threatened. This, after all, is how change comes to big organizations.

Ultimately, we’ll see traditional “known unknowns” BI reporting living alongside big-data-powered data import and cleanup, and fast, exploratory “unknown unknowns” interactivity.

Civil Engineering

The second use of big data is in society and government. Already, data mining can be used to predict disease outbreaks, understand traffic patterns, and improve education.

Cities are facing budget crunches, infrastructure problems, and an influx of new residents from rural areas. Solving these problems is urgent, and cities are perfect labs for big data initiatives. Take a metropolis like New York: it has hackathons, open feeds of public data, and a population that generates a flood of information as it shops, commutes, gets sick, eats, and just goes about its daily life.

DataGotham is just one example of a city’s efforts to hack itself.

I think municipal data is one of the big three for several reasons: hard numbers are a good tie-breaker for partisanship, we have new interfaces everyone can understand, and we finally have a mostly-connected citizenry.

In an era of partisan bickering, hard numbers can settle the debate. So, they’re not just good government; they’re good politics. Expect to see big data applied to social issues, helping us to make funding more effective and scarce government resources more efficient (perhaps to the chagrin of some public servants and lobbyists). As this works in the world’s biggest cities, it’ll spread to smaller ones, to states, and to municipalities.

Making data accessible to citizens is possible, too: Siri and Google Now show the potential for personalized agents; Narrative Science takes complex data and turns it into words the masses can consume easily; Watson and Wolfram Alpha can give smart answers, either through curated reasoning or making smart guesses.

For the first time, we have a connected citizenry armed (for the most part) with smartphones. Nielsen estimated that smartphones would overtake feature phones in 2011, and that concentration is high in urban cores. The App Store is full of apps for bus schedules, commuters, local events, and other tools that can quickly become how governments connect with their citizens and manage their bureaucracies.

The consequence of all this, of course, is more data. Once governments go digital, their interactions with citizens can be easily instrumented and analyzed for waste or efficiency. That’s sure to provoke resistance from those who don’t like the scrutiny or accountability, but it’s a side effect of digitization: every industry that goes digital gets analyzed and optimized, whether it likes it or not.

Customer Relationship Optimization

The final home of applied big data is marketing. More specifically, it’s improving the relationship with consumers so companies can, as Sergio Zyman once said, sell them more stuff, more often, for more money, more efficiently.

The biggest data systems today are focused on web analytics, ad optimization, and the like. Many of today’s most popular architectures were weaned on ads and marketing, and have their ancestry in direct marketing plans. They’re just more focused than the comparatively blunt instruments with which direct marketers used to work.

The number of contact points in a company has multiplied significantly. Where once there was a phone number and a mailing address, today there are web pages, social media accounts, and more. Tracking users across all these channels, and turning every click, like, share, friend, or retweet into the start of a long funnel that leads inexorably to revenue, is a big challenge. It’s also one that companies like Salesforce understand, with investments in chat, social media monitoring, co-browsing, and more.

This is what’s lately been referred to as the “360-degree customer view” (though it’s not clear that companies will actually act on customer data if they have it, or whether doing so will become a compliance minefield). Big data is already intricately linked to online marketing, but it will branch out in two ways.

First, it’ll go from online to offline. Near-field-equipped smartphones with ambient check-in are a marketer’s wet dream, and they’re coming to pockets everywhere. It’ll be possible to track queue lengths, store traffic, and more, giving retailers fresh insights into their brick-and-mortar sales. Ultimately, companies will bring the optimization that online retail has enjoyed to an offline world as consumers become trackable.

Second, it’ll go from Wall Street (or maybe that’s Madison Avenue and Middlefield Road) to Main Street. Tools will get easier to use, and while small businesses might not have a BI platform, they’ll have a tablet or a smartphone that they can bring to their places of business. Mobile payment players like Square are already making them reconsider the checkout process. Adding portable customer intelligence to the tool suite of local companies will broaden how we use marketing tools.

Headlong into the trough

That’s my bet for the next three years, given the molasses of market confusion, vendor promises, and unrealistic expectations we’re about to contend with. Will big data change the world? Absolutely. Will it be able to defy the usual cycle of earnest adoption, crushing disappointment, and eventual rebirth all technologies must travel? Certainly not.

