There's no such thing as big data

Even if you have petabytes of data, you still need to know how to ask the right questions to apply it.

“You know,” said a good friend of mine last week, “there’s really no such thing as big data.”

I sighed a bit inside. In the past few years, cloud computing critics have said similar things: that clouds are nothing new, that they’re just mainframes, that they’re just painting old technologies with a cloud brush to help sales. I’m wary of this sort of techno-Luddism. But this person is sharp, and not usually prone to verbal linkbait, so I dug deeper.

He’s a ridiculously heavy traveler, racking up hundreds of thousands of miles in the air each year. He’s the kind of flier airlines dream of: loyal, well-heeled, and prone to last-minute, business-class trips. He’s exactly the kind of person an airline needs to court aggressively, one who represents a disproportionately large share of revenue. He’s an outlier of the best kind. He’d been a top-ranked passenger with United Airlines for nearly a decade, using their Mileage Plus program for everything from hotels to car rentals.

And then his company was acquired.

The acquiring firm had a contractual relationship with American Airlines, a competitor of United with a completely separate loyalty program. My friend’s air travel on United and its partner airlines dropped to nearly nothing.

He continued to book hotels in Shanghai, rent cars in Barcelona, and buy meals in Tahiti, and every one of those transactions was tied to his loyalty program with United. So the airline knew he was traveling — just not with them.

Astonishingly, nobody ever called him to inquire about why he’d stopped flying with them. As a result, he’s far less loyal than he was. But more importantly, United has lost a huge opportunity to try to win over a large company’s business, with a passionate and motivated inside advocate.

And this was his point about big data: that given how much traditional companies put it to work, it might as well not exist. Companies have countless ways they might use the treasure troves of data they have on us. Yet all of this data lies buried, sitting in silos. It seldom sees the light of day.

When a company does put data to use, it’s usually a disruptive startup. Zappos and customer service. Amazon and retailing. Craigslist and classified ads. Zillow and house purchases. LinkedIn and recruiting. eBay and payments. Ryanair and air travel. One by one, industry incumbents are withering under the harsh light of data.

Strata Jumpstart New York 2011, being held on September 19, is a crash course in how to manage the data deluge that’s transforming traditional business practices across the board. Jumpstart is an intense, day-long deep dive for managers, strategists, and entrepreneurs who are putting the promise of big data into practice.

Save 30% on registration with the code STN11RAD

Big data and the innovator’s dilemma

Large companies with entrenched business models tend to cling to their buggy-whips. They have a hard time breaking their own business models, as Clay Christensen so clearly stated in “The Innovator’s Dilemma,” but it’s too easy to point the finger at simple complacency.

Early-stage companies have a second advantage over more established ones: they can ask for forgiveness instead of permission. Because they have less to lose, they can make risky bets. In the early days of PayPal, the company could skirt regulations more easily than Visa or Mastercard, because it had far less to fear if it was shut down. This helped it gain market share while established credit-card companies were busy with paperwork.

The real problem is one of asking the right questions.

At a big data conference run by The Economist this spring, one of the speakers made a great point: Archimedes had taken baths before.

(Quick historical recap: In an almost certainly apocryphal tale, Hiero of Syracuse had asked Archimedes to devise a way of measuring density, an indicator of purity, in irregularly shaped objects like gold crowns. Archimedes realized that the level of water in a bath changed as he climbed in, making it an indicator of volume. Eureka!)

The speaker’s point was this: it was the question that prompted Archimedes’ realization.

Small, agile startups disrupt entire industries because they look at traditional problems with a new perspective. They’re fearless, because they have less to lose. But big, entrenched incumbents should still be able to compete, because they have massive amounts of data about their customers, their products, their employees, and their competitors. They fail because often they just don’t know how to ask the right questions.

In a recent study, McKinsey found that by 2018, the U.S. will face a shortage of 1.5 million managers who are fluent in data-based decision making. It’s a lesson not lost on leading business schools: several of them are introducing business courses in analytics.

Ultimately, this is what my friend’s airline example underscores. It takes an employee, deciding that the loss of high-value customers is important, to run a query of all their data and find him, and then turn that into a business advantage. Without the right questions, there really is no such thing as big data — and today, it’s the upstarts that are asking all the good questions.
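To make that concrete, here is a minimal sketch of the kind of query such an employee might run. Everything in it is an assumption for illustration: the record layout, the customer names, the dollar amounts, and the thresholds are hypothetical, not anything United or any other airline actually uses. The question it encodes is simple: which formerly heavy fliers have nearly stopped flying with us, yet are still booking hotels and cars through our loyalty program?

```python
from collections import defaultdict

# Hypothetical loyalty-program records: (customer_id, year, category, amount_usd).
# All names and figures are invented for illustration.
TRANSACTIONS = [
    ("frequent_flyer_1", 2010, "flight", 90000.0),
    ("frequent_flyer_1", 2010, "hotel", 12000.0),
    ("frequent_flyer_1", 2011, "flight", 1500.0),
    ("frequent_flyer_1", 2011, "hotel", 14000.0),
    ("frequent_flyer_1", 2011, "car_rental", 3000.0),
    ("steady_flyer_2", 2010, "flight", 60000.0),
    ("steady_flyer_2", 2011, "flight", 58000.0),
    ("retired_3", 2010, "flight", 70000.0),
    ("occasional_4", 2010, "flight", 800.0),
    ("occasional_4", 2011, "hotel", 400.0),
]


def find_lapsed_flyers(transactions, prior_year, current_year,
                       min_prior_flight_spend=50000.0,
                       max_retained_share=0.25):
    """Return high-value customers who stopped flying but kept traveling.

    The two thresholds are assumptions: a customer counts as high-value if
    they spent at least min_prior_flight_spend on flights in prior_year, and
    as lapsed if current flight spend is at most max_retained_share of that.
    """
    flight = defaultdict(float)   # (customer, year) -> flight spend
    partner = defaultdict(float)  # (customer, year) -> hotel, car, etc. spend
    for customer, year, category, amount in transactions:
        if category == "flight":
            flight[(customer, year)] += amount
        else:
            partner[(customer, year)] += amount

    lapsed = []
    for customer in sorted({row[0] for row in transactions}):
        before = flight[(customer, prior_year)]
        after = flight[(customer, current_year)]
        # Partner spend this year means the customer is still on the road.
        still_traveling = partner[(customer, current_year)] > 0
        if (before >= min_prior_flight_spend
                and after <= max_retained_share * before
                and still_traveling):
            lapsed.append(customer)
    return lapsed


print(find_lapsed_flyers(TRANSACTIONS, 2010, 2011))
```

The code is trivial, and that is the point. The hard part is deciding that this is a question worth asking, and then having someone pick up the phone once the list comes back.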

When it comes to big data, you either use it or lose.

This is what we’re hoping to explore at Strata Jumpstart in New York next month. Rather than taking a vertical look at a particular industry, we’re looking at the basics of business administration through a big data lens. We’ll be looking at applying big data to HR, strategic planning, risk management, competitive analysis, supply chain management, and so on. In a world flooded by too much data and too many answers, tomorrow’s business leaders need to learn how to ask the right questions.



  • Charlie

    “big data” may refer to a useful concept, but it leaves a distinctly “marketing buzzword” taste in my mouth, which puts me off reading further when I encounter it in an article.

    I implore writers considering using such a term to dig deeper and instead give a name to the part of their subject which is actually interesting.

  • It’s a valid point, but it’s also possible that United is using big data to solve other problems, even customer service problems.

    Sometimes the best questions (“Among customers we’re losing, which ones could be high-level sponsors into major new accounts?”) are not asked, because they’re considered outliers or are simply too obvious to attract attention.

  • @Charlie: I agree wholeheartedly. Big-tent terms tend to help with a concept, then lose all usefulness when trying to have useful discussions. In the case of cloud computing, that’s certainly what happened.

    @Robert: An example I didn’t use (because this was already too long) was that of Netflix. Reed Hastings’ business proposal was a spreadsheet. He simply asked, “what if I mailed DVDs to people instead of opening stores?” Know who had better data than him on rental patterns and where renters live? Blockbuster. Heck, they even had DVD inventory to spare.

    The real question is why isn’t Blockbuster Netflix? They failed to ask the question, in spite of having all the answers at their disposal.

  • The simple truth is that data is becoming much easier to obtain (especially Internet analytics) and much cheaper to store in enormous quantities. Along with asking the right questions and obtaining the right data, it is also imperative not to completely trust the data results, especially when statistical inference is involved. Results must be evaluated in context and combined with human logic. That combination, however, is powerful and likely far too uncommon.

  • Well…
    One of the advantages of well-designed decision making systems is that they help you evolve from hypothesis verification (where, as you say, “you still need to know how to ask the right questions to apply it.”) to an area of assisting hypothesis generation (via visualizations, running models and simulations, etc).

    In addition, a lot of the ‘right questions’ we ask are in a way ‘templates’, which are amenable to a lot of automatic data crunching that transform a vague idea into specific questions.
    E.g. for google flu trends the hypothesis is “people behave differently online [eg search for different things] when they don’t feel well” and “we can discover what those are based on specific disease historical information” and “we can build a predictive model around these”. *how* behavior changes online is a minor detail, the algorithm hitting the data can figure it out.

    I agree that there is nothing inherently ‘new’ about big data other than a subjective label of what ‘big’ means circa 2011; but the data available has been growing exponentially; in areas that were not easy for scientific research to go into.

    My point is that a system that forces you to “need to know how to ask the right questions to apply it” has room to grow; and I personally look forward to ones that help with hypothesis generation with more excitement than the alternative.

  • That’s why it’s called “Big Data”. If it was in some sort of useful form it would be called “Big Information”.

  • Alex Tolley

    “The real question is why isn’t Blockbuster Netflix? They failed to ask the question, in spite of having all the answers at their disposal.”

    My guess, based on experience in another business domain, is that Blockbuster believed that the stores were the barrier to entry. The whole emphasis on stores is immediate service to the customer – she walks in and rents a movie for the evening. That couldn’t be done by snail mail. They were blindsided by Netflix’ business model which allowed unlimited time to hold the movie – unthinkable if you needed inventory of the new movies in the stores on release to satisfy the immediacy demand.

  • Alex Tolley

    “Large companies with entrenched business models tend to cling to their buggy-whips. They have a hard time breaking their own business models, as Clay Christensen so clearly stated in ‘The Innovator’s Dilemma,’ but it’s too easy to point the finger at simple complacency.”

    You can ask all the right questions, but that won’t help if the answer is “whatever you do, the business will be worth less”. For incumbent senior execs, that would be committing financial suicide. Better to stay the course and negotiate an exit while your reputation and stock options are still good.

    Even with the best will in the world, there is going to be huge cognitive dissonance if the old model and institutions that have been invested in have to be swept away.

  • Love it!
    I would argue that big data does exist though. And just like so many things before it, people haven’t come to grips with how to make the best use of it just yet.
    – The first computers with email were in a room away from everyone, just like a fax machine – but there was email.

    I am taking the point that having the right question (followed very soon after by a clear answer) is the way you end up with a good use for data (big, small and all sizes in between) and it results in useful insights.
    And hopefully I get to help people bridge their data-to-insight gap :-)

  • I call a customer care center over five times on the same problem and they keep asking me the same question: “How may we help you today?” I am sure they record phone logs, which should also exist in a text format. With all the data they have, they should be able to know why I am calling and that they have not solved my problem. Accumulation of data is useless until it can produce useful information.

  • In the beginning Big Data referred to a new way to store data so that you could build internet sized data storage. After that we stopped inventing more terms and just lumped in new concepts around Big Data into that term. All the nuances we were discovering might as well be called Smurf.

    Now that we have the data stored in an accessible way in order to build things on it we’re starting to discover what we can do with it. What this article is talking about is Big Analytics. How do we make those connections between all this data and correlate it in meaningful ways? Right now the way that’s done is cumbersome. But even with all the smarty pants architecture and data at one’s disposal, you can’t make analytics ask the question for you.

  • As a long time builder of “Big Data” systems… I might point out that asking the “right” question, is a process. It starts with a fairly fundamental business measurement – and then drills down into the why.

    In your United use case, the root question would be – where are we losing our high value customers… and then having the systems and processes to build new categories as you drill into/across the data.

    and I really relate to the example, because I happen to be one of those United Million milers who stopped traveling… I just went to work for a different company…

  • I’m a humanist, in terms of academic training. Specifically, I’m a textual scholar, trained to examine texts, and, well, ask them questions, which I then use the text(s) to answer.

    It occurs to me that companies with large data structures might do well to engage humanists to ask questions of the data.

  • Big Data is a misnomer because data size is secondary. We use the term Big Data to refer to data that is stored in a non-traditional format which makes it harder to use. It just happens to be large in size.

    It should probably be called “New and different data which we could not think of a name for so we are calling Big”

  • Gunther Hust

    Forgive me for playing the heretic here, but your United example here is somewhat of a strawman. United does engage in data driven marketing. They’ve been mailing me offers for their Mileage Plus plan for over a year now. When I call in to ask them to please stop sending them, what do they do? They send more of them.

    I’m sure United like customers like your friend, but only up to a point. They don’t end of life mileage points for heavy users to keep them around, they do it so those customers have to ante up again with cash purchases to keep their status. United knows exactly how much each customer makes them, and which market segments represent overall revenue opportunities. If they didn’t, they wouldn’t be sending me $0.50 mailers 20 times a year. Going after one customer who is now with a competitor’s loyalty program might just not be at all profitable in their eyes.

    Companies are out there and definitely using big data, and have been for many years now.

  • I’ve spent the last two years pondering how to shake business models out of finding value in data which is easily available and ignored by others. FlightCaster and Recollect are two of my favourite examples of companies that took the same information everyone else had, applied creativity and integration smarts and somehow made the world better while making money.

    Demand Media isn’t my idea of a cool or even good company, but I respect the fact that they noticed humans are really bad at figuring out what they don’t know. Analyzing bulk search results to see what people were looking for but not finding is several shades of brilliant. It’s just too bad that they use that information to seed the creation of lowest common denominator content. is our attempt at making datasets a first-class entity on the web, turning them into social objects in the same way Flickr images and Vimeo videos are. I believe that by giving data publishers the ability to curate a dataset and connect with the people who use them, BuzzData is the most obvious place for cross-pollination of business ideas for datasets to occur on the web today.

    Further, I suspect that many of the potential innovators in the data entrepreneurship space will be from developing countries like Kenya and Nigeria. This excites me more than any other aspect of the data conversation.

    The fact is that Recollect could have been created anywhere, and the people who benefit from it won’t care where it’s from because they’re just happy it works at all. Getting SMS notifications to take out your garbage? That’s Star Trek talk to most people you see on the street.

    BTW, I’m perplexed by your choice of title for this article, Alistair. Isn’t claiming that there’s no such thing as Big Data because traditional corporations aren’t using it sort of like claiming there’s no such thing as meat because vegans don’t eat it? :)

  • The Big Data problem is not only about the size of the data but also about performance and how fast the data can be processed. For that, there are cloud services that offer platforms and tools to perform analytics quickly and efficiently.

  • Guy

    A quick observation:
    – Nobel laureates who won the prize for hard sciences were typically young (in their twenties when they did the work) while laureates who won in other categories were typically older.

    Perhaps this is because the “non-hard-science” disciplines require a practitioner to develop and refine a style, a technique, a skill while the hard sciences reward people who ask (and answer) new questions (something that is more likely when a person is young and before their thought processes have settled into a pattern, something that is a most human trait, even for scientists).

    If so, do companies exhibit similar traits? Is “asking the right question” more a function of the “youth” of the company than other circumstances? Do older, established companies suffer the corporate culture malady of “established thought processes”?

    For any researchers out there, this might be an interesting topic to keep you off the streets.

  • Steve

    I have used the term VAST data to refer to a four-way matrix array of Variables (V), Alternatives (A), Subjects (S), and Time (T), where one or more of these is in the thousands, tens of thousands, or even millions. VAST databases pose new challenges for the marketing scientist. Analysts are accustomed to reducing the scope of such databases by sampling subjects (consumers) or through data reduction on variables. Sampling subjects permits simpler computation, but also misses the nuances and idiosyncrasies available in the full dataset. Data reduction can be used because the data space is well-populated. Yet in a typical VAST dataset, as the number of these four dimensions increases beyond the usual several hundred, any local place in the “cloud” of data with several hundred variables is most likely empty. So in contrast to smaller datasets, where there are masses of data with a relative few outliers, in VAST databases, the data space is mostly empty with almost all data points as outliers.

    As such, VAST data requires:

    1. New algorithms for data analysis;
    2. State-of-the-art computational power; and/or,
    3. Striking a balance between computation and storage.

  • “big, entrenched incumbents should still be able to compete, because they have massive amounts of data about their customers, their products, their employees, and their competitors.”

    I love that point. I once read a fantastic piece by Michael Nielsen on why established industries get disrupted, and it’s often more about having a business model nearly wholly reliant on legacy infrastructure, norms and standards, than it is about being obtuse or backward, as so many people would like to believe. I.e., established companies have more to lose, as you said, so they avoid those ‘risky’ questions. Which is exactly why they should be making more intelligent use of the data they have, because it’s a way to hypothesize and assess risk, without actually *taking* risks.

    That said, I do think another issue inhibiting companies’ better use of data is not just silo-ing from a technological standpoint, but from a human resources standpoint. Too many companies presume data to be something only a very small fraction of their workforce is capable of wrangling or understanding. Effective, simple and easy communication of data within workplaces could make a huge difference to firms’ capacity to innovate.

    It’s like how they say the average person only uses 5% of their brain. Companies often restrict data communication/collaboration to 5% of their human resources. For the love of god, why?

  • SJ Bennett

    By the way, Mr. Croll, I understood your examples, which you offered to set up your topic, and not meant to be your point.
    I’m sure I am simple-minded at best, but I’ve spent the past few years trying to “dumb down” all my professional learnings and get back to using my God (and parent) given, common sense – first. In recent years, I’ve also learned that common sense is greatly enhanced with “life exposure” and success can be measured by how you apply facts (data) to (life) learnings. If today’s business leaders are unable to “ask the right questions”, I’d guess it’s because they’ve never had to actually “live those questions”. If they had, they’d likely have both the question and the answer and be able to use all that data to come up with new ideas and different solutions.
    Yes, I could be completely off-base and a good dose of common sense and the ability to see what’s happening right in front of you, might not have helped our congressional leaders during the debt crisis negotiations. All the available data and trending in the world still did not make them leaders.

  • Interesting: Even if you have petabytes of data, you still need to know how to ask the right questions to apply it.

    So if big data is about the questions one asks rather than the absolute size of that data, can “megabytes” be considered “big data”?

    Everyone likes to talk about how big you need to be until big is big. But if it’s really about the questions, rather than the size, I’d like to ask how small it can be, and still be big data.

    The reason is that there are some domains, tasks, verticals, etc. that will by their very nature never have petabytes, terabytes, or even gigabytes. Yet if one is still doing intelligent analytics and asking the right questions, can those megabytes still be considered “big data”?

  • There is a lot of buzz today about big data and companies stepping up to meet the challenge of ever increasing data volumes.
    Don’t forget ‘small’ data.
    What about “small” and medium-sized data? For example, data from spreadsheets, the occasional flat file, leads from a trade show, and catalog data from vendors may be vital to your business processes. With a new industry focus on transparency, business user involvement and sharing of data, small data is a constant issue.