What is big data?

An introduction to the big data landscape.

Big data is data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn’t fit the strictures of your database architectures. To gain value from this data, you must choose an alternative way to process it.

The hot IT buzzword of 2012, big data has become viable as cost-effective approaches have emerged to tame the volume, velocity and variability of massive data. Within this data lie valuable patterns and information, previously hidden because of the amount of work required to extract them. To leading corporations, such as Walmart or Google, this power has been within reach for some time, but at fantastic cost. Today’s commodity hardware, cloud architectures and open source software bring big data processing into the reach of the less well-resourced. Big data processing is eminently feasible for even small garage startups, which can cheaply rent server time in the cloud.

The value of big data to an organization falls into two categories: analytical use, and enabling new products. Big data analytics can reveal insights hidden previously by data too costly to process, such as peer influence among customers, revealed by analyzing shoppers’ transactions, social and geographical data. Being able to process every item of data in reasonable time removes the troublesome need for sampling and promotes an investigative approach to data, in contrast to the somewhat static nature of running predetermined reports.

The past decade’s successful web startups are prime examples of big data used as an enabler of new products and services. For example, by combining a large number of signals from a user’s actions and those of their friends, Facebook has been able to craft a highly personalized user experience and create a new kind of advertising business. It’s no coincidence that the lion’s share of ideas and tools underpinning big data have emerged from Google, Yahoo, Amazon and Facebook.

The emergence of big data into the enterprise brings with it a necessary counterpart: agility. Successfully exploiting the value in big data requires experimentation and exploration. Whether creating new products or looking for ways to gain competitive advantage, the job calls for curiosity and an entrepreneurial outlook.


What does big data look like?

As a catch-all term, “big data” can be pretty nebulous, in the same way that the term “cloud” covers diverse technologies. Input data to big data systems could be chatter from social networks, web server logs, traffic flow sensors, satellite imagery, broadcast audio streams, banking transactions, MP3s of rock music, the content of web pages, scans of government documents, GPS trails, telemetry from automobiles or financial market data; the list goes on. Are these all really the same thing?

To clarify matters, the three Vs of volume, velocity and variety are commonly used to characterize different aspects of big data. They’re a helpful lens through which to view and understand the nature of the data and the software platforms available to exploit it. Most probably you will contend with each of the Vs to one degree or another.


Volume

The benefit gained from the ability to process large amounts of information is the main attraction of big data analytics. Having more data beats having better models: simple bits of math can be unreasonably effective given large amounts of data. If you could run that forecast taking into account 300 factors rather than 6, could you predict demand better?

This volume presents the most immediate challenge to conventional IT structures. It calls for scalable storage, and a distributed approach to querying. Many companies already have large amounts of archived data, perhaps in the form of logs, but not the capacity to process it.

Assuming that the volumes of data are larger than those conventional relational database infrastructures can cope with, processing options break down broadly into a choice between massively parallel processing architectures — data warehouses or databases such as Greenplum — and Apache Hadoop-based solutions. This choice is often informed by the degree to which one of the other “Vs” — variety — comes into play. Typically, data warehousing approaches involve predetermined schemas, suiting a regular and slowly evolving dataset. Apache Hadoop, on the other hand, places no conditions on the structure of the data it can process.

At its core, Hadoop is a platform for distributing computing problems across a number of servers. First developed and released as open source by Yahoo, it implements the MapReduce approach pioneered by Google in compiling its search indexes. Hadoop’s MapReduce involves distributing a dataset among multiple servers and operating on the data: the “map” stage. The partial results are then recombined: the “reduce” stage.
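The two stages can be illustrated with a toy word count, the canonical MapReduce example. This is a minimal in-process sketch of the idea, not Hadoop API code; each “document” stands in for a data split held on one server:

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    # "map": emit a (key, value) pair for every word in the input split
    return [(word, 1) for word in document.split()]

def reduce_phase(pairs):
    # "reduce": recombine the partial results by key
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

splits = ["big data moves fast", "big data is big"]
mapped = chain.from_iterable(map_phase(doc) for doc in splits)
counts = reduce_phase(mapped)
print(counts["big"])  # 3
```

In a real cluster, the map calls run in parallel on the servers holding each split, and the framework shuffles the intermediate pairs to the reducers by key.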

To store data, Hadoop utilizes its own distributed filesystem, HDFS, which makes data available to multiple computing nodes. A typical Hadoop usage pattern involves three stages:

  • loading data into HDFS,
  • MapReduce operations, and
  • retrieving results from HDFS.

This process is by nature a batch operation, suited for analytical or non-interactive computing tasks. Because of this, Hadoop is not itself a database or data warehouse solution, but can act as an analytical adjunct to one.

One of the most well-known Hadoop users is Facebook, whose model follows this pattern. A MySQL database stores the core data. This is then reflected into Hadoop, where computations occur, such as creating recommendations for you based on your friends’ interests. Facebook then transfers the results back into MySQL, for use in pages served to users.


Velocity

The importance of data’s velocity — the increasing rate at which data flows into an organization — has followed a similar pattern to that of volume. Problems previously restricted to segments of industry are now presenting themselves in a much broader setting. Specialists such as financial traders have long turned systems that cope with fast-moving data to their advantage. Now it’s our turn.

Why is that so? The Internet and mobile era means that the way we deliver and consume products and services is increasingly instrumented, generating a data flow back to the provider. Online retailers are able to compile large histories of customers’ every click and interaction: not just the final sales. Those who are able to quickly utilize that information, by recommending additional purchases, for instance, gain competitive advantage. The smartphone era increases again the rate of data inflow, as consumers carry with them a streaming source of geolocated imagery and audio data.

It’s not just the velocity of the incoming data that’s the issue: it’s possible to stream fast-moving data into bulk storage for later batch processing, for example. The importance lies in the speed of the feedback loop, taking data from input through to decision. A commercial from IBM makes the point that you wouldn’t cross the road if all you had was a five-minute old snapshot of traffic location. There are times when you simply won’t be able to wait for a report to run or a Hadoop job to complete.

Industry terminology for such fast-moving data tends to be either “streaming data” or “complex event processing.” The latter term was more established in product categories before streaming data processing gained widespread relevance, and seems likely to diminish in favor of “streaming.”

There are two main reasons to consider streaming processing. The first is when the input data are too fast to store in their entirety: in order to keep storage requirements practical some level of analysis must occur as the data streams in. At the extreme end of the scale, the Large Hadron Collider at CERN generates so much data that scientists must discard the overwhelming majority of it — hoping hard they’ve not thrown away anything useful. The second reason to consider streaming is where the application mandates immediate response to the data. Thanks to the rise of mobile applications and online gaming this is an increasingly common situation.
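The first case can be sketched in a few lines: analyze the data as it streams in, retaining only a running summary rather than the raw feed. The sensor readings and window size here are purely illustrative:

```python
from collections import deque

class StreamingMean:
    """Keep a running mean over the last `window` readings,
    discarding older raw values instead of storing everything."""
    def __init__(self, window=1000):
        self.recent = deque(maxlen=window)  # old readings fall off automatically

    def observe(self, value):
        self.recent.append(value)

    @property
    def mean(self):
        return sum(self.recent) / len(self.recent)

monitor = StreamingMean(window=3)
for reading in [10, 20, 30, 40]:   # simulated sensor stream
    monitor.observe(reading)
print(monitor.mean)  # 30.0 — only the last three readings are retained
```

Storage stays bounded no matter how long the stream runs, which is exactly the property that makes in-stream analysis attractive when the raw feed is too fast to keep.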

Product categories for handling streaming data divide into established proprietary products, such as IBM’s InfoSphere Streams, and the less-polished, still-emergent open source frameworks originating in the web industry: Twitter’s Storm and Yahoo’s S4.

As mentioned above, it’s not just about input data. The velocity of a system’s outputs can matter too. The tighter the feedback loop, the greater the competitive advantage. The results might go directly into a product, such as Facebook’s recommendations, or into dashboards used to drive decision-making.

It’s this need for speed, particularly on the web, that has driven the development of key-value stores and columnar databases, optimized for the fast retrieval of precomputed information. These databases form part of an umbrella category known as NoSQL, used when relational models aren’t the right fit.
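The access pattern these stores serve is simple: compute results offline in batch, then serve them at request time with a single lookup by key. A dictionary-backed sketch of that pattern (a real deployment would use a store such as Redis or Cassandra, and the recommendation logic here is a stand-in, not a real algorithm):

```python
# Offline/batch step: precompute the expensive result for each user.
def batch_compute_recommendations(purchase_log):
    store = {}
    for user, items in purchase_log.items():
        # Stand-in "model": dedupe and take the first few items.
        store[f"recs:{user}"] = sorted(set(items))[:3]
    return store

kv_store = batch_compute_recommendations({
    "alice": ["tent", "boots", "stove", "boots"],
    "bob": ["novel", "lamp"],
})

# Online step: page rendering does one key lookup, no computation.
print(kv_store["recs:alice"])
```

The feedback loop stays tight because the serving path never recomputes anything; freshness is governed by how often the batch step reruns.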



Variety

Rarely does data present itself in a form perfectly ordered and ready for processing. A common theme in big data systems is that the source data is diverse, and doesn’t fall into neat relational structures. It could be text from social networks, image data, or a raw feed directly from a sensor source. None of these things come ready for integration into an application.

Even on the web, where computer-to-computer communication ought to bring some guarantees, the reality of data is messy. Different browsers send different data, users withhold information, they may be using differing software versions or vendors to communicate with you. And you can bet that if part of the process involves a human, there will be error and inconsistency.

A common use of big data processing is to take unstructured data and extract ordered meaning, for consumption either by humans or as a structured input to an application. One such example is entity resolution, the process of determining exactly what a name refers to. Is this city London, England, or London, Texas? By the time your business logic gets to it, you don’t want to be guessing.
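A toy sketch of that disambiguation step, scoring candidate entities by how well the words around the mention overlap a set of evidence terms. The candidates and heuristic are hypothetical; production entity resolution uses far richer models:

```python
# Hypothetical evidence sets for each candidate entity.
CANDIDATES = {
    "London, England": {"thames", "uk", "england", "tube"},
    "London, Texas": {"texas", "ranch", "cattle"},
}

def resolve(mention, context_words):
    """Pick the candidate whose evidence set best overlaps
    the words surrounding the mention."""
    scores = {
        entity: len(evidence & set(context_words))
        for entity, evidence in CANDIDATES.items()
    }
    return max(scores, key=scores.get)

print(resolve("London", ["a", "walk", "along", "the", "thames"]))
# "London, England"
```

By the time resolved entities reach business logic, the guessing has already been done upstream, which is the point of the paragraph above.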

The process of moving from source data to processed application data involves the loss of information. When you tidy up, you end up throwing stuff away. This underlines a principle of big data: when you can, keep everything. There may well be useful signals in the bits you throw away. If you lose the source data, there’s no going back.

Despite the popularity and well understood nature of relational databases, it is not the case that they should always be the destination for data, even when tidied up. Certain data types suit certain classes of database better. For instance, documents encoded as XML are most versatile when stored in a dedicated XML store such as MarkLogic. Social network relations are graphs by nature, and graph databases such as Neo4J make operations on them simpler and more efficient.

Even where there’s not a radical data type mismatch, a disadvantage of the relational database is the static nature of its schemas. In an agile, exploratory environment, the results of computations will evolve with the detection and extraction of more signals. Semi-structured NoSQL databases meet this need for flexibility: they provide enough structure to organize data, but do not require the exact schema of the data before storing it.
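The contrast is easy to see in miniature: a relational table rejects a record carrying a new field until the schema is migrated, while a document-style store simply accepts it. A sketch using an in-memory list of JSON-style documents as a stand-in for a document store:

```python
import json

documents = []

def store(doc):
    # Document stores impose no fixed schema: each record
    # carries its own fields. The JSON round-trip mimics
    # serializing the document into the store.
    documents.append(json.loads(json.dumps(doc)))

store({"user": "alice", "clicks": 12})
# Later, the pipeline starts extracting a new signal; no
# ALTER TABLE migration is needed before storing it.
store({"user": "bob", "clicks": 7, "sentiment": "positive"})

fields = [sorted(d) for d in documents]
print(fields)  # [['clicks', 'user'], ['clicks', 'sentiment', 'user']]
```

The flexibility cuts both ways: readers of such a store must tolerate records whose fields vary, which is the structural looseness the paragraph describes.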

In practice

We have explored the nature of big data, and surveyed the landscape of big data from a high level. As usual, when it comes to deployment there are dimensions to consider over and above tool selection.

Cloud or in-house?

The majority of big data solutions are now provided in three forms: software-only, as an appliance or cloud-based. Decisions between which route to take will depend, among other things, on issues of data locality, privacy and regulation, human resources and project requirements. Many organizations opt for a hybrid solution: using on-demand cloud resources to supplement in-house deployments.

Big data is big

It is a fundamental fact that data that is too big to process conventionally is also too big to transport anywhere. IT is undergoing an inversion of priorities: it’s the program that needs to move, not the data. If you want to analyze data from the U.S. Census, it’s a lot easier to run your code on Amazon’s web services platform, which hosts such data locally, and won’t cost you time or money to transfer it.

Even if the data isn’t too big to move, locality can still be an issue, especially with rapidly updating data. Financial trading systems crowd into data centers to get the fastest connection to source data, because that millisecond difference in processing time equates to competitive advantage.

Big data is messy

It’s not all about infrastructure. Big data practitioners consistently report that 80% of the effort involved in dealing with data is cleaning it up in the first place, as Pete Warden observes in his Big Data Glossary: “I probably spend more time turning messy source data into something usable than I do on the rest of the data analysis process combined.”
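A small taste of what that cleanup looks like in practice: the same logical values arrive in several inconsistent forms and must be normalized before any analysis can start. The records and rules below are illustrative:

```python
import re

raw_records = [
    {"name": "  Smith, John ", "joined": "2011-03-02"},
    {"name": "JOHN SMITH", "joined": "03/02/2011"},
    {"name": "J. Smith", "joined": ""},
]

def clean_name(name):
    name = name.strip()
    if "," in name:  # "Last, First" -> "First Last"
        last, first = [p.strip() for p in name.split(",", 1)]
        name = f"{first} {last}"
    return name.title()

def clean_date(date):
    # Normalize US-style MM/DD/YYYY to ISO 8601; keep None
    # for missing values rather than guessing.
    m = re.fullmatch(r"(\d{2})/(\d{2})/(\d{4})", date)
    if m:
        month, day, year = m.groups()
        return f"{year}-{month}-{day}"
    return date or None

cleaned = [{"name": clean_name(r["name"]), "joined": clean_date(r["joined"])}
           for r in raw_records]
print(cleaned[1])  # {'name': 'John Smith', 'joined': '2011-03-02'}
```

Even this tiny example needs case folding, whitespace stripping, name reordering and date parsing; multiply that by every field in a real feed and the 80% figure stops sounding surprising.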

Because of the high cost of data acquisition and cleaning, it’s worth considering what you actually need to source yourself. Data marketplaces are a means of obtaining common data, and you are often able to contribute improvements back. Quality can of course be variable, but will increasingly be a benchmark on which data marketplaces compete.


Culture

The phenomenon of big data is closely tied to the emergence of data science, a discipline that combines math, programming and scientific instinct. Benefiting from big data means investing in teams with this skill set, and surrounding them with an organizational willingness to understand and use data for advantage.

In his report, “Building Data Science Teams,” D.J. Patil characterizes data scientists as having the following qualities:

  • Technical expertise: the best data scientists typically have deep expertise in some scientific discipline.
  • Curiosity: a desire to go beneath the surface and discover and distill a problem down into a very clear set of hypotheses that can be tested.
  • Storytelling: the ability to use data to tell a story and to be able to communicate it effectively.
  • Cleverness: the ability to look at a problem in different, creative ways.

The far-reaching nature of big data analytics projects can have uncomfortable aspects: data must be broken out of silos in order to be mined, and the organization must learn how to communicate and interpret the results of analysis.

Those skills of storytelling and cleverness are the gateway factors that ultimately dictate whether the benefits of analytical labors are absorbed by an organization. The art and practice of visualizing data is becoming ever more important in bridging the human-computer gap to mediate analytical insight in a meaningful way.

Know where you want to go

Finally, remember that big data is no panacea. You can find patterns and clues in your data, but then what? Christer Johnson, IBM’s leader for advanced analytics in North America, gives this advice to businesses starting out with big data: first, decide what problem you want to solve.

If you pick a real business problem, such as how you can change your advertising strategy to increase spend per customer, it will guide your implementation. While big data work benefits from an enterprising spirit, it also benefits strongly from a concrete goal.





  • Interesting article. First thought popping up: I can imagine that techniques like semantics or artificial intelligence are involved in processing this big data. If so, how far away is judgment day from the movie Terminator? Of course I’m not naming a few steps in between, like that AI is bound by rules and instructions programmed by humans. But still… that thought in itself is more interesting.

  • Thanks for the great article. I like the bit about sending the program to the data instead of vice versa. I’ve always been interested in how program and data seem to have no hard dividing line in the brain, for example. I’ll be interested to see whether this may start to happen in areas where we’re looking for patterns but don’t even have a decent hypothesis from which to build initial analysis.

    Certainly big data should be quantitatively different to the normal sort. I remember being asked what University maths was like and whether it involved really big sums… It was kind of hard to explain just how different ‘big maths’ was from the school sort :-)

  • Edd – thanks for writing a great summary of Big Data and touching on a number of important themes that we see in data engineering and data science at Think Big Analytics. While Hadoop is batch-oriented today, things are evolving fast and Big Data systems will evolve to fast responses, e.g., with quicker job times for small queries. Larry Feinsmith of JP Morgan Chase raised the thought-provoking question of whether Hadoop and databases will converge.
    The definition of data science is itself a fast-evolving question – is it a question of scientific discipline or domain knowledge? I think data science can be best defined by an intense desire to understand and exploit data and in contrast to the traditional top-down, mathematical, sampling-based approaches of the past.

  • @ Rod,

    “…the thought provoking question of whether Hadoop and databases will converge.”

    This is already happening to the extent that it’s possible given the very different data & file structures involved. Here’s a couple of code snippets of what it looks like:

    Select count(*) from HDFS_data h, GPDB_data g
    where h.key = g.key;

    Insert into HDFS_data select * from GPDB_data;

    (taken from a Greenplum example)

  • Paul

    I think it’s entirely possible that Leo looks very smart in buying Autonomy when he did. Even for WHAT he did.

  • Liked the article. I would like to add that there are other vectors to BigData in addition to the 3V’s. While the 3V’s are better classified as the salient features of the data, the real drivers of the Big Data are technology, economics and the tangible value that can be extracted from the data, in other words the business insights!
    Would love your feedback on the following post http://shhrota.com/2012/01/02/the-big-in-big-data/

  • Fia G

    Great insight Edd. One other Hadoop alternative worth mentioning is HPCC Systems – a mature platform and great fit for all data models. The HPCC Systems platform provides for a data delivery engine together with a data transformation and linking system equivalent to Hadoop. The main advantages over other alternatives are the real-time delivery of data queries and the extremely powerful ECL language programming model. The ROI is significantly better than Hadoop in that it requires less nodes and less programmers. Visit http://hpccsystems.com.

  • Great piece Edd. And finally good to see the industry adopting the volume-variety-velocity construct that Gartner first published 11 years ago. For future reference, here’s the original piece I wrote back then entitled, “Three Dimensional Data Management: Controlling Volume, Velocity and Variety”: https://www.sugarsync.com/pf/D354224_7061872_35276. Since then we’ve recognized and written about other dimensions of Big Data as well. –Doug Laney, VP Research, Gartner. @doug_laney

  • Edd,
    Insightful blog, as usual. Couple of points:

    1. While it is a good idea to start with questions, many times it is better to start in an area and then formulate the questions as you discover the connections and context.
    2. In addition to Velocity, Volume & Variety, I would like to add the Variability, Context and Connectedness to the attributes of Big Data. Slides from my talk on “The Art Of Big Data” at the Naval Post Graduate School, Monterey is at http://www.slideshare.net/ksankar/the-art-of-big-data.


  • See also http://codingwiththomas.blogspot.com/2011/10/apache-hama-realtime-processing.html if you are interested in stream processing.

  • Very interesting article that is extremely relevant as we move into 2012.

    Big data was coined a few years back and is becoming important in our digital world as more info is developed and needs to be stored.

  • What do i do with big data?

  • Very informative. Possibly the best introductory article I have read on Big Data to date.


  • Ed, great article. The three Vs are simple and do a good job of encompassing the space. Nice Work.

  • If someone had asked me this question, I would have said, that the big data of today is the small data of tomorrow. :)

  • What is interesting is the need to clean the data. One could suggest that this is the next area for job creation. Cleaning data is not technically challenging but it needs human labour. I could suggest that this is where more jobs will be created in a way that Ford created a lot of jobs by the automobile factory line. Instead of hiring small boutiques to do this large scale work could be envisaged.
    This is all speculative, but there is the human capital underbelly to the big data story.
    What is also interesting is to consider what industries, financial, are better suited to this than others, say local government, yet it still has implications. Moreover, the focus on speed raises questions about counter-intuitive strategies such as long sales rather than marginal advantage.
    Finally, if data is stationary, it has feet of clay; does that make the cloud ultimately vulnerable to upload disruption or source disruption? We have not had a serious assault on big data centres (at least not publicized), but it makes for interesting thinking given the current turn by such groups as Anonymous as well as those seeking to resist those corporations that may benefit from big data.
    Thanks for a great article and the stimulating writing. Well done.

  • Since you made only a brief reference to XML databases, there are several commercial offerings in addition to the one you mentioned. Oracle, IBM, Versant to name a few. eXist-DB is an open source native XML database.

  • Re: “Big data is data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn’t fit the strictures of your database architectures.”

    The author is missing a very important reason why companies are looking for alternatives. First of all, there is no such thing as unstructured data. Any information (with few exceptions), even loosely structured, can be cleaned/transformed/summarized in order to fit into a traditional RDBMS or data warehouse system. But with a traditional database or storage system, doing so is often cost prohibitive (e.g. ridiculous licensing fees, per CPU core, etc). Plus, it’s difficult (and expensive) to scale out as your data needs grow.

  • Peter Schulz

    The act of cleaning data introduces semantic robustness into the underlying dataset. This comes on top of the syntactic structure, which may be well defined, but does not necessarily convey the meaning of the content. To be more intuitive, big data, as any other information process, needs to follow an architecture that respects the data (collection of raw data in a defined structure, attempting to minimize ambiguity, which will usually not be successful) – information (analysis for aggregation, semantic hardening, contextualization) – knowledge (analysis for conclusion and communication) hierarchy. A single-step analytic process will not achieve this. Formalization of this approach to industry scale will promote big data’s success.

  • Great explanation of big data. But really how many of organizations really need the big data visualization adhering to definition that data our of size and control. One should really be a very large enterprise to really use this data.

    While i understand importance of data and visualization i guess there is a checklist somewhere really to see if you really need it.

    Yeps for sure its where future is heading but not quite yet so SMB’s relax :)…….

  • Even as companies invest eight- and nine-figure sums to derive insight from information streaming in from suppliers and customers, less than 40% of employees have sufficiently mature processes and skills to do so. To overcome this insight deficit, “big data,” no matter how comprehensive or well analyzed, needs to be complemented by “big judgment.” Big data and big analytics will dramatically amplify the effects of human decisions, sometimes to an unimaginable scale. Our recent work in HBR highlights these issues: http://hbr.org/2012/04/good-data-wont-guarantee-good-decisions/ar/1

  • Great info. Quoted you in our blog today Edd. http://blog.anuesystems.com/anue-big-data-presentation-at-interop/

  • Great Info. I recently looked into Big Data. here is my initial impressions with Big Data and what it is all about