Building data startups: Fast, big, and focused

Low costs and cloud tools are empowering new data startups.

This is a written follow-up to a talk presented at a recent Strata online event.

A new breed of startup is emerging, built to take advantage of the rising tides of data across a variety of verticals and the maturing ecosystem of tools for its large-scale analysis.

These are data startups, and they are the sumo wrestlers on the startup stage. The weight of data is a source of their competitive advantage. But as with their sumo mentors, size alone is not enough. The most successful data startups must be fast (with data), big (with analytics), and focused (with services).

Setting the stage: The attack of the exponentials

The reason this style of startup is arising today, rather than a decade ago, is a confluence of forces that I call the Attack of the Exponentials. In short, over the past five decades, the cost of storage, CPU, and bandwidth has been dropping exponentially, while network access has increased exponentially. In 1980, a terabyte of disk storage cost $14 million. Today, it's at $30 and dropping. Classes of data that were previously economically unviable to store and mine, such as machine-generated log files, now represent prospects for profit.

Attack of the exponentials

At the same time, these technological forces are not symmetric: CPU and storage costs have fallen faster than those of network and disk IO. Thus data is heavy: it gravitates toward centers of storage and compute power in proportion to its mass. Migration to the cloud is the manifest destiny for big data, and the cloud is the launching pad for data startups.

Leveraging the big data stack

As the foundational layer in the big data stack, the cloud provides the scalable persistence and compute power needed to manufacture data products.

At the middle layer of the big data stack is analytics, where features are extracted from data, and fed into classification and prediction algorithms.

Finally, at the top of the stack are services and applications. This is the level at which consumers experience a data product, whether it be a music recommendation or a traffic route prediction.

Let’s take each of these layers in turn and discuss its competitive axes.

The emerging big data stack
The competitive axes and representative technologies on the big data stack are illustrated here. At the bottom tier, data, free tools are shown in red (MySQL, Postgres, Hadoop), and their commercial adaptations (InfoBright, Greenplum, MapR) compete principally along the axis of speed, offering faster processing and query times. Several of these players are pushing up toward the second tier of the stack, analytics. At this layer, the primary competitive axis is scale: few offerings can address terabyte-scale data sets, and those that do are typically proprietary. Finally, at the top layer of the big data stack lie the services that touch consumers and businesses. Here, focus within a specific sector, combined with depth that reaches downward into the analytics tier, is the defining competitive advantage.

Fast data

At the base of the big data stack — where data is stored, processed, and queried — the dominant axis of competition was once scale. But as cheaper commodity disks and Hadoop have effectively addressed scalable persistence and processing, the focus of competition has shifted toward speed. The demand for faster disks has led to an explosion in interest in solid-state disk firms, such as Fusion-IO, which went public recently. And several startups, most notably MapR, are promising faster versions of Hadoop.

Fusion-IO and MapR represent another trend at the data layer: commercial technologies that challenge open source or commodity offerings on an efficiency basis, namely watts or CPU cycles consumed. With energy accounting for between one-third and one-half of data center operating costs, these efficiencies have a direct financial impact.

Finally, just as many large-scale, NoSQL data stores are moving from disk to SSD, others have observed that many traditional, relational databases will soon be entirely in memory. This is particularly true for applications that require repeated, fast access to a full set of data, such as building models from customer-product matrices. This brings us to the second tier of the big data stack, analytics.
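To see why model building pulls data into memory, consider a toy version of the customer-product case. The sketch below is illustrative only (the matrix, function names, and similarity choice are my own assumptions, not any vendor's method): computing item-to-item similarities makes repeated passes over the full matrix, which is cheap in memory and painful if every pass round-trips to disk.

```python
# Sketch: why model-building favors in-memory data. Computing item-item
# similarities touches the full customer x product matrix repeatedly, so
# round-trips to disk on each pass would dominate the runtime.
import numpy as np

def item_similarities(ratings):
    """Cosine similarity between the item columns of a customer x product matrix."""
    norms = np.linalg.norm(ratings, axis=0)
    norms[norms == 0] = 1.0           # avoid dividing by zero for unrated items
    normalized = ratings / norms      # unit-length item vectors
    return normalized.T @ normalized  # one dense pass over the whole matrix

# Toy matrix: 4 customers x 3 products (e.g., purchase counts).
ratings = np.array([[2., 0., 1.],
                    [1., 1., 0.],
                    [0., 3., 0.],
                    [1., 0., 2.]])
sims = item_similarities(ratings)  # 3 x 3 item-to-item similarity matrix
```

With real customer bases the matrix is far larger, but the access pattern is the same: whole-matrix scans, repeated many times, which is exactly the workload that rewards keeping data resident in memory.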

Big analytics

At the second tier of the big data stack, analytics is the brains to cloud computing’s brawn. Here, however, speed is less of a challenge; given an addressable data set in memory, most statistical algorithms can yield results in seconds. The challenge is scaling these algorithms out to address large datasets, and rewriting them to operate in an online, distributed manner across many machines.
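As one concrete illustration of what "rewriting an algorithm to operate online" means, the sketch below restates a batch statistic as a streaming update (this is Welford's classic method for running mean and variance, offered here as a minimal example rather than anything the article's vendors specifically ship). Each machine can process its shard one observation at a time without ever holding the full dataset.

```python
# Sketch: a batch statistic rewritten as an online algorithm. Welford's
# method updates mean and variance one observation at a time, so a machine
# can consume its shard as a stream instead of loading all data at once.
class RunningStats:
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0   # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        return self.m2 / self.n if self.n else 0.0

stats = RunningStats()
for x in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(x)
# For this stream: mean is 5.0 and population variance is 4.0
```

The same pattern of small, mergeable state is what makes such algorithms distributable: per-machine results can be combined without revisiting the raw observations.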

Because data is heavy and algorithms are light, one key strategy is to push code deeper to where the data lives, to minimize network IO. This often requires a tight coupling between the data storage layer and the analytics, and algorithms often need to be rewritten as user-defined functions (UDFs) in a language compatible with the data layer. Greenplum, leveraging its Postgres roots, supports UDFs written in both Java and R. Following Google’s BigTable, HBase is introducing coprocessors in its 0.92 release, allowing Java code to be associated with data tablets and minimizing data transfer over the network. Netezza pushes even further into hardware, embedding an array of functions into FPGAs that are physically co-located with the disks of its storage appliances.
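The economics of pushing code to the data can be sketched in a few lines. The example below is a schematic only (the node function, shard layout, and merge step are illustrative inventions, not any particular vendor's API): each storage node runs a small function over its local rows and ships back a tiny partial result, so network traffic scales with the number of nodes rather than the number of rows.

```python
# Sketch of "push code to the data": instead of pulling every row across the
# network, each storage node runs a small function (a UDF) over its local
# rows and returns only a partial aggregate for the coordinator to merge.
def node_side_udf(rows):
    """Runs where the rows live; returns a tiny partial result."""
    total, count = 0.0, 0
    for r in rows:
        total += r
        count += 1
    return total, count

def distributed_mean(shards):
    """Coordinator merges per-node partials; network IO is O(nodes), not O(rows)."""
    partials = [node_side_udf(shard) for shard in shards]  # imagine each call as an RPC
    total = sum(t for t, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

# Three "nodes", each holding a shard of the data.
shards = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
result = distributed_mean(shards)  # 3.5, same as centralizing all six rows
```

Only aggregates that decompose into mergeable partials (sums, counts, sketches) get this benefit for free, which is one reason algorithms often need rewriting before they can live next to the data.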

The field of what’s alternatively called business or predictive analytics is nascent, and while a range of enabling tools and platforms exist (such as R, SPSS, and SAS), most of the algorithms developed are proprietary and vertical-specific. As the ecosystem matures, one may expect to see the rise of firms selling analytical services — such as recommendation engines — that interoperate across data platforms. But in the near term, consultancies like Accenture and McKinsey are positioning themselves to provide big analytics via billable hours.

Outside of consulting, firms with analytical strengths push upward, surfacing focused products or services to achieve success.

Focused services

The top of the big data stack is where data products and services directly touch consumers and businesses. For data startups, these offerings more frequently take the form of a service, offered as an API rather than a bundle of bits.

BillGuard is a great example of a startup offering a focused data service. It monitors customers’ credit card statements for dubious charges, and even leverages the collective behavior of users to improve its fraud predictions.

Several startups, including Flipboard, are working on algorithms that can crack the content relevance nut. Klout delivers a pure data service that uses social media activity to measure online influence. My company, Metamarkets, crunches server logs to provide pricing analytics for publishers.

For data startups, data processes and algorithms define their competitive advantage. Poor predictions — whether of fraud, relevance, influence, or price — will sink a data startup, no matter how well-designed their web UI or mobile application.

Focused data services aren’t limited to startups: LinkedIn’s People You May Know and Foursquare’s Explore feature enhance engagement with their companies’ core products, but only when they correctly suggest people and places.

Democratizing big data

The axes of strategy in the big data stack show analytics to be squarely at the center. Data platform providers are pushing upwards into analytics to differentiate themselves, touting support for fast, distributed code execution close to the data. Traditional analytics players, such as SAS and SAP, are expanding their storage footprints and challenging the need for alternative data platforms as staging areas. Finally, data startups and many established firms are creating services whose success hinges directly on proprietary analytics algorithms.

The emergence of data startups highlights the democratizing consequences of a maturing big data stack. For the first time, companies can successfully build offerings without deep infrastructure know-how, focusing instead at a higher level on developing analytics and services. By all indications, this is a democratic force that promises to unleash a wave of innovation in the coming decade.




  • Alex Tolley

    Shouldn’t you also be addressing the exponential growth of data and how it compares with the others? Is it rising slower or faster than these others, thus implying that technology is closing or getting left behind with the sheer volumes produced?

    Or is it a case of data filling the available storage and processing capability?

  • Partha Ghosh

    What is often important is not the size of the data; that’s not an engineer’s approach. Rather, it’s how much of the big data (sampling) is truly required for useful inferences. That can easily bring down the complexity by nearly two orders of magnitude!

  • I see those exponential trends only continuing throughout the next several decades. I think connectivity will continue to proliferate, and the cost of data storage and bandwidth will continue to decrease. With these trends comes also an increase of available data, which must be sorted through in a logical and robust way, which is where these data start-ups find their niche.

  • Great recap, Michael. I’ve seen earlier iterations of this breakdown in other blogposts that you’ve written.

    Without disagreeing with you on anything, I still feel like there’s a bigger picture to the data entrepreneurship opportunity, and that’s the ongoing effort to drag data kicking and screaming from being an ephemeral thing to a first-class media type. When data is an object in the social graph, and gains the general appreciation that music, video, and documents enjoy, a lot of fundamental assumptions must be rechecked. Suddenly data isn’t the exclusive realm of quants and analysts; it’s been upgraded to something that even non-technical people can understand and potentially even interact with.

    A great example of this is Wikipedia and Google. It used to be that if you wanted to look up an encyclopedia article, you’d go to the Brittanica site and call up an article. It certainly wasn’t editable and nobody could easily improve upon it.

    Yet the perfect storm of Wikipedia giving each article a meaningful URL and Google facilitating search meant that suddenly Wikipedia articles are the top result of most Google searches. Our site is our attempt to replicate the trajectory of Wikipedia with data. We give dataset publishers the opportunity to curate a community of people who are interested in, and often working with, their data. In this regard, we are trying to be the glue that ties together people working on data, regardless of where in the diagram they are standing.

    Try it out… I’d love to know what you think so far!

  • Robert Ames

    Nice article, and great depiction of the layers. But if the cost of data access is increasing exponentially, how would that define a move to the cloud? Moving data across WANs is now relatively very expensive, not just for mobile users, but for corporations. So “dumping” your big and fast data into a cloud, where you are continually paying premiums for access even if the storage is cheap, could be prohibitively expensive. @robertoames

  • On the subject of democratizing big data, something that troubles me with the term “big data” is that it reminds me of “long string” in the “how long is a piece of string?” sense. How big is big? The edge cases are very clear: Google and Facebook are big data; a retailer with 3 stores and no web site is not. But importantly, the vast majority of companies fall between these extremes. Many smaller businesses now capture machine-generated data in addition to old-school transactions, and volumes are exploding as a result. While this data might not be “big” from a computer science perspective, it feels pretty big to the people who are trying to understand what it means, and in particular the association between the transactional and machine-generated data.

    From a technology perspective, this neglected majority doesn’t need the typical tools of big data like Hadoop, can’t afford Netezza and wouldn’t remember the math to use SPSS or SAS. Fortunately, there is an additional category of analytics startups emerging to address these customer needs with simple, elegant analytics solutions based on recent technologies combining columnar storage, in-memory, and compression to blast their way through hundreds of millions of records almost instantly using hardware you can pick up in Best Buy. One example is sisense.

  • Michael, nice article. Would be nice to hear your take on where 1010data fits into the mix.
