The year in big data and data science

Big data and data science have both been with us for a while. According to McKinsey & Company’s May 2011 report on big data, back in 2009 “nearly all sectors in the U.S. economy had at least an average of 200 terabytes of stored data … per company with more than 1,000 employees.” And on the data-science front, Amazon’s John Rauser used his presentation at Strata New York to trace the profession of data scientist all the way back to 18th-century German astronomer Tobias Mayer.

Of course, novelty and growth are separate things, and in 2011, there were a number of new technologies and companies developed to address big data’s issues of storage, transfer, and analysis. Important questions were also raised about how the growing ranks of data scientists should be trained and how data science teams should be constructed.

With that as a backdrop, below I take a look at three evolving data trends that played an important role over the last year.

The ubiquity of Hadoop

It was a big year for investment for Apache Hadoop-based companies. Hortonworks, which was spun out of Yahoo this summer, raised $20 million upon its launch. And when Cloudera announced it had raised $40 million this fall, GigaOm’s Derrick Harris calculated that, all told, Hadoop-based startups had raised $104.5 million between May and November of 2011. (Other startups raising investment for their Hadoop software included Platfora, Hadapt, and MapR.)

But it wasn’t just startups that got in on the Hadoop action this year: IBM announced this fall that it would offer Hadoop in the cloud; Oracle unveiled its own Hadoop distribution running on its new Big Data appliance; EMC signed a licensing agreement with MapR; and Microsoft opted to put its own big data processing system, Dryad, on hold, signing a deal instead with Hortonworks to handle Hadoop on Azure.

The growing number of Hadoop providers and adopters has spurred more solutions for managing and supporting Hadoop. This will become increasingly important in 2012 as Hadoop moves beyond the purview of data scientists to become a tool more businesses and analysts utilize.

More data, more privacy and security concerns

Despite all the promise that better tools for handling and analyzing data hold, there were numerous concerns this year about the privacy and security implications of big data, stemming in part from a series of high-profile data thefts and scandals.

In April, a security breach at Sony led to the theft of the personal data of 77 million users. The intrusion into the PlayStation Network prompted Sony to pull it offline, but Sony did not notify its users about the issue for a full week (later admitting that it stored usernames and passwords unencrypted). Estimates of the cost of the security breach to Sony: between $170 million and $24 billion.

That’s a wide range of estimates for the damage done to the company, but the point is clear nonetheless: not only do these sorts of data breaches cost companies millions, but the value of consumers’ personal data is also increasing — for both legitimate and illegitimate purposes.

Sony was hardly the only company with security and privacy concerns on its hands. In April, Alasdair Allan and Pete Warden uncovered a file in Apple iOS software that noted users’ latitude-longitude coordinates along with a timestamp. Apple responded, insisting that the company “is not tracking the location of your iPhone. Apple has never done so and has no plans to ever do so.” Apple fixed what it said was a “bug.”

Late this year, almost all handset makers and carriers were implicated by another mobile concern when Android developer Trevor Eckhart reported that the mobile intelligence company Carrier IQ’s rootkit software could record all sorts of user data — texts, web browsing, keystrokes, and even phone calls.

That the data from mobile technology was at the heart of these two controversies reflects in some ways our changing data usage patterns. But whether it’s mobile or not, as we do more online — shop, browse, chat, check in, “like” — it’s clear that we’re leaving behind an immense trail of data about ourselves. This year saw the arrival of several open-source efforts, such as the Locker Project and ThinkUp, that strive to give users better control over their personal social data.

And while better control and safeguards can offer some level of protection, it’s clear that technology can always be cracked and the goals of data aggregators can shift. So, if digital data is and always will be a moving target, how does that shape our expectations for privacy? In Privacy and Big Data, published this year, co-authors Terence Craig and Mary Ludloff argued that we might be paying too much attention to concerns about “intrusions of privacy” and that instead we need to be thinking about better transparency with how governments and companies are using our data.

Open data’s inflection point

Screenshot from the Open Knowledge Foundation’s Open Government Data Map.

When it comes to better transparency, 2011 has been a good year for open data, with strong growth in the number of open data efforts. Canada, the U.K., France, the U.S., and Kenya were a few of the countries unveiling open data initiatives.

There were still plenty of open data challenges: budget cuts, for example, threatened the U.S. Data.gov initiative. And in his “state of open data 2011” talk, open data activist David Eaves pointed to the challenges of having different schemas and few standards, making it difficult for some datasets to be used across systems and jurisdictions.

Even with a number of open data “wins” at the government level, a recent survey of the data science community by EMC named the lack of open data as one of the obstacles that data scientists and business intelligence analysts said they faced. Just 22% of the former and 12% of the latter said that they “strongly believed” that the employees at their companies have the access they need to run experiments on data. Arguably, more open data efforts have spawned more interest and better understanding of what this can mean.

The demand for more open data has also spawned a demand for more tools. Importantly, these tools are beginning to be open to more than just data scientists or programmers. They include things like visualization-creator Visual.ly, the scraping tool ScraperWiki, and data-sharing site BuzzData.

