Strata Week: The looming data science talent shortage

EMC study looks at the state of data science, Carrier IQ and big data, and the welcome return of old tweets.

Here are a few of the big-data stories that caught my attention this week.

Data scientists in demand

This week, EMC released (pdf) the findings of its recent survey of the data science community. Billed by EMC as the largest survey of its kind, the EMC Data Science Study included responses from more than 500 data scientists, information analysts, and data specialists from the U.S., U.K., France, Germany, India and China.

The majority of respondents (83%) said they believed that new technologies would increase the need for data scientists. But 64% also felt that this new demand would outstrip the supply (31% said demand would “significantly outpace” supply). Just 12% felt that future data science jobs would be filled by current business intelligence professionals.

[Chart from the Data Science Revealed study]

The source for future talent? College students, not surprisingly: 34% said future data science jobs would go to computer science grads; 24% said these jobs would go to those from other disciplines. And in the case of data scientists, those may well be students with master's degrees or PhDs: some 40% of data scientists have an advanced degree, and nearly one in 10 has a doctorate. In comparison, fewer than 1% of business intelligence professionals have a PhD.

But the problems the data science community faces go beyond a future talent shortage. Just a third of respondents said they were confident in their company’s ability to make data-driven business decisions. Again, respondents pointed to a shortage of employees with the right training or skills (32%). Budget shortages were also an issue (32%).

Another problem uncovered by the survey: data accessibility. Just 12% of business intelligence analysts and 22% of data scientists say they “strongly believe” that employees have the access they need to run experiments on data.

Carrier IQ and big data

The mobile intelligence company Carrier IQ has gone from obscurity to infamy following the discovery by Android developer Trevor Eckhart that Carrier IQ’s rootkit software could record all sorts of user data, including texts, web browsing, keystrokes, and even phone calls.

The software is on an estimated 100 million phones, Android and iOS alike, and the news of it has prompted calls for an FTC investigation, questions from a senator, and class-action lawsuits.

Carrier IQ issued a statement, explaining that “Our software makes your phone better by delivering intelligence on the performance of mobile devices and networks to help the operators provide optimal service efficiency.”

But at GigaOm, Kevin Fitchard called Carrier IQ’s relationships to handset makers and carriers a “bizarre big-data triangle”:

This is big data for the mobile world — massive databases of consumer behavior delving into when, how and in what manner we use our devices. By Carrier IQ’s own admission, its software is embedded in more than 150 million handsets. There are plenty of companies that would find that information enormously useful. The problem is Carrier IQ never got permission from all these smartphone users to collect that data, never told them it was gathering it, and never provided a way of opting out.

DataSift will soon offer access to historical tweets

[Image: DataSift Historical Data]

It was April of last year when Twitter announced it was donating its entire archive to the Library of Congress, and since then, researchers have been waiting to get their hands on this older Twitter data.

As it currently stands, you can only search Twitter back as far as a week. And while you can get access to the Twitter firehose, that’s little help for examining the historical record.

But starting soon, developers and researchers will have access to a bit more of that record when DataSift begins offering historical data. DataSift’s alpha version will offer access to 60 days’ worth of the Twitter feed, and when the service formally launches next year, DataSift promises more data.

It’s not quite the Library of Congress, which, as we noted earlier this year, is working on the technology infrastructure to make the historical Tweets indexable and accessible. The Library of Congress does have access to the Twitter firehose (via the other stream provider, Gnip), so it looks like that’s where the complete record will, for now at least, reside.

Got data news?

Feel free to email me.

