"Big Data Culture" entries

Forecasting events, from disease outbreaks to sales to cancer research

The O'Reilly Data Show Podcast: Kira Radinsky on predicting events using machine learning, NLP, and semantic analysis.

Editor’s note: One of the more popular speakers at Strata + Hadoop World, Kira Radinsky was recently profiled in the new O’Reilly Radar report, Women in Data: Cutting-Edge Practitioners and Their Views on Critical Skills, Background, and Education.

When I first took over organizing Hardcore Data Science at Strata + Hadoop World, one of the first speakers I invited was Kira Radinsky. Radinsky had already garnered international recognition for her work forecasting real-world events (disease outbreaks, riots, etc.). She’s currently the CTO and co-founder of SalesPredict, a start-up using predictive analytics to “understand who’s ready to buy, who may buy more, and who is likely to churn.”

I recently had a conversation with Radinsky, and she took me through the many techniques and subject domains from her past and present research projects. In grad school, she helped build a predictive system that combined newspaper articles, Wikipedia, and other open data sets. Through fine-tuned semantic analysis and NLP, Radinsky and her collaborators devised new metrics of similarity between events. The techniques she developed for that predictive software system are now the foundation of applications across many areas. Read more…

Closing the gender gap in tech

Stories from women who are making a big impact on the field of big data.

The gender gap in tech is not news, but here’s what is: it’s shrinking. In O’Reilly’s latest report — Women in Data: Cutting-Edge Practitioners and Their Views on Critical Skills, Background, and Education — female data practitioners discuss their work, their achievements, and the attitudes that have propelled them forward to career success.

Through a series of 15 interviews with women across the data field, we’ve uncovered stories we think you’ll find both interesting and inspiring. The interviews explore:

  • Interviewees’ views about opportunities for women in the fields of science, technology, engineering, and math (STEM)
  • Benefits of the data field as a career choice for women
  • The changing attitudes of Millennials toward women working in data
  • Remedies for continuing to close the gender gap in tech

Our findings reveal an important consensus among the women we interviewed — the role of female mentors and role models working in STEM is extremely important for opening up the pathway for more women to enter these fields. In fact, the impact that mentors have had on our interviewees has inspired many of them to serve as mentors to other female colleagues, and younger generations of girls, today. Read more…

Now available: Big Data Now, 2014 edition

Our wrap-up of important developments in the big data field.

In the four years we’ve been producing Big Data Now, our wrap-up of important developments in the big data field, we’ve seen tools and applications mature, multiply, and coalesce into new categories. This year’s free wrap-up of Radar coverage is organized around seven themes:
  • Cognitive augmentation: As data processing and data analytics become more accessible, jobs that can be automated will go away. But to be clear, there are still many tasks where the combination of humans and machines produces superior results.
  • Intelligence matters: Artificial intelligence is now playing a bigger and bigger role in everyone’s lives, from sorting our email to rerouting our morning commutes, from detecting fraud in financial markets to predicting dangerous chemical spills. The computing power and algorithmic building blocks to put AI to work have never been more accessible.
  • Read more…

A brief look at data science’s past and future

In this O'Reilly Data Show Podcast: DJ Patil weighs in on a wide range of topics in data science and big data.

Back in 2008, when we were working on what became one of the first papers on big data technologies, one of our first visits was to LinkedIn’s new “data” team. Many of the members of that team went on to build interesting tools and products, and team manager DJ Patil emerged as one of the best-known data scientists. I recently sat down with Patil to talk about his new ebook (written with Hilary Mason) and other topics in data science and big data.

Here are a few of the topics we touched on:

Proliferation of programs for training and certifying data scientists

Patil and I are both ex-academics who learned “data science” in industry. In fact, up until a few years ago one acquired data science skills via “on-the-job training.” But a new job title that catches on usually leads to an explosion of programs (I was around when master’s programs in financial engineering took off). Are these programs the right way to acquire the necessary skills? Read more…

Becoming data driven

DJ Patil and Hilary Mason's Data Driven: Creating a Data Culture is about building organizations that can take advantage of data.

I’m excited to see that DJ Patil and Hilary Mason’s new ebook Data Driven: Creating a Data Culture is now available. It’s been a lot of fun working with DJ and Hilary over the past few months.

I’m not going to summarize their work here: you should read it. It’s based on the realization that merely assembling a bunch of people who understand statistics doesn’t do the job. You end up with a group of data specialists on the margins of the organization, who don’t have the ability to do anything more than be frustrated. If you don’t develop a data culture, if people don’t understand the value of data and how it can be used to inform discussions, you can build all the dashboards and Hadoop clusters you want, but they won’t help you.

Data is a powerful tool, but it’s easy to jump on the data bandwagon and miss the benefits. Data Driven: Creating a Data Culture is about building organizations that can really take advantage of data. Is that organization yours? Read more…

Top keynotes at Strata Conference and Strata + Hadoop World 2014

From data privacy to real-world problem solving, O’Reilly’s data editors highlight the best of the best talks from 2014.

2014 was a year of tremendous growth in the field of data, and likewise for Strata and Strata + Hadoop World, O’Reilly’s and Cloudera’s series of data conferences. At Strata, keynotes, individual sessions, and tracks like Hardcore Data Science, Hadoop and Beyond, Data-Driven Business Day, and Design & Interfaces, among others, explore the cutting-edge aspects of how to gather, store, wrangle, analyze, visualize, and make decisions with the vast amounts of data on our hands today. Looking back on the past year of Strata, the O’Reilly data editors chose our top keynotes from Strata Santa Clara, Strata Barcelona, and Strata + Hadoop World NYC.

It was tough to winnow the list down from an exceptional set of keynotes. Visit the O’Reilly YouTube channel for a larger set of 2014 keynotes, or Safari for videos of the keynotes and many of the conference sessions.

Best of the best

  • Julia Angwin reframes the issue of data privacy as justice, due process, and human rights (and her account of trying to buy better privacy goods and services is both instructive and funny).

  • Read more…

The promise and problems of big data

A look at the social and moral implications of living in a deeply connected, analyzed, and informed world.

Editor’s note: this is an excerpt from our new report Data: Emerging Trends and Technologies, by Alistair Croll. You can download the free report here.

We’ll now look at both the light and the shadows of this new dawn, the social and moral implications of living in a deeply connected, analyzed, and informed world. This is both the promise and the peril of big data in an age of widespread sensors, fast networks, and distributed computing.

Solving the big problems

The planet’s systems are under strain from a burgeoning population. Scientists warn of rising tides, droughts, ocean acidity, and accelerating extinction. Medication-resistant diseases, outbreaks fueled by globalization, and myriad other semi-apocalyptic Horsemen ride across the horizon.

Can data fix these problems? Can we extend agriculture with data? Find new cures? Track the spread of disease? Understand weather and marine patterns? General Electric’s Bill Ruh says that while the company will continue to innovate in materials sciences, the place where it will see real gains is in analytics.

It’s often been said that there’s nothing new about big data. The “iron triangle” of Volume, Velocity, and Variety that Doug Laney coined in 2001 has been a constraint on all data since the first database. Basically, you could have any two you want fairly affordably. Consider:

  • A coin-sorting machine sorts a large volume of coins rapidly, but assumes a small variety of coins. It wouldn’t work well if there were hundreds of coin types.
  • A public library, organized by the Dewey Decimal System, has a wide variety of books and topics, and a large volume of those books — but stacking and retrieving the books happens at a slow velocity.

What’s new about big data is that getting all three Vs has become so cheap it’s almost not worth billing for. A Google search happens with great alacrity, combs the sum of online knowledge, and retrieves a huge variety of content types. Read more…

New opportunities in the maturing marketplace of big data components

The evolving marketplace is making new data applications and interactions possible.

Editor’s note: this is an excerpt from our new report Data: Emerging Trends and Technologies, by Alistair Croll. Download the free report here.

Here’s a look at some options in the evolving, maturing marketplace of big data components that are making the new applications and interactions we’ve been looking at possible.

Graph theory

First used in social network analysis, graph theory is finding more and more homes in research and business. Machine learning systems can scale up fast with tools like Parameter Server, and the TitanDB project means developers have a robust set of tools to use.

Are graphs poised to take their place alongside relational database management systems (RDBMS), object storage, and other fundamental data building blocks? What are the new applications for such tools?

Inside the black box of algorithms: whither regulation?

It’s possible for a machine to create an algorithm no human can understand. Evolutionary approaches to algorithmic optimization can result in inscrutable, yet demonstrably better, computational solutions.

If you’re a regulated bank, you need to share your algorithms with regulators. But if you’re a private trader, you’re under no such constraints. And having to explain your algorithms limits how you can generate them.

As more and more of our lives are governed by code that decides what’s best for us, replacing laws, actuarial tables, personal trainers, and personal shoppers, oversight means opening up the black box of algorithms so they can be regulated.

Years ago, Orbitz was shown to be charging web visitors who owned Apple devices more money than those visiting via other platforms, such as the PC. Only that’s not the whole story: Orbitz’s machine learning algorithms, which optimized revenue per customer, learned that the visitor’s browser was a predictor of their willingness to pay more. Read more…

2014 Data Science Salary Survey

Salary insights from more than 800 data professionals reveal a correlation to skills and tools.

Data is growing: Whether in terms of data-driven applications, the diversity of tools or the actual quantities of data we collect and process, the data space is characterized by expansion. The excitement around data has been tempered in some circles — the first two query completion suggestions for a Google search of “Is data science” are “dead” and “a fad” — but from a practitioner’s perspective, things are looking quite rosy.

In the results of this year’s O’Reilly Media Data Science Salary Survey, we found a median total salary of $98k ($144k for US respondents only). The 816 data professionals in the survey included engineers, analysts, entrepreneurs, and managers (although almost everyone had some technical component in their role).

Why the high salaries? While the demand for data applications has increased rapidly, the number of people who set up the systems and perform advanced analytics has increased much more slowly. Newer tools such as Hadoop and Spark should have even fewer expert users, and correspondingly we found that users of these tools have particularly high salaries. Read more…

Privacy is a concept, not a regime

In this O'Reilly Radar Podcast: Dr. Gilad Rosner talks about data privacy, and Alasdair Allan chats about the broken IoT.

In this podcast episode, I catch up with Dr. Gilad Rosner, a visiting researcher at the Horizon Digital Economy Research Institute in England. Rosner focuses on privacy, digital identity, and public policy, and is launching an Internet of Things Privacy Forum. We talk about personal data privacy in the age of the Internet of Things (IoT), privacy as a social characteristic, an emerging design ethos for technologists, and whether or not we actually own our personal data. Rosner characterizes personal data privacy as a social construct and addresses the notion that privacy is dead:

“Firstly, it’s important to recognize the idea that privacy is not a regime to control information. Privacy is a much larger concept than that. Regimes to control information are ways that we as a society preserve privacy, but privacy itself emerges from social needs and from individual human needs. The idea that privacy is dead comes from the vulnerability that people are feeling because they can see that it’s very difficult to maintain walls between their informational spheres, but that doesn’t mean that there aren’t countercurrents to that, and it doesn’t mean that there aren’t ways, as we go forward, to improve privacy preservation in the electronic spaces that we continue to move into.”

As we move more and more into these electronic spaces and the Internet of Things becomes democratized, our notions of privacy are shifting on a cultural level beyond anything we’ve experienced as a society before. Read more…