- The Parable of Google Flu (PDF) — We explore two
issues that contributed to [Google Flu Trends]’s mistakes—big data hubris and algorithm dynamics—and offer lessons for moving forward in the big data age. Overtrained and underfed?
- Principles of Good Data Analysis (Greg Reda) — Once you’ve settled on your approach and data sources, you need to make sure you understand how the data was generated or captured, especially if you are using your own company’s data. Treble so if you are using data you snaffled off the net, riddled with collection bias and untold omissions. (via Stijn Debrouwere)
"data analysis" entries
The O'Reilly Radar Podcast: John Carnahan on holistic data analysis, engagement channels, and data science as an art form.
In this Radar Podcast episode, I sit down with John Carnahan, executive vice president of data science at Ticketmaster. At our recent Strata + Hadoop World Conference in San Jose, CA, Carnahan presented a session on using data science and machine learning to improve ticket sales and marketing at Ticketmaster.
I took the opportunity to chat with Carnahan about Ticketmaster’s evolving approach to data analysis, the avenues of user engagement they’re investigating, and how his genetics background is informing his work in the big data space.
When Carnahan took the job at Ticketmaster about three years ago, his strategy focused on small, concrete tasks aimed at solving distinct nagging problems: how do you address large numbers of tickets not sold at an event, how do you engage and market those undersold events to fans, and how do you stem abuse of ticket sales. This strategy has evolved, Carnahan explained, to a more holistic approach aimed at bridging the data silos within the company:
“We still want those concrete things, but we want to build a bed of data science assets that’s built on top of a company that’s been around almost 40 years and has a lot of data assets. How do we build the platform that will leverage those things into the future, beyond just those small niche products that we really want to build. We’re trying to bridge the gap between a lot of those products, too. Rather than think of each of those things as a vertical or a silo that’s trying to accomplish something, it’s how do you use something that you’ve built over here, over there to make that better?”
A new operator from the magrittr package makes it easier to use R for data analysis.
In every data analysis, you have to string together many tools. You need tools for data wrangling, visualisation, and modelling to understand what’s going on in your data. To use these tools effectively, you need to be able to easily flow from one tool to the next, focusing on asking and answering questions of the data, not struggling to jam the output from one function into the format needed for the next. Wouldn’t it be nice if the world worked this way! I spend a lot of my time thinking about this problem, and how to make the process of data analysis as fast, effective, and expressive as possible. Today, I want to show you a new technique that I’m particularly excited about.
R, at its heart, is a functional programming language: you do data analysis in R by composing functions. However, the problem with function composition is that a lot of it makes for hard-to-read code. For example, here’s some R code that wrangles flight delay data from New York City in 2013. What does it do? Read more…
Addressing in-memory limitations and scalability issues of R.
The R programming language is the most popular statistical software in use today by data scientists, according to the 2013 Rexer Analytics Data Miner survey. One of the main drawbacks of vanilla R is the inability to scale and handle extremely large datasets because by default, R programs are executed in a single thread, and the data being used must be stored completely in RAM. These barriers present a problem for data analysis on massive datasets. For example, the R installation and administration manual suggests using data structures no larger than 10-20% of a computer’s available RAM. Moreover, high-level languages such as R or Matlab incur significant memory overhead because they use temporary copies instead of referencing existing objects.
One potential forthcoming solution to this issue could come from Teradata’s upcoming product, Teradata Aster R, which runs on the Teradata Aster Discovery Platform. It aims to facilitate the distribution of data analysis over a cluster of machines and to overcome one-node memory limitations in R applications. Read more…
Edge contributors say it's time to retire the search for one-size-fits-all answers.
The 2014 Edge Annual Question (EAQ) is out. This year, the question posed to the contributors is: What scientific idea is ready for retirement?
As usual with the EAQ, it provokes thought and promotes discussion. I have only read through a fraction of the responses so far, but I think it is important to highlight a few Edge contributors who answered with a common, and in my opinion a very important and timely, theme. The responses that initially caught my attention came from Laurence Smith (UCLA), Gavin Schmidt (NASA), Guilio Boccaletti (The Nature Conservancy) and Danny Hillis (Applied Minds). If I were to have been asked this question, my contribution for idea retirement would likely align most closely with these four responses: Smith and Boccaletti want to see same idea disappear — stationarity; Schmidt’s response focused on the abolition of simple answers; and Hillis wants to do away with cause-and-effect.
Big Data and analytics are the foundation of personalized medicine
Despite considerable progress in prevention and treatment, cancer remains the second leading cause of death in the United States. Even with the $50 billion pharmaceutical companies spend on research and development every year, any given cancer drug is ineffective in 75% of the patients receiving it. Typically, oncologists start patients on the cheapest likely chemotherapy (or the one their formulary suggests first) and in the 75% likelihood of non-response, iterate with increasingly expensive drugs until they find one that works, or until the patient dies. This process is inefficient and expensive, and subjects patients to unnecessary side effects, as well as causing them to lose precious time in their fight against a progressive disease. The vision is to enable oncologists to prescribe the right chemical the first time–one that will kill the target cancer cells with the least collateral damage to the patient.
How data can improve cancer treatment
Big data is enabling a new understanding of the molecular biology of cancer. The focus has changed over the last 20 years from the location of the tumor in the body (e.g., breast, colon or blood), to the effect of the individual’s genetics, especially the genetics of that individual’s cancer cells, on her response to treatment and sensitivity to side effects. For example, researchers have to date identified four distinct cell genotypes of breast cancer; identifying the cancer genotype allows the oncologist to prescribe the most effective available drug first.
Herceptin, the first drug developed to target a particular cancer genotype (HER2), rapidly demonstrated both the promise and the limitations of this approach. (Among the limitations, HER2 is only one of four known and many unknown breast cancer genotypes, and treatment selects for populations of resistant cancer cells, so the cancer can return in a more virulent form.)
Increasingly available data spurs organizations to make analysis easier
Genomics is making headlines in both academia and the celebrity world. With intense media coverage of Angelina Jolie’s recent double mastectomy after genetic tests revealed that she was predisposed to breast cancer, genetic testing and genomics have been propelled to the front of many more minds.
In this new data field, companies are approaching the collection, analysis, and turning of data into usable information from a variety of angles.
What business leaders need to know about data and data analysis to drive their businesses forward.
Foster and Tom have a long history of applying data to practical business problems. Their book, which evolved into Data Science for Business, was different from all the other data science books I’ve seen. It wasn’t about tools: Hadoop and R are scarcely mentioned, if at all. It wasn’t about coding: business students don’t need to learn how to implement machine learning algorithms in Python. It is about business: specifically, it’s about the data analytic thinking that business people need to work with data effectively.
Data analytic thinking means knowing what questions to ask, how to ask those questions, and whether the answers you get make sense. Business leaders don’t (and shouldn’t) do the data analysis themselves. But in this data-driven age, it’s critically important for business leaders to understand how to work with the data scientists on their teams. Read more…
As society becomes increasingly data driven, it's critical to remember big data isn't a magical tool for predicting the future.
If you eat ice cream, you’re more likely to drown.
That’s not true, of course. It’s just that both ice cream and swimming happen in the summer. The two are correlated — and ice cream consumption is a good predictor of drowning fatalities — but ice cream hardly causes drowning.
These kinds of correlations are all around us, and big data makes them easy to find. We can correlate childhood trauma with obesity, nutrition with crime rates, and how toddlers play with future political affiliations.
Just as we wouldn’t ban ice cream in the hopes of preventing drowning, we wouldn’t preemptively arrest someone because their diet wasn’t healthy. But a quantified society, awash in data, might be tempted to do so because overwhelming correlation looks a lot like causality. And overwhelming correlation is what big data does best.
It’s getting easier than ever to find correlations. Parallel computing, advances in algorithms, and the inexorable crawl of Moore’s Law have dramatically reduced how much it costs to analyze a data set. Consider an activity we do dozens of times a day, without thinking: a Google search. The search is farmed out to thousands of machines, and often returns hundreds of answers in less than a second. Big data might seem esoteric, but it’s already here. Read more…