Not just big data, but better data

I was honored to chair O’Reilly’s inaugural edition of Strata Rx, our conference on data science in health care, this past October along with Colin Hill. As we’re beginning to plan this year’s event, I find myself thinking a lot about a theme that emerged from some of the keynotes last fall: in order to solve the problems we’re facing in health care — to lower costs and provide more personal, targeted treatments to patients — we don’t just need more data; we need better data.

Much has been made of the era of big data we find ourselves in. But though the data we collect is straining the limits of our tools and models, we’re still not making the kind of headway we hoped for in areas like health care. So big data isn’t enough. We need better data.

What does it mean to have better data in health care? Here are some things on my list; perhaps you can think of others.

1. Better data is complete

This may seem obvious, but there is a lot of data that never sees the light of day for one reason or another. That means that the data we do see is only a partial representation of what we know. And incomplete data can be worse than no data at all: It makes us think we know something when we don’t. It gives us false confidence, and introduces bias. Better data tells us something about what doesn’t seem to work, not just about what does.

Ben Goldacre spoke well and passionately about this at the Strata Conference in London last October.

2. Better data answers the right questions

Again, this may seem obvious, but if the data you collect doesn’t address the questions you want to answer (or doesn’t provide adequate or appropriate context), then it’s not helping. In health care, we use statistical modeling on the data we collect to mine the past and predict the future, but we often forget that each patient is a full human being, with all of the cultural, habitual, and emotional variables that come with being one. We are not just trying to predict and change the behaviors of viruses and mutant genes, but the behaviors of people. For this purpose, better data is data that accounts for as many as possible of the influences and factors that might affect the outcome.

For more on this, see this piece by Ravi Krishnan of Kaiser Permanente.

3. Better data is reproducible

On stage at Strata Rx, Jamie Heywood of PatientsLikeMe described his efforts to reproduce various studies that he thought showed promise for treating his brother’s ALS. Despite running the experiments more thoroughly and with more animal subjects than the published studies had used, Jamie was unable to reproduce a single result.

His proposed solution to this problem — to throw out all the data we have now and start over completely — is a radical one. But his point stands: better data is data that can be independently confirmed and reproduced.

4. Better data is available in standardized formats

There’s an old engineering joke: “Standards are great! Everyone should have one.” Well, that’s the situation we’ve got in health care, particularly when it comes to electronic health records (EHRs), but also when it comes to research. Every solution provider has their own standard, and believes the rest of the system should adopt their platform. Standards are a hard problem. But if we want better data in order to solve our other, harder problems, then we need to crack this one. Better data is data that can be shared and read by others easily (even years after it was collected), and can also be combined with datasets from other sources.

By the way, it’s not just formats that need to be standardized. Field names can wreak havoc on data, too: the same measurement stored under different labels in two systems can’t be combined until someone reconciles them.

5. Better data is (sometimes) visual

Okay, so this is really a point about how data is analyzed, and not necessarily about the data itself. But it’s important enough that I wanted to include it here anyway. You may be familiar with the classic example of Anscombe’s Quartet, which is a collection of four sets of data with nearly identical summary statistics: the same means and variances, and almost exactly the same correlation and fitted regression line. By the numbers, they seem identical. But as soon as you graph them, you can see right away that they represent very different phenomena.

Anscombe’s Quartet via Wikimedia Commons
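If you’d like to check the numbers yourself, here is a minimal Python sketch that computes the usual summary statistics for each of the four sets. The values are the ones published in Anscombe’s 1973 paper; the function and variable names (summarize, QUARTET, and so on) are my own, chosen only for illustration, and the calculation uses nothing beyond the standard library.

```python
from math import sqrt

# Anscombe's quartet (Anscombe, 1973). Sets I-III share the same x values.
X_SHARED = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X_SHARED,
          [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X_SHARED,
           [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X_SHARED,
            [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(xs, ys):
    """Return means, sample variances, Pearson r, and the least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squared deviations and cross-products around the means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return {
        "mean_x": mean_x,
        "mean_y": mean_y,
        "var_x": sxx / (n - 1),
        "var_y": syy / (n - 1),
        "r": sxy / sqrt(sxx * syy),
        "slope": slope,
        "intercept": mean_y - slope * mean_x,
    }


summaries = {name: summarize(xs, ys) for name, (xs, ys) in QUARTET.items()}

# Print each set's statistics rounded to two decimals for side-by-side reading.
for name, stats in summaries.items():
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

Rounded to two decimal places, the four rows come out essentially the same, which is exactly why the summary table alone tells you so little and the plots tell you so much.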

Again, although this is a point about how we analyze and try to understand our data, it goes to how we can get the most out of the data we collect, even if there’s not a lot of it. Better data is data that helps us understand what’s really going on, and sometimes it needs to be in a visual form in order to do that.

What else should we be mindful of when collecting and using data? What else can help us gather better data, and not just more data? Please leave your thoughts in the comments below. And please join us for Strata Rx 2013, which will be held September 25-27, 2013, at the Boston Marriott Copley Place in Boston, MA. The call for proposals will open in late February, and registration will open in early May.