Open science will be a key part of the health data equation

Dr. Stephen Friend on open science and the need for a "GitHub for scientists."

Unlocking the potential of health data for the public good means balancing health privacy with innovation, and that balance will rely on improving informed consent. If the power of big data is to be applied to scientific inquiry in health care, whether to unlock genetic secrets, find a cure for breast cancer or enable “preemptive health care,” both scientific culture and technology will need to change.

One element of that change could include a health data commons. Another is open access in the research community. Dr. Stephen Friend, the founder of Sage Bionetworks, is one of the foremost advocates of what I think of as “open science.” Earlier in his career, Dr. Friend was a senior vice president at Merck & Co., Inc., where he led the pharmaceutical company’s basic cancer research program.

In a recent interview, Dr. Friend explained what open science means to him and what he’s working on today. For more on the synthesis of open source with genetics, watch Andy Oram’s interview with Dr. Friend and read his series on recombinant research and Sage Congress.

Can you describe what open science means to you and the impact it is having on the world?

Friend: I use two different ways of referring to open science. One is related to who is involved in the project and the other is how they do their work.

For the most part, biomedical research works almost as if life were some giant MMORPG, where you get designated as a biomedical researcher or a radiologist or a patient. When you go to collect data or put a study together, there are unsaid rules about who can have access to what data. We feel as if it should be more about roles earned. If you are a citizen and you get to know your disease very well, then that should allow you to participate in building disease models in a certain way. That’s not because you’re a patient, but due to whatever “level” you’ve achieved.

The other element of open science is how people do that work. This has to do with the fact that we live in a very closed information system. With regard to medicine, it’s probably as tight as China keeps the Internet, maybe tighter, or as tight as Tehran kept cell phone coverage during the Arab Spring. We need to build models to be able to get our hands on lots of data. We need to have many people participating in different roles, to extend that concept of open. We think the data that is out there is very precious, and it needs to be reused and it needs to be widely available.

Are we at a tipping point with open data and open access movements?

Friend: There are two themes that I think have different scores: 1) the technology to support the use of open data and 2) the culture.

Even in the biomedical area, where data is locked up by insurers, locked up by hospitals, locked up by researchers, there’s been significant progress in terms of technology to bring that together. I wouldn’t say it’s by any means all open, but you may know about the Blue Button effort, which makes it so you can get your electronic medical records. This part of the technology for biomedical research should not be seen as a barrier.

With regard to culture, there is still a well-entrenched medical industrial establishment that holds data for the purpose of them being the ones who can make insights. That’s true whether it’s the academic researcher or institution that wants to ensure it can file patents before work is done; a foundation that wants to be the best foundation in disease X and, therefore, wants to have the discoveries coming from that; or new companies that are finding genomic information. There is still a pervasive culture of hoarding data.

Sage Bionetworks develops tools that let patients keep their own data rather than storing it at a particular institution. What has been the uptake or usage of those tools?

Friend: Up until recently, data that was “open” — I’m going to call this “accessible” — was rarely usable. Each time you went and checked it out, like a book, it came in a format where you would start from scratch, basically with raw data, and then curate it. If you talk to the communities that are building models of disease, more than three-quarters of their time is in taking accessible data, making it usable and curating it.

One of the main projects that we’ve had going now for a year-and-a-half is building a compute space, which I like to think of as a geek’s sandbox. It’s called Synapse. It’s basically a GitHub for scientists.

For doing biomedical research, development of algorithms and sharing consequences, we thought there needed to be a way to have workflows for scientists that included the provenance of who had done what and what I think of as “micro attribution.” To go from a world where technology allows more open accessible data to people actually working as groups before publication and speeding it up, we felt there needed to be a way for people to get their credit to be able to post. Synapse is a way to increase teams working with each other in the ways they define.

“GitHub for scientists” is a great metaphor, but there’s quite a bit of activity on GitHub. What’s happening on Synapse?

Friend: I think GitHub has millions of users and projects. Synapse is in beta release. We have a defined set of about 100 users that are trying it out right now. It’s going to be months before it gets into robust numbers. Remember also that the number of biomedical researchers who can work in that space will never be in the millions. I don’t think it ever will get to GitHub scale.

What are you hearing about the balance between health data utility and patient privacy in your discussions within the biomedical community?

Friend: It’s a big, essential area. Virtually all of the data generated today is controlled by the institutions that collected it.

Just to frame how much of an issue it is, consider that virtually every time a clinical study is done or someone gathers data on a patient, institutional review boards and HIPAA recommendations look very carefully at what that institution is going to do with the data, whether it’s academic or commercial, because of a real legacy of misuse. There are stringent rules covering everything from things about which you would think, “Why wouldn’t people want to share those?” down to more nascent information about your genome, your risks at a DNA or a protein level. It basically keeps researchers who want to share data between hospitals from doing it. Similarly, it prevents patients from sharing with each other.

One solution, which I think was initiated by George Church, is the Personal Genome Project (PGP). John Wilbanks, who was a key part of Creative Commons, has been working to completely switch the discussion to who is controlling the data.

A point that Wilbanks makes came out of the PGP effort: what would happen if you gave the control to citizens? What would happen if you had a system where the patient, when a sample was taken, said, “That is part of me you’re taking. I should have the right to that data. And when I have the data, I’d like to share it in this mode, meaning ‘aggressively,’ or I want to share it only with a set of people who I highly trust.”

If you shift control from the institution to citizens, that becomes something that has been embraced, I think, by the institutional review board world. The particular tool that has been involved is called Portable Legal Consent (PLC). You can learn about it at

Does the Supreme Court decision on the Patient Protection and Affordable Care Act offer some protections to people in terms of access to insurance, despite preexisting conditions that might be shown in genetic data?

Friend: I think the answer is “yes” because you used the term “some protections,” not “solved.” There are two points to make here.

One is that I think we need to be careful. There are lots of things that haven’t been dreamed of that people could do with personal information. We are entering a world where entire swathes of our lives are tracked by others. I think a really good point that Jamie Heywood at PatientsLikeMe has made is that, for the most part, patients and citizens have no idea what insurance companies or industry or institutions are actually doing with their data. That needs to see the light of day as it is.

The second point is that what I thought I would never tell someone 10 to 20 years ago has really been opened up by social media. That is, the concepts of where privacy is and what you are comfortable sharing have changed.

To make it very stark, when you talk to someone who has had five or six therapies in a row that have failed, or if you talk to someone who has Parkinson’s and is looking at months left of life, they’ll sit there in the ICU and they will laugh at you when you ask, “Are you worried about privacy?” They say, “I have no privacy. Everyone has seen every part of my body. I’ve lost all. What I want is someone else to benefit — not me to benefit from my information, but someone else.”

One of the projects we’re doing now, called the Real Names Discovery Pilot, is one where a person’s real name, their entire genome sequence, all of the molecular and clinical data about them, and data about how they’re doing every day, are going to be put up on the web as a longitudinal cohort study, for anyone in the world to get to.

I think we have to stretch the boundaries and see what happens.

