Data sharing drives diagnoses and cures, if we can get there (part 2)

How the field of genetics is using data within research and to evaluate researchers

Editor’s note: Earlier this week, Part 1 of this article described Sage Bionetworks, a recent Congress they held, and their way of promoting data sharing through a challenge.

Data sharing is not an unfamiliar practice in genetics. Plenty of cell lines and other data stores are publicly available from such places as the TCGA data set from the National Cancer Institute, Gene Expression Omnibus (GEO), and ArrayExpress (all of which can be accessed through Synapse). So to some extent the current revolution in sharing lies not in the data itself but in critical related areas.

First, many of the data sets are weakened by metadata problems. A Sage programmer told me that the famous TCGA set is enormous but poorly curated. For instance, different data sets in TCGA may refer to the same drug by different names, one using the generic name and another the brand name. Provenance, a clear description of how the data was collected and prepared for use, is also weak in TCGA.
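
The drug-name mismatch is the kind of problem a curation step can address. As a minimal sketch, assuming a hand-maintained synonym table (the table, the record layout, and the function names are illustrative and not part of TCGA or Synapse), records from different data sets could be grouped under one canonical generic name:

```python
# Illustrative synonym table mapping brand names to generic names.
# A real curation effort would draw on a maintained drug vocabulary.
BRAND_TO_GENERIC = {
    "gleevec": "imatinib",
    "herceptin": "trastuzumab",
}


def normalize_drug_name(name):
    """Return a canonical lowercase generic name for a drug label."""
    key = name.strip().lower()
    return BRAND_TO_GENERIC.get(key, key)


def group_by_drug(*datasets):
    """Group records from several data sets under one canonical drug name.

    Each data set is a list of dicts with a "drug" key (a hypothetical layout).
    """
    grouped = {}
    for records in datasets:
        for record in records:
            canonical = normalize_drug_name(record["drug"])
            grouped.setdefault(canonical, []).append(record)
    return grouped


if __name__ == "__main__":
    set_a = [{"drug": "Gleevec", "sample": "A1"}]
    set_b = [{"drug": "imatinib", "sample": "B7"}]
    print(group_by_drug(set_a, set_b))
```

Names missing from the table pass through unchanged (lowercased), so the sketch never silently drops a record. It only fails to merge synonyms it does not know about.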

In contrast, GEO records tend to contain good provenance information, but only as free-form text, which presents the same barriers to searching and aggregation as free-form text in medical records. Synapse is developing a structured format for presenting provenance based on the W3C’s PROV standard. One researcher told me this was the most promising contribution of Synapse toward the shared use of genetic information.
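
To see why structured provenance is easier to search than free-form text, consider a minimal sketch that borrows three ideas from PROV: entities (data sets), activities (processing steps), and the "used" and "wasGeneratedBy" relations between them. The class layout and the identifiers below are assumptions made for illustration. This is not Synapse's actual format, nor a full PROV implementation.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Activity:
    """A processing step (PROV "Activity")."""
    id: str
    used: Tuple[str, ...]  # ids of input entities (PROV "used")
    agent: str             # who ran the step (PROV "wasAssociatedWith")


@dataclass(frozen=True)
class Entity:
    """A data set or file (PROV "Entity")."""
    id: str
    was_generated_by: Optional[str] = None  # activity id; None for raw data


def upstream_sources(entity_id, entities, activities):
    """Return the ids of every entity that the given entity was derived from."""
    seen = set()
    stack = [entity_id]
    while stack:
        entity = entities.get(stack.pop())
        if entity is None or entity.was_generated_by is None:
            continue  # unknown or raw entity: nothing further upstream
        for source_id in activities[entity.was_generated_by].used:
            if source_id not in seen:
                seen.add(source_id)
                stack.append(source_id)
    return seen


if __name__ == "__main__":
    # Hypothetical three-step pipeline: raw -> normalized -> clustered.
    activities = {
        "normalize": Activity("normalize", ("raw",), "lab-a"),
        "cluster": Activity("cluster", ("normalized",), "lab-b"),
    }
    entities = {
        "raw": Entity("raw"),
        "normalized": Entity("normalized", "normalize"),
        "clustered": Entity("clustered", "cluster"),
    }
    print(upstream_sources("clustered", entities, activities))
```

With records like these, "which raw data does this result depend on?" becomes a graph traversal rather than a reading exercise.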

Data can also be inaccessible to researchers because it reflects the diversity of patient experiences. One organizer of Army of Women, an organization that collects information from breast cancer patients, says it’s one of the largest available data repositories for this disease, but it is rarely used because researchers cannot organize it.

Fragmentation in the field of genetics extends to nearly everything that characterizes data. One researcher told me about his difficulties combining the results of two studies, each comparing responses of the same genetic markers to the same medications, because the doses they compared were different.

The very size of data is a barrier. One speaker surveyed all the genotypic information that we know plays a role in creating disease. This includes not only the patient’s genome–already many gigabytes of information–but other material in the cell and even the parasitic bacteria that occupy our bodies. All told, he estimated that a complete record of our bodies would require a yottabyte of data, far beyond the capacity of any organization to store.

Synapse tries to make data easier to reuse by encouraging researchers to upload the code they use to manipulate the data. Still, this code may be hard to understand and adapt to new research. Most researchers learn a single programming language such as R or MATLAB and want only code in that language, which in turn restricts the data sets they’re willing to use.

Sage has clearly made a strategic choice here to gather as much data and code as possible by minimizing the burden on the researcher when uploading these goods. That puts more burden on the user of the data and code to understand what’s on Synapse. A Sage programmer told me that many sites with expert genetics researchers lack programming knowledge. This has got to change.

Measure your words

Standardized data can transform research far beyond the lab, including the critical areas of publication and attribution. Current scientific papers bear large strings of authors–what did each author actually contribute? The last author is often a chief scientist who did none of the experimentation or writing on the paper, but organized and directed the team. There are also analysts with valuable skills that indirectly make the research successful.

Publishers are therefore creating forms for entering author information that specifies the role each author played, called multidimensional author descriptions. Data mining can produce measures of how many papers each author has worked on and the relative influence of each. Universities and companies can use these insights to hire good candidates to fill the particular skills they need.

One of the first steps to data sharing is simply to identify and label the data, at the relevant granularity. For scientific data, one linchpin is the Digital Object Identifier (DOI), which uniquely identifies each data set. When creating a DOI, a researcher provides critical metainformation such as contact information and when the data was created. Other researchers can then retrieve this information and use it when determining whether to use the data set, as well as to cite the original researcher. Metrics can determine the “impact factor” of a data set, as they now do for journals.
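
As a rough illustration of how a DOI ties metadata and metrics together, the sketch below stores the kind of metainformation described above and counts how many papers cite each data set. The field names, the made-up DOIs, and the simple citation count are all assumptions. Real impact metrics are more elaborate, and this is not how any particular registry works.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetRecord:
    """Metainformation registered alongside a data set's DOI (hypothetical fields)."""
    doi: str
    creator: str
    contact: str
    created: str  # ISO 8601 date, e.g. "2013-04-20"


def citation_counts(papers):
    """Count how many papers cite each data set DOI.

    Each paper is a dict with a "cited_dois" list (an assumed layout).
    A DOI is counted at most once per paper.
    """
    counts = Counter()
    for paper in papers:
        counts.update(set(paper["cited_dois"]))
    return counts


if __name__ == "__main__":
    # Made-up DOIs for illustration only.
    record = DatasetRecord("10.5555/example.1", "A. Researcher",
                           "a.researcher@example.org", "2013-04-20")
    papers = [
        {"title": "Paper one", "cited_dois": ["10.5555/example.1"]},
        {"title": "Paper two", "cited_dois": ["10.5555/example.1",
                                              "10.5555/example.2"]},
    ]
    print(record.doi, citation_counts(papers)[record.doi])
```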

Sage supports DOIs and is working on a version layer, so that if data changes, a researcher can gain access both to the original data set and the newer ones. Clearly, it’s important to get the original data set if one wants to reproduce an experiment’s result. Versioning allows a data set to keep up with advances, just as it does for source code.
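
The idea of a version layer can be sketched in a few lines. This is a minimal illustration under assumed names (`VersionedDataset`, `publish`, `get`), not a description of how Sage implements versioning. Each published version is kept, so a researcher can ask for either the latest data or the exact version an earlier experiment used.

```python
class VersionedDataset:
    """Append-only history of a data set's contents (illustrative only)."""

    def __init__(self, doi):
        self.doi = doi
        self._versions = []  # index 0 holds version 1

    def publish(self, data):
        """Store a new version and return its 1-based version number."""
        self._versions.append(data)
        return len(self._versions)

    def get(self, version=None):
        """Return the requested version, or the latest if none is given."""
        if not self._versions:
            raise LookupError("no versions published")
        if version is None:
            return self._versions[-1]
        if not 1 <= version <= len(self._versions):
            raise LookupError(f"unknown version {version}")
        return self._versions[version - 1]


if __name__ == "__main__":
    dataset = VersionedDataset("10.5555/example.1")  # made-up DOI
    dataset.publish({"samples": 100})
    dataset.publish({"samples": 120})
    print(dataset.get(1))  # the original, for reproducing an earlier result
    print(dataset.get())   # the newest version
```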

Stephen Friend, founder of Sage, said in his opening remarks that the field needs to move from hypothesis-driven data analysis to data-driven data analysis. He highlighted funders as the key force who can drive this change, which affects the recruitment of patients, the collection and storage of data, and collaboration of teams around the globe. Meanwhile, Sage has intervened surgically to provide tools and bring together the people that can make this shift happen.
