New guest blogger Matt Wood heads up the Production Software team at the Wellcome Trust Sanger Institute, where he builds tools and processes to manage tens of terabytes of data per day in support of genomic research. Matt will be exploring the intersection of data, computer technology, and science on Radar.
The original Human Genome Project was completed in 2003, after a 13-year worldwide effort and a billion-dollar budget. The quest to sequence all three billion letters of the human genome, which encodes a wide range of human characteristics including the risk of disease, has provided the foundation for modern biomedical research.
Through research built around the human genome, the scientific community aims to learn more about the interplay of genes, and the role of biologically active regions of the genome in maintaining health or causing disease. Since such active areas are often well conserved between species, and given the huge costs involved in sequencing a human genome, scientists have worked hard to sequence a wide range of organisms that span evolutionary history.
This has resulted in the publication of around 40 different species’ genomes, ranging from C. elegans to the Chimpanzee, from the Opossum to the Orangutan. These genomic sequences have advanced the state of the art in human genomic research, in part by helping to identify biologically important genes.
Whilst there is great value in comparing genomes between species, the answers to key questions of an individual’s genetic makeup can only be found by looking at individuals within the same species. Until recently, this has been prohibitively expensive. We needed a quantum leap in cost-effective, timely individual genome sequencing, a leap delivered by a new wave of technologies from companies such as Illumina, Roche and Applied Biosystems.
In the last 18 months, new horizons in genomic research have opened up, along with a number of new projects looking to make a big impact (the 1000 Genomes Project and International Cancer Genome Consortium to name but two). Despite the huge potential, these new technologies bring with them some tough challenges for modern biological research.
For the first time, biology has become truly data driven. New short-read sequencing technologies offer orders of magnitude greater resolution when sequencing DNA, sufficient to detect the single-letter changes that could indicate an increased risk of disease. The cost of this enhanced resolution comes in the form of substantial data throughput requirements, with a single sequencing instrument generating terabytes of data a week, more than any previous biological protocol. The methods by which data of this scale can be efficiently moved, analyzed, and made available to scientific collaborators (not least the challenge of backing it up) are cause for intense activity and discussion in biomedical research institutes around the globe.
Very rapid change
Scientific research has always been a relatively dynamic realm to work in, but the novel requirements of these new technologies bring with them unprecedented levels of flux. Software tools built around these technologies are required to bend and flex with the same agility as the frequently updated and refined underlying laboratory protocols and analysis techniques. A new breed of development approaches, techniques and technologies is needed to help biological researchers add value to this data.
In a very short space of time the biological sciences have caught up with the data and analysis requirements of other large-scale domains, such as high energy physics and astronomy. It is an exciting and challenging time to work in areas with such large-scale requirements, and I look forward to discussing the role of distribution, architecture and the networked future of science here on Radar.