
Challenges for the New Genomics

New guest blogger Matt Wood heads up the Production Software team at the Wellcome Trust Sanger Institute, where he builds tools and processes to manage tens of terabytes of data per day in support of genomic research. Matt will be exploring the intersection of data, computer technology, and science on Radar.

The original Human Genome Project was completed in 2003, after a 13-year worldwide effort and a billion dollar budget. The quest to sequence all three billion letters of the human genome, which encodes a wide range of human characteristics including the risk of disease, has provided the foundation for modern biomedical research.

Through research built around the human genome, the scientific community aims to learn more about the interplay of genes, and the role of biologically active regions of the genome in maintaining health or causing disease. Since such active areas are often well conserved between species, and given the huge costs involved in sequencing a human genome, scientists have worked hard to sequence a wide range of organisms that span evolutionary history.

This has resulted in the publication of around 40 different species’ genomes, ranging from C. elegans to the chimpanzee, from the opossum to the orangutan. These genomic sequences have advanced the state of the art of human genomic research, in part by helping to identify biologically important genes.

Whilst there is great value in comparing genomes between species, the answers to key questions of an individual’s genetic makeup can only be found by looking at individuals within the same species. Until recently, this has been prohibitively expensive. We needed a quantum leap in cost-effective, timely individual genome sequencing, a leap delivered by a new wave of technologies from companies such as Illumina, Roche and Applied Biosystems.

In the last 18 months, new horizons in genomic research have opened up, along with a number of new projects looking to make a big impact (the 1000 Genomes Project and International Cancer Genome Consortium to name but two). Despite the huge potential, these new technologies bring with them some tough challenges for modern biological research.

High throughput
For the first time, biology has become truly data driven. New short-read sequencing technologies offer orders of magnitude greater resolution when sequencing DNA, sufficient to detect the single-letter changes that could indicate an increased risk of disease. The cost of this enhanced resolution comes in the form of substantial data throughput requirements: a single sequencing instrument can generate terabytes of data a week, more than any previous biological protocol has produced. How data of this scale can be efficiently moved, analyzed, and made available to scientific collaborators (not least how it can be backed up) is the subject of intense activity and discussion in biomedical research institutes around the globe.
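
As a purely illustrative back-of-envelope sketch (the weekly output and sustained network throughput below are assumed figures, not measurements from any particular instrument or site), even moving a single instrument's weekly output across a network is a substantial undertaking:

```python
# Back-of-envelope: how long does it take just to move one instrument's
# weekly output across a network? Both figures are illustrative assumptions,
# not measurements from any particular instrument or site.

weekly_output_tb = 5.0      # assumed terabytes produced per instrument per week
link_gbit_per_s = 1.0       # assumed sustained network throughput

total_bits = weekly_output_tb * 1e12 * 8
hours = total_bits / (link_gbit_per_s * 1e9) / 3600
print(f"{weekly_output_tb:.0f} TB over a {link_gbit_per_s:.0f} Gbit/s link: "
      f"about {hours:.0f} hours of sustained transfer")
```

And that is before replication, re-analysis and backups multiply the traffic.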

Very rapid change
Scientific research has always been a relatively dynamic realm to work in, but the novel requirements of these new technologies bring with them unprecedented levels of flux. Software tools built around these technologies must bend and flex with the same agility as the frequently updated and refined laboratory protocols and analysis techniques that underlie them. A new breed of development approaches, techniques and technologies is needed to help biological researchers add value to this data.

In a very short space of time the biological sciences have caught up with the data and analysis requirements of other large scale domains, such as high energy physics and astronomy. It is an exciting and challenging time to work in areas with such large scale requirements, and I look forward to discussing distribution, architecture and the networked future of science here on Radar.

  • http://semanticlifescience.wordpress.com/ Ntino

    Looking forward to reading how Sanger uses Amazon Web Services for scaling up its computing needs!

  • http://mndoci.com Deepak

    Matt

    Great to see you blogging at Radar. As biology gets more and more data intensive, the way we think about software, data, and collaboration is all going to be impacted, and it’s nice to see someone right in the middle of things blogging about it to a tech audience. Can’t wait for #2 (and 3 and 4 and so on)

  • http://www.mymeemz.com Alex Tolley

    “In a very short space of time the biological sciences have caught up with the data and analysis requirements of other large scale domains, such as high energy physics and astronomy.”

    I’ve been out of bioinformatics for over 2 years, but I don’t think that biology has come even close to the data generation of physics or astronomy.

    The LHC generates around 15 million gigabytes of data a year (15 petabytes) that need to be shared, stored and analysed.

    For comparison, that might be the equivalent of generating the data for 1-10 million full human genome sequences (a rough version of this arithmetic is sketched at the end of this comment). At a target price of $1000/genome sometime in the next 10 years, that would cost $1-10 billion. At current prices it is 3 orders of magnitude higher, i.e. in the $1-10 trillion range.

    Biology is definitely on the upward sweep of the data-generation curve, but other sciences haven’t stood still; each has increased its data rates as technology availability and costs have allowed.
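
    A rough sketch of the arithmetic above; the raw data volume per genome is an illustrative assumption (raw short-read data takes considerably more space than a finished 3-gigabase sequence):

```python
# Rough arithmetic behind the LHC comparison above. The raw data volume
# per genome is an illustrative assumption, not a measured figure.

lhc_bytes_per_year = 15e15       # ~15 petabytes per year
raw_bytes_per_genome = 5e9       # assumed ~5 GB of raw read data per genome

genomes_equivalent = lhc_bytes_per_year / raw_bytes_per_genome   # ~3 million
cost_at_target = genomes_equivalent * 1_000      # at the $1000/genome target
cost_today = cost_at_target * 1_000              # ~3 orders of magnitude more now

print(f"~{genomes_equivalent / 1e6:.0f}M genomes' worth of raw data")
print(f"~${cost_at_target / 1e9:.0f}B at target prices, "
      f"~${cost_today / 1e12:.0f}T at today's prices")
```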

  • http://basiscraft.com Thomas Lord

    I agree that there are exciting challenges in those areas. One challenge is to keep the biologists in check: I’ve seen some pretty egregiously doubtful assumptions about computation made (in the realm of short-read (re)sequencing). That funny sort of problem aside: yes, there are a lot of fun CS brain-teasers in that field.

    I disagree with the following statement, although I agree with what was probably really meant or what would be mutually agreeable in its place: “A new breed of development approaches, techniques and technologies is needed to help biological researchers add value to this data.”

    Pish posh to “new”.

    The thing about the scale of this data and the most typical processing tasks — that scale relative to the scale of hardware — is that it takes us back in time, from a CS perspective. It takes us back in time to an era where a data set is something to be streamed out of or back into tertiary storage in small chunks and secondary memory is usually too small to solve problems in a single pass. It takes us back to “tape algorithms”. It takes us back to an era where, for the size of the units of data being transferred, “postal net” is often (at least) competitive with wired transport on both bandwidth and latency.

    From another angle the new genomics also takes us back in time to pre-SQL days — back to physical databases as the fundamental abstraction (Stonebraker’s recent column-oriented work notwithstanding — but it’ll be a long time before that’s directly applicable to genomics). That is, we’re more often worried about directly handling indexes and hand-writing the implementations of our queries — more the “Berkeley DB” end of the spectrum, less “MySQL” — that kind of thing (not meaning either product literally — just to make clearer to a broader audience the physical DB vs. SQL (logical) DB distinction I’m making).

    The main “new” thing is the economics of coarse-grain MIMD parallelism — the fact that we can build clusters on the cheap rather than having to run, one after another on a single core, the components of a computation over the data set. But even that “new” thing is pretty well understood: a Connection Machine-style map-reduce topology with something about as complex as AWK at the nodes will do. I wrote a first-cut version of the AWK-like part already and demonstrated its prowess for the Church lab (not that I suspect they’ve done anything useful with it). Used it to demonstrate some fancy applied regexp theory in aid of the quixotic quest for certain forms of short-read re-sequencing. The distribution glue for such a system is still an open problem, although the emerging map-reduce-inspired platforms are the general direction… (a couple of toy sketches of these ideas, the awk-ish streaming node and the hand-rolled on-disk index, follow this comment).

    What all that implies is that the output of these machines that generate terabytes of data quickly is likely to be aggregated — there’s one set of businesses housing this data and providing a platform for querying it and a separate set of businesses generating the data — the latter pipelines to the former. I would suspect the “sweet spots” are at three scales: aggregation / query machines for the scale of today’s independent medical testing labs; aggregation / query machines for the scale of biotech enterprises such as drug companies; global aggregation / query machines such as Google might offer as a side business. Feed your raw sequencing data to such an aggregator and run your queries there.

    The same raw technology — tape-like tertiary storage, physical db tools, a simple (“awkish” in some spiritual sense) node configuration language, and simple flexible patterns like map-reduce — that same platform has lots of applications and is what a lot of the truly successful “cloud” of cloud computing will consist of.

    -t
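
    A minimal sketch of the kind of streaming, “awk-ish” node processing described above. This is not Thomas’s actual tool; the input format (one read per line on stdin) and the k-mer length are assumptions made purely for illustration:

```python
# Toy map-reduce-style node: stream reads from stdin one line at a time
# (never holding the whole data set in memory), count k-mers locally, and
# merge partial counts as a reduce step. Purely illustrative: the input
# format (one read per line) and the k-mer length are assumptions.
import sys
from collections import Counter

K = 21  # assumed k-mer length


def map_node(lines):
    """Map step: stream reads and build local k-mer counts."""
    counts = Counter()
    for line in lines:
        read = line.strip().upper()
        for i in range(len(read) - K + 1):
            counts[read[i:i + K]] += 1
    return counts


def reduce_nodes(partial_counts):
    """Reduce step: merge the partial counts from many map nodes."""
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total


if __name__ == "__main__":
    # A single node over stdin; a real deployment would fan map_node out
    # across a cluster and feed the results to reduce_nodes.
    print(reduce_nodes([map_node(sys.stdin)]).most_common(5))
```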
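
    And a toy illustration of the “physical database” point: handling an on-disk index directly and hand-writing the lookup, rather than going through a SQL layer. The standard-library dbm module stands in for a Berkeley DB-style store; the file name and key scheme are assumptions made for illustration:

```python
# Toy "physical database" flavour: keep an on-disk k-mer index directly with
# the standard-library dbm module and hand-write the lookups, instead of going
# through a SQL layer. The file name and key scheme are illustrative assumptions.
import dbm


def add_observation(index_path, kmer):
    """Bump the stored count for one k-mer in the on-disk index."""
    with dbm.open(index_path, "c") as db:
        count = int(db.get(kmer.encode(), b"0"))
        db[kmer.encode()] = str(count + 1).encode()


def lookup(index_path, kmer):
    """Hand-written 'query': read one k-mer's count straight from the index."""
    with dbm.open(index_path, "c") as db:
        return int(db.get(kmer.encode(), b"0"))


add_observation("kmer_index.db", "ACGTACGTACGT")
print(lookup("kmer_index.db", "ACGTACGTACGT"))   # -> 1
```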

  • http://mndoci.com Deepak

    Alex

    Good point, but I’d like to add that the kinds of data that modern sequencers churn out are accessible even to individual labs, and there are a lot of them. Also, the sheer throughput and acceleration in data production are quite amazing. Expect to see throughput increase 2-3 times in the next few years (at least). High energy physics and astronomy are still the realm of the few, and projects move on the “decade” timescale. Two years is too long in modern genomics.

  • Falafulu Fisi

    Alex said…
    I’ve been out of bioinformatics for over 2 years, but I don’t think that biology has come even close to the data generation of physics or astronomy.

    Umm! It depends on what you do. The bioinformatics institute at the local university runs its DNA sequencing on a supercomputer, and a run takes hours. Bioinformatics (and computational biology in general) falls within the domain of scientific computing, not because biology is a branch of science but because the algorithms used in the analysis are memory-intensive and numerically intensive. Computational economics is another domain of scientific computing, again because of the algorithms it uses (economics being a social science rather than a natural science). I don’t work in the domain of bioinformatics, but I know almost all the analytic algorithms used in their analysis since the algorithms are universal: physicists use them, economists use them, mechanical engineers use them, electrical engineers use them, and so forth.

    One area that I find interesting in computational biology is the modeling of protein folding, which has huge (potential) applications in designing effective drugs. Protein folding modeling is computationally very intensive.

  • Falafulu Fisi

    Correction:

    I know almost all the analytic algorithms used in their analysis

    meant to say:

    I know some of the analytic algorithms used in their analysis

  • http://www.mymeemz.com Alex Tolley

    Falafulu Fisi: What I did and what I had knowledge about are different things. I don’t do astronomy, but I am aware of the data rates of the Hubble Space Telescope. I well remember the zeitgeist that biology data would dwarf that of other disciplines, like atomic physics, climate modeling, etc. It just hasn’t happened, and those other disciplines weren’t static in their data generation either.

    The examples you provided are, however, not data intensive but CPU intensive. The protein folding problem is indeed a hard one, but the amino acid sequence is small. Even if the modeling required the data of other proteins, e.g. chaperones, the size of the data set is small. A database of known sequences and shapes would be relatively small too, in comparison to other sciences.

    Deepak: You make a good point in that the volume could come from lots of independent data sources, like cheap sequencers. Moreover, the growth rate could be faster than that of large physics or astronomy projects. If all the unique data were submitted to Entrez, the data set could get very large indeed in this scenario, but as I think I showed, the sample size would have to increase quite dramatically (with the associated costs) before it was comparable to the LHC data alone. However, there are other data sources that are also growing quickly and have the potential to dwarf even biological data, e.g. sensor data streams. Video alone is a massive and growing data source, and we have only just started generating it in volume.

    No question that data storage, search and wrangling (to use Bruce Sterling’s term) will be very big problems to tackle in this century.

  • http://basiscraft.com Thomas Lord

    @Alex,

    Falafulu Fisi: What I did and what I had knowledge about are different things. I don’t do astronomy, but I am aware of the data rates of the Hubble Space Telescope. I well remember the zeitgeist that biology data would dwarf that of other disciplines, like atomic physics, climate modeling, etc. It just hasn’t happened, and those other disciplines weren’t static in their data generation either.

    Give it a week, so to speak. This is a thing that is just beginning.

    The various kinds of sequencing machines (with more in the pipeline) give you lots of “reads” of fragments of a sample of DNA. From these you infer various facts about the complete sequence (in some cases, the complete sequence itself).

    Those machines are going through a kind of exponential growth in capacity (e.g., the number of reads per dollar-hour doubles every N months). So, world-wide, the amount of such data is exploding.

    The final thing is that there is a large backlog of computational experiments people want to do on large aggregates of those individual data sets. So, when you look at total world-wide output of reads today, realize that in a few years there will be lots of aggregated caches of that much data and more with people wanting to run queries over the entire aggregate.

    Envelope: 750 MB per human genome * a sample of 20M humans = 15 petabytes of data — about the projected output of the LHC in a year, and 4/5ths of what all of Google is said (by Wikipedia) to process in a day. (This arithmetic is spelled out in the short sketch after this comment.)

    At $1,000 per human genome (a price projected to be achieved soon, at a profit no less) that’s $20B. That’s about 3-4 LHCs, which sounds like an awful lot, but it’s also less than 1% of what the US spent on health care in 2007, which makes it sound like not much at all.

    So, 20M people, let’s say, get sequenced for medical purposes, their health plans footing the bill, and they agree to anonymously share their data and some demographic factoids “for science”. The data gets aggregated at, say, NIH, but anyone can pay the costs to have a copy shipped to them.

    There are a lot of queries we’d like to run over the entire 15-petabyte aggregate.

    Oh, and while it would theoretically take the LHC a fully productive year to produce that much data, we’ll (soonish) be able to produce that much sequencing data in weeks or months.

    And that’s just humans. I used just humans to keep the numbers easy for envelope purposes, but how about we sequence a few million insect species? A few tens of millions of plant species? Bacteria, etc.…

    30,000 base pairs per penny…. step right up, please consider donating a copy of your data for scientific research, have a nice day…

    -t
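
    The envelope arithmetic above, spelled out with the same figures (750 MB per finished genome, 20 million people, $1,000 per genome, roughly 3 billion base pairs):

```python
# The envelope arithmetic from the comment above, using the same figures:
# 750 MB per finished genome, 20 million people, $1,000 per genome,
# and ~3 billion base pairs per genome.

bytes_per_genome = 750e6
people = 20e6
price_per_genome = 1_000
base_pairs = 3e9

total_petabytes = bytes_per_genome * people / 1e15     # 15 PB, ~1 LHC-year
total_cost_billion = people * price_per_genome / 1e9   # $20B, roughly 3-4 LHCs
bp_per_penny = base_pairs / (price_per_genome * 100)   # ~30,000 base pairs

print(f"{total_petabytes:.0f} PB of data, ${total_cost_billion:.0f}B to sequence, "
      f"{bp_per_penny:,.0f} bp per penny")
```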

  • http://www.nextbio.com/b/home/home.nb Lisa Green

    Excellent post Matt!

    It is indeed an exciting time for biology as it becomes more data-driven. These comments contain quite a bit of discussion on exactly how much data is being generated by biological science relative to other fields of science, but I think everyone would agree that modern biological science is generating vast amounts of data.

    Of all the activity and discussion surrounding biological data, most exciting to me are the ideas about how best to make data of this scale available to collaborators.
    I work for a company called NextBio. The goal of NextBio is to create a publicly accessible online repository of biological data. The reason I am passionate about my work is that I sincerely believe that improving how we share data will significantly improve the way biological science is done.

    I look forward to reading your upcoming posts and hearing your thoughts on new methods for making data available to scientific collaborators.

  • http://www.mymeemz.com Alex Tolley

    Thomas, you make a compelling case. However, I do not think the $1000/genome cost is coming soon, unless that means about a decade. As I recall, after the human genome was sequenced, there was very little market for sequencers as demand declined, which upset a lot of manufacturers. We’ll see how demand pans out this time. I don’t disagree that the rate of bio data is increasing very fast; I’m just not convinced that the breathy comments take into account that astronomy and physics are growing exponentially too, especially astronomy. None of this has any bearing on the computing and storage demands, which I agree are going to be challenging and interesting.

  • http://basiscraft.com Thomas Lord

    Alex, you wrote: “I do not think the $1000/genome cost is coming soon, unless that means about a decade.”

    A decade isn’t that far out and would still imply that today is the time (yesterday, really) to be working on that eventuality. Additionally, my sense is that it will be less than 10 years because people keep extending “read lengths” on affordable short-read gathering and they don’t have to be all that long before the $1000 target is a done deal (AFAICT).

    As I recall, after the human genome was sequenced, there was very little market for sequencers as demand declined, which upset a lot of manufacturers.

    That would help explain 23andme and other efforts to find new monetizations of this area of technology.

    The other thing I’ve heard is that global demand for the technology is growing at a good clip, even if certain vendors aren’t favored by this development. Part of the reason people in the US are keen to keep going in this area is fear of a knowledge gap between “West and other,” so to speak.

    But, sure: there’s no reason (or deep basis) to paint this as “genomics beats physics and astronomy for data quantity” — it’s all in the same ballpark for practical purposes.

    -t
