The Human Genome Project took 13 years to fully sequence a single human’s genetic information. At Washington University’s Genome Center, they can now do one in a week. But when you’re generating that much data, just keeping track of it can become a major challenge in itself. David Dooling is in charge of managing the massive output of the Center’s herd of gene sequencing machines, and making it available to researchers inside the Center and around the world. He’ll be speaking at OSCON, the O’Reilly Open Source Convention. His talk, titled The Freedom to Cure Cancer: Open Source Software in Genomics, will be about how he uses open source tools to keep things under control, and he agreed to talk about how the field of genomics is evolving.
James Turner: Can you start by describing what it is you do and how you came to be doing it?
David Dooling: Sure. I work at the Genome Center at Washington University in St. Louis. We are one of the handful or so of large scale genome sequencing centers around the world. What that means is essentially we participate in large genome sequencing projects that some people may have heard of, like the Human Genome Project, Thousand Genomes Project, things like that. And involved in that is a lot of data processing, laboratory processing, tracking and all sorts of things, so it’s a rather large enterprise.
There are about 300 or so people that work here. And how I came to work here was that about eight years ago, I decided that I wanted to get more into programming and more into open science. So I took a job as a programmer here at the Genome Center and gradually worked my way around to where I am now, where I oversee all of the software development and IT infrastructure here at the Genome Center. And it’s a fairly large IT infrastructure.
We have somewhere around three petabytes of storage online, and somewhere north of 3,000 cores in our computational cluster. And we’re generating terabytes, tens of terabytes of data, per day with our current sequencing instruments. The sorts of things that we’re doing now as we transition from more fundamental evolutionary types of projects, such as the Human Genome Project and subsequent projects like the Mouse Genome Project, we’ve done things like corn and things of that nature, now we’re doing more and more sequencing projects related to medicine and medical sequencing.
Last year, we published the first full cancer genome sequence. In sequencing both the cancer and the normal genome, we were able to determine the differences between those two genomes and begin to identify what might’ve possibly caused cancer in that individual. So projects like that. We’re also doing projects with metabolic syndromes, like diabetes, and several other cancer projects as well. That’s essentially what we’re doing and how we’re doing it and how I got here.
James Turner: Genomics is an area that seems to be on the steep part of the hockey stick curve right now. In just a decade, we’ve gone from sequencing one genome over a period of years to doing them routinely. Can you talk a bit about what’s enabled this acceleration?
David Dooling: Well, a whole host of things. But I think really at the core was the changing fundamentals of sequencing itself. For a long time, DNA sequencing was based on a process invented by Sanger, sometimes called Sanger Sequencing, sometimes called capillary electrophoresis now because of the last revision of the instruments that were generated. But essentially with that approach, you did reactions in 96-well plates. You processed sequence in these 96-well plate chunks. And you did reactions in there. You loaded them on the readers, and the readers read out sequence for each of those 96 wells. So that’s sort of how you processed it. And at the height of that sort of sequencing, which was only a few years ago, we had about 130 or so of those instruments each churning out about 15 to 20 runs per day. Each run gave you 100 pieces of sequence. You had 100 or so machines. And so you got on the order of a few thousand sequence reads, that’s what we called them, because of the way the instrument read the information.
Now, since that time, 454 was first [of the new generation of sequencers] and then Solexa came, which was later bought out by Illumina, and ABI has the SOLiD platform. There’s one from Helicos as well. And then several other third generation, those first being the second generation, sequencers have come out. And what those do is greatly increase the parallelism with which you’re able to process DNA and sequence it. So instead of a few thousand runs per day, or a few thousand reads per day, you may get a few million reads per run. And these runs, for some of the platforms, do take a little bit longer. But the parallelism of it increases your throughput tremendously. And so now we have about 35 to 40 of these highly parallel instruments in-house. And with that, we’re able to sequence the human genome to complete coverage in less than a week.
So the main driver has been this change in the sequencing technology and the parallelism of it. It’s a fundamentally different chemistry, different physics. The flipside of it is that we talked about the hockey stick, and so that hockey stick is the sequencing hockey stick, but it’s brought several other hockey sticks along with it, mainly the amount of data that these things generate. And the amount of processing power that is required to process that data has increased greatly as well. Much faster than Moore’s Law over the last two years or so. Whereas with those original instruments, you would generate on the order of megabytes per day, now we’re doing tens of terabytes per day with these new instruments. And then processing that, instead of taking a single processor a few minutes, it can take a small cluster a few days to actually analyze the data from each of these runs.
Those are the main things. The enabling technology was the change in the sequencing chemistry itself. And then what had to come along with that was building these infrastructures to be able to track these things and process these things and store all of this data as the instruments increased in their abilities.
James Turner: As you mentioned, you’ve moved on from humans to mice and now plants and other life forms. How is the decision made as to what to sequence next?
David Dooling: There’s a lot of things that go into that. Originally, when you were talking about, well, even before the Human Genome Project, there was a large discussion and debate as to what would be the first multi-cellular organism to be sequenced. And the main contenders were C. elegans and Drosophila melanogaster, the fruit fly. C. elegans is a small roundworm. And eventually, C. elegans won, or was chosen I should say. And that was the first multi-cellular organism. That was done as a prototype for the approach that would be used for the Human Genome Project. That essentially was done through a consortium. The funding agencies got people together, there were meetings and discussions. And eventually, one was chosen.
After the Human Genome Project, there were a few organisms that were highly relevant to genetic research. For example, the mouse. There are many models of human diseases in mice on different strains of mice. So that was a natural next step after the Human Genome Project. And then you have other things that are of evolutionary interest, other great apes like the chimpanzee were after the mouse. But at that point the sequencing capacity of the center started to increase. We were still on the former Sanger sequencing and capillary electrophoresis solution, but we got better and better at what we were doing. The machines got more efficient, more reliable. So the capacity increased. And so there began to be a need to fill that capacity with more and more organisms. So what the funding agencies set up was a white paper process where people who were studying different organisms would write a white paper explaining their organism and why it should be sequenced and what it would add to the scientific knowledge. And then those would be reviewed by a panel at the funding agencies and then selected and assigned to the different sequencing centers, or several sequencing centers if the projects were big enough. So that’s how the animals and the plants … were chosen.
Some things were done through their own [funding agencies]. For example, with the corn [genome], that was a joint project between several different funding agencies in the National Institutes of Health. They got together and decided they want to [sequence] this strain of corn or maize. And that was assigned and done. Then there’s been the International Consortium that is driving the 1000 Genomes Project, where they wanted to do the sequencing of hundreds or even thousands of genomes. And that, again, was bandied about by scientific luminaries around the field of genomics, and then a proposal was made. The different funding agencies around the world that direct the centers and give them their money weighed in on these, then the project was done. And then some of it is driven purely through collaborations that the genome centers have, either externally or internally. For example, with the AML sequencing of the first cancer genome, that was done with a collaborator here at Washington University in St. Louis. We actually went and got the funding for that ourselves because we thought it was such a compelling project. So there are a whole host of ways that these things get chosen and done.
James Turner: The Human Genome Initiative was a huge multi-center effort. It sounds like it could be pretty much done in-house by a single facility now. How many facilities are there that can do a complete sequence at this point?
David Dooling: That’s hard to say. I think that number changes every day, but it’s probably in the hundreds, if not thousands, nowadays. Basically, you could do it in a few months with a single instrument, although the project is a little bit different now than it was then. What do I mean by that? We already have the human reference, and the type of data that you get from these instruments is of the nature that you can’t easily do something from scratch, what we call a de novo genomic assembly. But what you can do is take these reads and then align them back to the reference and see, “Oh, I have this chunk of DNA here. Where does that live in the human genome? And how does it differ from the reference human genome?” That’s what we call re-sequencing, sequencing an organism that you already have a reference for. So in that sense, because of what the Human Genome Project has done and the human reference sequence exists, it enables us to be able to do these new sorts of projects in a lot less time, along with new technology.
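The re-sequencing idea described here, placing a short read on an existing reference and noting where it differs, can be sketched in a few lines of Python. This is a toy illustration only: the `align_read` function, the sequences, and the brute-force search are assumptions made for the example, and production aligners use indexed, far more sophisticated methods.

```python
def align_read(reference, read):
    """Slide the read along the reference and keep the offset with the fewest mismatches.

    Returns (offset, differences), where each difference is
    (reference_position, reference_base, read_base).
    """
    best_offset, best_diffs = -1, None
    for offset in range(len(reference) - len(read) + 1):
        window = reference[offset:offset + len(read)]
        diffs = [
            (offset + i, ref_base, read_base)
            for i, (ref_base, read_base) in enumerate(zip(window, read))
            if ref_base != read_base
        ]
        # Strict "<" keeps the leftmost offset when several are equally good.
        if best_diffs is None or len(diffs) < len(best_diffs):
            best_offset, best_diffs = offset, diffs
    return best_offset, best_diffs if best_diffs is not None else []


# Made-up sequences for illustration.
reference = "ACGTTAGCCGATACGGATTC"
read = "GCCGTTAC"

position, differences = align_read(reference, read)
print(position)      # 6
print(differences)   # [(10, 'A', 'T')]
```

The read "lives" at offset 6 of the toy reference and differs from it at one base, which is the kind of answer re-sequencing is after.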
James Turner: It’s kind of like the difference between trying to do a puzzle where you don’t have the box cover and one where you do, I guess.
David Dooling: That’s a fairly good analogy, yes. I mean if you had a puzzle where you didn’t have the box cover and you had four pieces, it would be pretty easy to figure out. If you had three billion, it would be very difficult. If you move away from four and get closer to three billion, it becomes harder and harder. And that’s sort of the case here where if you have lots and lots of small pieces, if you have something to compare it to, it makes the project simpler and more amenable to that sort of data type. Whereas if you get it in big chunks, it’s sort of independent.
James Turner: It sounds like there are a lot of informatics challenges with genomic data. There’s the computational challenge of doing the sequence, which you mentioned. There’s a challenge of managing the resulting data and finding meaning in it. And then there’s the challenge of applying that understanding to a larger population. First of all, did I miss any of the challenges? And second of all, what are the unique problems in each set of those?
David Dooling: Well, let me talk a little bit about each of the ones you did mention, and maybe that’ll bring up some that you didn’t. So as far as just analyzing the data and generating the data, that is computationally intensive because essentially what you’re getting off of these new instruments is pictures, images. And you need to apply algorithms to detect features in those images and then translate those features accounting for different vagaries of chemistry, and then resulting in a sequence, a series of bases: Gs, As, Cs and Ts, the building blocks of DNA. Once you have that information, there’s a whole host of secondary analyses, or analyses of biological relevance if you want to think of it that way, that need to happen, and those are project-specific. So for some sorts of projects, for example a cancer project, you would want to find all of the ways that the DNA that you sequenced differs from the reference, for the tumor, let’s say. And then for the normal, do the same thing. And then for all of those variants, find out which ones are unique to the tumor genome as compared to the normal genome.
And then once you find those variants, you want to figure out what they mean; what they do. The human genome is only about one percent coding into proteins. So that’s what we were always told that DNA does. DNA encodes RNA. And then RNA becomes proteins. And the proteins do the work of the cell. But only one percent of human DNA actually encodes proteins. The rest, some of it may be junk. For a long time, most of it was thought to be junk. But now, as we learn more about the genome, we find that there’s a lot of stuff that may have very subtle effects on what genes are transcribed into proteins and what aren’t, turning genes on and off, if you will. And other regulatory things, non-coding RNAs and all sorts of different things that the DNA that’s not coding into proteins may do. And so you want to find out, of those variations, which ones are doing what. Where do they sit? How might they affect the protein if they sit in that region? How might they affect a promoter region if they sit in that region? So once you find the variations, you want to find out what they do and what they might mean biologically.
Now for other sorts of analysis, for example, if you’re sequencing an organism for the first time, you may be using longer sequence that you get from, say, the 454 instrument, which is a few hundred bases long, as opposed to the Illumina instrument, which is anywhere from 75 to 100 bases long. On those projects, what you want to do is try to assemble those bits and pieces together. While the bits aren’t as big as they used to be with the Sanger sequencing, where they were 700 to 800 bases long, they’re long enough that for simpler genomes without a lot of complex features you can assemble them pretty well together and get an assembled genome of a fair quality. And that can be a very computationally intensive thing.
Another one that you talked about was tracking all of this. When you have 30 to 40 of these machines going, starting and stopping at different times, generating data, the data for a single project can be splayed across several instruments over several months, depending on how the project is done and how busy things are. Keeping track of all of that data, what’s traditionally been called the LIMS, the Laboratory Information Management System, can be challenging. Add on to that the fact that we provide community resources. So we’re not just generating sequence for our own sake; once we generate it, we want to submit it to central repositories so that other people can access it and get information from it and build on the information that we’re providing. That means taking all of that information that was created while you tracked the sample as it went through the pipeline, tacking that on to the actual sequence data and then submitting that in bulk. And nowadays, submitting hundreds of gigabytes per hour in some cases. So that can be a real challenge.
There’s also the challenge of analysis. As we generate more and more data, you’re able to ask more and more complex questions. You talked about the Human Genome Project. We spent years generating the sequence for one genome. Now with 35 to 40 machines, we can generate lots of sequence on lots of different humans. And so that opens up a whole new line of analysis of comparative genomics where you’re comparing human to human to human to human genomes.
This was an analysis that wasn’t even feasible two years ago. Now it’s something that’s becoming more and more routine. Not only do you have to do a lot of analysis that requires disk space and CPU time, but you also want to track this analysis in the same way that you track your laboratory processing because you want to track which genomes you have analyzed and which you haven’t. What sorts of parameters were used on this genome versus that genome? You may want to analyze a single genome several different ways to figure out which sets of tools and processing profiles and input parameters and filters and all of these things provide the best answer, the most realistic answer, give you the most sensitivity and high specificity, these sorts of things.
Then you want to create the ability to compare and contrast all of the results that you get so that you can discern what are the good results; what are the bad results; what makes sense; what doesn’t make sense? Essentially what we’re finding is that we’re creating the equivalent of a laboratory information management system for analysis so that we can not only do the analysis in a computational way, but also track it and keep good data provenance for the data that’s out there and the analysis and the results.
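One way to picture this kind of "LIMS for analysis" is a small record that ties each result to the genome, tool, and parameters that produced it. The Python sketch below is purely illustrative: the `AnalysisRun` class, its fields, the `register` function, and the tool and parameter names are all hypothetical and do not describe the Genome Center's actual system.

```python
import hashlib
import json
from dataclasses import dataclass, field


@dataclass
class AnalysisRun:
    """Hypothetical provenance record for one analysis of one genome."""
    genome_id: str
    tool: str
    tool_version: str
    parameters: dict = field(default_factory=dict)

    def key(self) -> str:
        # Serialize with sorted keys so identical settings always hash identically.
        payload = json.dumps(
            {
                "genome_id": self.genome_id,
                "tool": self.tool,
                "tool_version": self.tool_version,
                "parameters": self.parameters,
            },
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def register(registry: dict, run: AnalysisRun) -> bool:
    """Store the run; return False if an identical analysis was already recorded."""
    run_key = run.key()
    if run_key in registry:
        return False
    registry[run_key] = run
    return True


registry = {}
first = AnalysisRun("genome-A", "aligner", "1.0", {"max_mismatches": 2, "seed_length": 28})
repeat = AnalysisRun("genome-A", "aligner", "1.0", {"seed_length": 28, "max_mismatches": 2})
variant = AnalysisRun("genome-A", "aligner", "1.0", {"max_mismatches": 3, "seed_length": 28})

print(register(registry, first))    # True: new analysis
print(register(registry, repeat))   # False: same genome, tool, and parameters
print(register(registry, variant))  # True: different parameters, tracked separately
```

Keying each run on its full set of inputs is one simple way to answer "which genomes have been analyzed, and with what parameters?" without re-running anything.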
One of the difficulties that you didn’t mention was just interoperability. The pace at which all of this stuff is changing and moving is very fast. We have significant resources inside that we’re always trying to push the envelope and improve how we do things. Everybody else that’s working on these systems is doing the same sorts of things. Always pushing that envelope but still being able to compare and contrast what one side is doing versus another side, sharing the data, getting it back and forth. Sharing software, software that’s often made with great rapidity, often highly customized to a single environment, the environment that you’re developing in, just so that you can develop it more and more quickly. Sharing software. Sharing data. That’s a real significant challenge because of the pace at which things are changing now.
James Turner: People can get huge amounts of data about their genome for a few hundred dollars now. And the day is coming soon when a full genome sequencing is just going to be a matter of swiping your credit card. The actual genome itself is only about three gigabytes, but in itself, it’s pretty meaningless to somebody. There are already some good open source tools and databases to help people look at their single nucleotide polymorphism data, but having used one, they’re pretty geeky to use. Is it coming time that we’re going to need the equivalent of a Firefox for the genome?
David Dooling: Maybe a Firefox plug-in, right? I think everything is trending in that direction. With the consumer tools that are available now, you’re essentially getting information on SNPs, single nucleotide polymorphisms. They call them genome scans, but that’s essentially what you’re getting. Knome will sequence your genome for a couple hundred thousand dollars or $100,000. But yeah, I mean the cost of sequencing will be below $50,000 for a whole human genome by the end of this year. And it could be significantly lower than that.
So the difficulty is when you talk about the sequence itself being not that useful, that’s absolutely true. But the other side of that is the interpretation that we’re able to put on a lot of the genome is also not that useful. There’s only a small subset of the genome that we really understand well what it does and how it behaves, and if you change it, what that means for our development or health or what have you. There are vast swaths of the genome that we really don’t understand what they do and how they necessarily interact with the rest of the genome or with the proteins or the cell.
So commensurate with increasing technology and driving down the cost, it is going to create a wealth of data that we’re going to have to find new ways to mine and get information from so that we can better understand what these vast stretches of the genome, that we currently don’t have a good understanding of, do. Once we have that, then you’re going to see — and I mean it’s going to be a leap-frogging thing where more and more data is generated. We gain more and more understanding. And they keep leap-frogging each other. And what’s going to happen is you’re going to see, as opposed to a big splash of a version 1.0 of some sort of genome browser, you’re going to see gradual increase. Gradual on the relative scale. I mean, it’s obviously going to be rapid on the time scale of genetic studies; it’ll be over the next few years.
But what you’re going to see is our ability to take a whole human genome sequence and really be able to map that into the health risks and disease risks that these SNP companies are essentially layering on top of their SNP data for people. It’s going to be a much wider scope of information. But it’s also going to be a lot more complex because the human body, the cell, it’s a very complex system. The sort of information that you’re going to get from it is going to be more advisory and probabilistic as opposed to some sort of binary, “Yes/No. You’re going to get disease X, Y, Z.”
James Turner: So it’s not going to be Gattaca; it’s going to be much more, “You might die by the time you’re 30.”
David Dooling: Yeah. Well, even in Gattaca, there was some wiggle room, right? Because he ended up on the spaceship, right? So yeah, genomics is not destiny, I would say. I think as we come to understand it more and more, I think what we will be understanding are the subtleties, as opposed to the black and whites. I think the black and whites are already pretty clear. What we’re going to be investigating for the time being is the shades of gray of what does this change mean and how does it affect your susceptibility to this.
James Turner: You mentioned earlier you’re currently working on cancer genomics. It sounds like if you’re comparing a noncancerous cell to a cancerous cell, that would be fairly straightforward: this is a different thing. Where do the complexities come in?
David Dooling: I think the scale is difficult, for one. Also, the linking of the variations between the two to something that’s biologically meaningful. It certainly was a lot harder a year ago when we started this because the throughput of the machines was such that you needed 100 runs or so to get the amount of data that you want. Now you need somewhere around three or four runs to get what you need for a single human genome. So I think the difficulty is getting further and further downstream. Initially, the difficulty was getting these instruments to work right, getting them to behave well in a laboratory setting, to perform reliably. Then it was aggregating all of this data and developing the tools to process it and make sense of it.
We have that pretty well under control. Now the difficulty is pushing further and further down the pipeline, if you will. Now we can run these instruments. We can generate a lot of data. We can align it to the human reference. We can detect the variants. We can determine which variants exist in one genome versus another genome. Those variants that are cancerous, specific to the cancer genome, we can annotate those and say these are in genes. These ones that are in genes change the protein that that gene is encoding. These ones are in regulatory regions. These ones are in non-repetitive regions. So we can do all of those things relatively easily. Now the difficulty is following up on all of those and figuring out what they mean for the cancer. And, okay, we know they’re different.
We know that they exist in the cancer genome, but which ones are drivers and which ones are passengers? And what that means is which are the ones that were the cancer initiating events and which are the ones that are just coming along for the ride and don’t really have anything to do with the tumor genesis or the disease itself but, just because cancer is so out of control and lots of things have gone wrong in the cell, finding which ones are actually causative is becoming more and more the challenge now.
The other big challenge is this is a sort of multigenome analysis, and the ability to look at a bunch of genomes and begin to see patterns and figure out what’s important and what’s not important.
James Turner: Just to clarify; when you talk about going against a reference, are you going against one genome reference or are you comparing it with the same individual in a noncancerous area?
David Dooling: Both actually. So the first thing that we do with these sequence reads when they come off the instrument is, for a human, we would align them to the human reference. There’s an international consortium that curates the human reference. It’s essentially a chromosome by chromosome description of the human genome. Now once we have that, what we do is detect differences between the sequence that we have and that reference. Then what we do is take all the variants that are already known; there are databases of known variation. One of them is called dbSNP, the Database of Single Nucleotide Polymorphisms. We’ll compare the variations that we got against the variations in dbSNP, and we’ll throw away all the ones that exist in dbSNP because those are known to be just normal variation from one individual to another.
Then we’ll compare the variants, for example, the variants that still remain in a tumor genome versus the variants that still remain in the normal genome. In this way, it’s sort of a whittling down. But we do the alignments first. We find out what’s different. We end up with a much smaller number of pieces of information to follow and to process. We take those. We whittle them down even further by getting rid of the ones that have already been previously described. Then we go down and compare the tumor and the normal. And then we go from somewhere around 2.5 million, well, we go from somewhere around 90 gigabases of sequence information to 3.5 million variations to then somewhere around a few hundred variations that are different in the tumor than in the normal.
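The whittling-down described above maps naturally onto set operations. The following Python sketch is a simplified illustration under stated assumptions: variants are represented as made-up `(chromosome, position, reference_base, observed_base)` tuples, the `known` set merely stands in for a database such as dbSNP, and the `whittle` function is a hypothetical name rather than the Center's pipeline.

```python
def whittle(tumor_variants, normal_variants, known_variants):
    """Return variants seen in the tumor that are neither previously known nor in the normal."""
    # Step 1: discard variants that are already catalogued as normal variation.
    tumor_novel = set(tumor_variants) - set(known_variants)
    normal_novel = set(normal_variants) - set(known_variants)
    # Step 2: keep only what remains in the tumor but not in the normal.
    return tumor_novel - normal_novel


# Toy data: (chromosome, position, reference_base, observed_base)
known = {("chr1", 100, "A", "G")}
normal = {("chr1", 100, "A", "G"), ("chr2", 250, "C", "T")}
tumor = {("chr1", 100, "A", "G"), ("chr2", 250, "C", "T"), ("chr7", 900, "G", "A")}

tumor_specific = whittle(tumor, normal, known)
print(sorted(tumor_specific))  # [('chr7', 900, 'G', 'A')]
```

In the toy data, the first variant is dropped because it is already known, the second because the normal genome carries it too, leaving one tumor-specific candidate to follow up.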
James Turner: Just out of curiosity, just between any two cells in the body, is there a significant variation?
David Dooling: There shouldn’t be, but there can be, yes. So between two humans, the difference is about one every thousand or so bases, I believe, if I recall correctly. And between two cells, one cell right next to the other, they should be identical copies of each other. But sometimes mistakes are made in the process of copying the DNA. And so some differences may exist. However, we’re not currently sequencing single cells. We’ll collect a host of cells and isolate the DNA from a host of cells. So what you end up with when you read the sequence out on these things is, essentially, an average of this DNA sequence. Well, I mean it’s digital in that eventually you get down to a single piece of DNA. But once you align these things back, if you see 30 reads that all align to the same region of the genome and only one of them has an A at the position and all of the others have a T at that position, you can’t say whether that A was actually some small change between one cell and its 99 closest neighbors or whether that was just an error in the sequencing. So it’s hard to say cell-to-cell how much difference there is. But, of course, that difference does exist; that’s mutation, and that’s what eventually leads to cancer and other diseases.
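The 30-read example can be made concrete with a small Python sketch. It tallies the bases that aligned reads report at one position and applies a cutoff; the `allele_fractions` and `supported_bases` helpers and the 20 percent threshold are assumptions chosen for illustration, not values from the interview or from any particular variant caller.

```python
from collections import Counter


def allele_fractions(bases_at_position):
    """Fraction of aligned reads reporting each base at a single genome position."""
    counts = Counter(bases_at_position)
    depth = sum(counts.values())
    return {base: count / depth for base, count in counts.items()}


def supported_bases(fractions, min_fraction=0.2):
    """Bases backed by at least min_fraction of reads (threshold is an arbitrary assumption)."""
    return sorted(base for base, fraction in fractions.items() if fraction >= min_fraction)


# 30 reads cover the position: 29 report T and one reports A.
pileup = "T" * 29 + "A"
fractions = allele_fractions(pileup)

print(round(fractions["A"], 3))    # 0.033
print(supported_bases(fractions))  # ['T']
```

With a single supporting read out of 30, the A falls well under any reasonable cutoff, so the data cannot distinguish a rare cell-level change from a sequencing error.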
James Turner: To finish off, apart from your talk you’re giving at OSCON, which is going to be on a lot of the technology that you’re using here and how it fits in with open source, is there anything that you are particularly looking forward to seeing or you think would be really cool at the conference?
David Dooling: Well, we use a lot of Perl here, so I’m looking forward to the State of the Onion. I’m looking forward to a lot of the talks on databases in the database track to get a better understanding of what other people are doing, because there’s a lot of disciplines that are dealing with significantly more data than they had five or ten years ago. Or there are some disciplines that didn’t even exist five or ten years ago that are dealing with a lot of data nowadays. I’m interested in finding out how people are dealing with that and what tools they’re using. One of the things that we’re looking to develop as a resource for other people in addition to the sequence that we generate is also software. A lot of places don’t have the infrastructure computationally or storage-wise that we have, so we’re looking to develop tools that would enable more and more people to leverage things like Open Science Grid and cloud computing to be able to analyze these data, these sequence data. So I’m interested in attending those sorts of talks as well.
James Turner: All right. Well, David, thank you so much for taking the time to talk to us. It was really fascinating, and I’m sure anyone who comes to your talk at OSCON is going to come away with their eyes wide open.
David Dooling: I hope so. Thank you.