ETech Preview: Science Commons Wants Data to Be Free

John Wilbanks has a passion for lowering the barrier between scientists who want to share information. A graduate of Tulane University, Mr. Wilbanks started his career working as a legislative aide, before moving on to pursue work in bioinformatics, which included the founding of Incellico, a company which built semantic graph networks for use in pharmaceutical research and development. Mr. Wilbanks now serves as the Vice President of Science at Creative Commons, and runs the Science Commons project. He will be speaking at The O’Reilly Emerging Technology Conference in March, on the challenges and accomplishments of Science Commons, and he’s joining us today to talk a bit about it. Good day.

John Wilbanks: Hi, James.

JT: So science is supposed to be a discipline where knowledge is shared openly, so that ideas can be tested and confirmed or rejected. What gets in the way of that process?

This photograph is licensed to the public under the Creative Commons Attribution-Share Alike 3.0 license by Fred Benenson

JW: Well, most of the systems that scientists have evolved to do that: sharing, confirmation and rejecting, evolved before we had the network. And they’re very stable systems, unlike a lot of the systems that we have online now, like Facebook. For science to get on the Internet, it has to really disrupt a lot of existing systems. Facebook didn’t have to disrupt an existing physical Facebook model. And the scientific and scholarly communication model is locked up by a lot of interlocking controls. One of them is the law. The copyright systems that we have tend to lock up the facts inside scientific papers and databases, which prevents a lot of the movement of scientific information that we take for granted with cultural information.

Frequently, contracts get layered on top of those copyright licenses, which prevent things like indexing and hyperlinking of scholarly articles. There’s also a lot of incentive problems. Scientists and scholars tend to have an incentive to write very formally. And the Internet, blogging, email, these are all very informal modalities of communication.

JT: What role does Science Commons play in improving communication between scientists?

JW: Well, if we’re successful, what we want to do is to get to the point where the only problems we have in scholarly communication are technical. This is inspired by Jim Gray from Microsoft, who years ago said, “May all of your problems be technical.” If we can get the law out of the way, and let the traditional norms of science, which really are the ideals of community and of sharing of information apply, you can’t really claim credit for something in science unless you publish it after all. If we can bring those norms into the Internet age with things like standard copyright licenses that Creative Commons has developed, with explorations of new ways to track impact, bringing ideas like trackback that came from the blog world to the scientific communication world. If we can help convince people that the public domain is something to be cherished, and not a thing to be avoided at all costs when it comes to things like data.

If we can make biological material and other sorts of physical research materials move around the world, the way that Amazon moves books around. And if we can make the web work for data the way it works for documents. Right? Those are the things that we really want to do. And if we can do those things, we think that the innate nature of science, which is publishing, which is community-based, which is about sharing information and remixing information, those norms are going to take over if we can simply get the resistance out of the way.

So what we are trying to do is to intervene in places where legal tools, technical tools, policy tools can lower those barriers that are currently preventing the real emergence of the traditional scholarly norms on the internet.

JT: I wanted to go back for a second just to something you had mentioned in your previous answer. Scientific papers tend to be written, as you mentioned, in a very specific dense, passive voice manner. And I’ve talked to my wife who has a science background about it. And she says, “Well, that’s just the way that you present information in science.” Do you think that gets in the way?

JW: I think so. I think there’s a little bit of a Guild mentality, you know, in terms of the language and structure and flow of these papers. It’s taken me some time to learn how to read them. And it’s artificially idealized I think. Because you’re trying to present what happened in the lab one day as some fundamental truth. And the reality is much more ambiguous. It’s much more vague. But this is an artifact of the pre-network world.

There was no other way to communicate this kind of knowledge other than to compress it. Now when we think about compression, we think about zip algorithms. But in the old days, compression meant writing it down on physical paper and mailing it to someone. That was the only way to condense and convey the knowledge in anything representing an effective way. But the problem is, we’ve digitized that artifact of the analog world. And so the PDF is basically a digital version of paper with all of the attendant technical benefits.

So what we need to do is both think about the way that we write those papers, and the words and the tone and how that really keeps people out of science. It really reduces the number of scientists. But we can also think about the possibilities the Internet brings us, if we think about the article as simply a token for several years of research, which include data, which include lab notebooks, which include research materials, software. All of these things that are put into the advertisement that is the paper. Those things can be really very powerful on the Internet. And we can also begin to think about interpreters. You know, there’s an org called ACAWIKI, which I sit on the board of, whose goal is to write human-readable summaries of scientific papers in a wiki format. So we can begin to use some of these tools to really crack the code, to a certain extent, of scholarly writing. And people don’t have to change right away. We can change their works for them, as long as the rights are available.

But if those papers aren’t available, we can’t summarize them on ACAWIKI. We can’t hyperlink to them all the materials in the data sets. We have to wait for the publishers to give us those services, or to sell us those services.

JT: So in some sense, like the IRC chat that the researchers had about their research might be as valuable as the final paper itself.

JW: It would be a piece of the story, you know? We’re wired for storytelling as people. We’re not wired for statistical uncertainty. We try to fit things into narratives. And being able to understand the back-story is very important. So what happens in the hallways at conferences is very analogous to what happens in IRC chats or what happens on blogging and comments on wikis. And we’re starting to see, for the first time, the emergence of blogs and wikis, especially in the newer disciplines, as something that’s very akin to the hallway conversations or the bar conversations at a conference.

But until now, we’ve really never had the ability to put all of that stuff together with the data. I mean can you imagine the human genome before a computer? Even if we could’ve sequenced it? Right? Someone just hands you page after page after page of As, Ts, Cs and Gs.

JT: The infamous Encyclopedia Britannica of genome.

JW: Right. I mean the computer enables all of these things, but the scientific practices are so stable that they’re really resisting change. And so I think of it as the way that we communicate scientifically has evolved as this formal system. And like most good systems, it’s stable against disruption. And it’s stable against bad disruption. But it’s also stable against good disruption. So the entire system, each time you look through, is there a legal problem or an incentive problem or a workload problem, each time you sort of peel those layers, you typically find yourself back at the beginning. When you solve the one round, you’re back at a new legal problem.

And we sort of have to keep pushing the rock up the hill. And the digital Commons methodology, sort of broadly speaking, of standard contracts, distributed workloads like in Wikipedia, where there’s a lot of people doing it. And the incentive problem drops away if you have enough people, because it doesn’t matter why any one person contributes to Wikipedia as long as enough do. That trio, which is the same trio of problems, can become a trio of solutions. And so we have to sort of focus — at the same time we try to change the existing system, we have to cultivate the growth of these systems that use the same tools to build open systems.

JT: How does the desire to share knowledge coexist with the desire to monetize discoveries?

JW: That’s a good question. Right now, they’re pretty tightly linked in a lot of places. And that does create problems. The degree of those problems is a matter for very heated debate. I tend to fall on the side that I don’t really have a problem with monetizing discoveries. I don’t really have a problem with the idea of patents. I have a problem with the way patents sometimes work in day-to-day business. But for the most part, those are things that are tough to solve in licensing methodologies. Those need to be solved by patent reform, things like the crowd sourcing of prior art of patents which I think is a fantastic idea, which is going to solve a ton of problems. The USPTO is badly overworked. If the crowd can help them find prior art, then a lot of patents that aren’t really novel won’t make it through anymore.

And I think those are the — because the debate over monetization tends to turn on patents. I have a much bigger problem with people who try to lock up taxpayer funded literature. The way that we have had this difficulty in getting the clinical trial papers that our tax dollars have funded, that really burns me, because that’s monetizing at the granular level in a way that makes it very hard to do sort of A) first read the stuff you paid for with your tax dollars, but B) take that information and really make it digital. Hyperlink it. Convert it. Mix it. Rip it. Spindle. Mutilate. All of that good stuff that we do with data and information on the web. That’s the stuff that I want people doing. I mean it depresses me that we have so much more innovative programming researchers going into Facebook than we do into clinical trial data. And into the life sciences. But in many ways, that’s a function of the culture, of trying to attach value to every datum and every article, financial value. Instead of thinking about the collective value we could get if that stuff was a common resource.

When it comes to people getting patents out of that, if those are meaningful patents, if they’re valid, I don’t have that much of a problem with, as long as they’re not licensed abusively.

JT: Right. I mean in one sense, a patent is kind of an open sharing of information because in patenting something, you have to kind of open kimono everything about it.

JW: Right. And we just launched a project today with Nike and Best Buy, which is trying to rebuild some of the traditional academic research exemptions available to it to patent portfolios and private companies, especially for sustainability technology. And as part of that, we’re going to be exploring how to deal with one of the core patent problems, which is what people would call the hold up problem or the thicket problem. Right now what happens is there is no research exemption, and there are no sort of click and download and sign contracts for patents. And that means that what happens is basically you go create value, but you create value while you’re violating someone’s patents. And then after you’ve created value, you have to go negotiate, which is the worst time to negotiate. Because they’ve done nothing and you’ve made it valuable and now you have to give them money.

Now if you could get to the point where before you tried to create value, you could simply scan the patent landscape, identify some of the technologies you wanted to work with, and acquire the rights before you created value, or at least understand what those rights would cost you, that significantly lowers a lot of the problems that we’re dealing with. And if you could put that together with some meaningful crowd sourced review of prior art and some judicious narrowing of the scope of patents for things like genes, I think the patent system works a lot better right now than the copyright system does. I think it needs tweaking in a lot of ways, not sort of a dedicated common use licensing system like the GPL.

JT: In some ways, I was just thinking what you’re really talking about is something more analogous to, if I go and say I want to build a .NET based application, I don’t go build it and then Microsoft says, “Oh, that’s pretty valuable so you owe us a million dollars. But if it isn’t valuable, you only owe us $10,000.”

JW: Yeah. It’s not utterly dissimilar to that. I think that’s the kind of thing that could make the patent system we have function a lot better. I mean could you imagine if every time you built software, then you had to go get the rights from Oracle to use the database? I mean it’d be great for Oracle.

JT: Yeah. I was going to say I think Oracle likes that business model, but —

JW: Well, Oracle might like that business model, but I would argue that the total number of people who’d be willing to write code to Oracle goes up as a function, right? So in the sciences, the example would be the polymerase chain reaction which is sort of the Xerox machine for DNA. This was a fundamental invention. It enabled the emergence of the biotech industry. But it was made available in a nonexclusive patent license that was standard. And everyone signed it. Whereas if you’d had to negotiate, if you’d had to go argue, only people who are really skilled at negotiating would’ve gone and gotten the rights.

And a lot of application, a lot of that sort of churn of stuff that didn’t turn out to be useful, that would’ve been chilled. And so maybe the one in 100 idea that exploded and became a major technology wouldn’t ever have been tried because no one would’ve believed it. It’s very hard to quantify what doesn’t happen as a result of this stuff. You can’t get a number on it. And that’s one of the hardest things economically about making the argument is you’re trying to argue about how much innovation isn’t happening.

JT: That’s kind of like the how much music isn’t sold because of piracy.

JW: Right. It’s hard to figure that out. And that’s why we tend to try to focus on the opportunities that are available to make money, as well as to have the open innovation happen. We want both of those things to happen. Markets are not an evil thing. But you want those markets to be operating on a set of rules that really trends towards social good, too. And the commons, the voluntary contractual private commons appears to be a pretty good way to get some of those market forces going in that direction.

JT: One area that is clearly under attack is the traditional model of the expensive scientific journal, through mechanisms like the Public Library of Science. How successful is that movement being?

JW: Well, I mean I would say that it’s become an adolescent? Which means it’s trying to steal dad’s car, and it’s acting up. It’s made it out of early childhood, that’s for sure. The Public Library of Science has become a very high-impact, very respected journal publisher. It’s at the highest levels of scientific quality. And their business model is still developing. And I think that their new PLoS ONE venture, which is a new online only thing, and their upcoming hubs work which is going to build communities, those are going to be really interesting things to watch.

In terms of sort of proving itself from a business perspective, BioMed Central, which has nearly 250 journals, I believe, under Creative Commons licenses, was sold in December to Springer. My understanding is that BMC’s annual revenues were in the 15 million pounds per year range. Again, not using any sort of copyright transfer when they were bought by Springer. And so that really was, I think, a vindication of the capability of a for-profit model that was open. And I love to point to Hindawi, which is in Egypt, which is also profitable, which has another few hundred journals under a CC BY license. So we’re certainly seeing some proof points that this can be high-quality and this can be profitable. But there’s still a lot of uncertainty as to how the existing journals adapt to that. It’s much easier to start from scratch with a new model than it is to change midstream. I mean, if you think back to DEC and how hard the transition to the microcomputer and the personal computer was for DEC. This is no less fundamental of a change than the change from the mainframe to the microcomputer in terms of models for these publishers.

And so it’s very scary. And I think we have to be open and honest and accept that as a valid emotion, and try to work with the existing publishers and especially with the scholarly societies who don’t have the million dollars to invest in R&D on scientific publishing that a big publisher might have. These are operating to the bone. And so we need to work with them, and help them find ways to make the transition in a way that doesn’t destroy them at their core.

The danger is that this was the way the music industry went. I mean that something like iTunes comes along and kills the industry. We don’t want that to happen. We want a healthy ecosystem. We want a competitive market. We want robust publishing houses inside scholarly societies. But we want to move that into a direction that allows this sort of remix and mash up of information in science. And so we just have to help them find models, publishing models, business models, legal models that help them make that transition and be part of the solution with them.

JT: The volumes of data that are being produced in science today are somewhat staggering. You have petabytes for the Large Hadron Collider, if they ever get the thing running, as well as huge amounts accumulating in genomic and bioinformatic databases. How can the scientific community effectively share these kinds of massive collections?

JW: You know, it’s hard to say. I mean some of these telescopes that are going up in the next five years, they’re talking about a petabyte a minute, you know? It’s just staggering. And so that’s stuff that from just a physical perspective, pushing those bits around is going to be slow. And so we’re not going to have lots of copies of these things. But I think that we have a couple of things that have to happen. One is that the communities involved have got to come to some agreement on meaning. And by meaning, I mean sort of standard names for things and relationships between things. Ontologies. Hierarchies. Taxonomies.

Things like data models for the SQL database but at a global web scale. Because in the absence of those, these are just piles of numbers and letters. And that’s hard to do. It’s really hard if you expect there to ever be sort of a final agreement on that list of names and relationships. So we’ve got to find both technical ways and social ways to have lots of different points of view represented and evolving and integrating.

And so if everyone locks up their own point-of-view, if everyone locks up their ontology, if everyone locks up their data model, then it’s unlikely that we’re going to get to this world where you can sort of pop one in and pop one out. So I think that’s the first thing we have to have, is that we’ve got to have a web of data. And a web of data means we need common names for things, common URLs, common ways to reference things. And that is beginning to happen. Those battles are being fought on obscure listservs and places like the Web Consortium and in different scientific disciplines. But those are going to become hot button topics outside the sort of core Semantic Web geek community in the next couple of years because they just have to. Otherwise, this data just becomes, again, a pile of numbers.
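As a rough illustration of why common names matter, here is a minimal Python sketch, not anything Science Commons has published. It assumes two hypothetical labs that both identify genes by the same made-up `http://example.org/gene/...` URIs; the records, field names, and the `merge_by_uri` helper are all invented for this example. Because the identifiers match, the two data sets can be combined without any per-lab translation step.

```python
# Hypothetical records from two labs. The URIs and field names are
# invented for illustration and do not refer to any real service.
lab_a = [
    {"id": "http://example.org/gene/BRCA1", "expression": 2.4},
    {"id": "http://example.org/gene/TP53", "expression": 0.9},
]
lab_b = [
    {"id": "http://example.org/gene/BRCA1", "pathway": "DNA repair"},
    {"id": "http://example.org/gene/EGFR", "pathway": "growth signaling"},
]


def merge_by_uri(*datasets):
    """Combine records that share the same identifier URI."""
    merged = {}
    for dataset in datasets:
        for record in dataset:
            # Records with the same URI land in the same entry.
            entry = merged.setdefault(record["id"], {})
            entry.update({k: v for k, v in record.items() if k != "id"})
    return merged


combined = merge_by_uri(lab_a, lab_b)
```

If each lab had used its own private name for the same gene, the merge would silently produce separate entries, which is the "pile of numbers" outcome described above.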

Another thing that has to happen is we’re going to have to develop some meaningful ways to federate that data, so that it’s not vulnerable to capture or to failure. And when you’re talking about data at that scale, it becomes important to understand what to keep and what to forget. And we’re not very good right now at forgetting stuff. It’s not in our culture, because it just felt like we could just keep storing everything. But if you’re talking about an exabyte a day, and you have ten projects doing an exabyte a day, the cost of storing and serving that is such that it’s unlikely that it’s going to be widely mirrored. So we have to find some way to either federate that stuff or to figure out what to forget. And I think figuring out what to forget might be the hardest part.
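To put the scale in that answer into numbers, here is a back-of-the-envelope Python sketch. The ten projects and the exabyte a day come from the answer above; the 365-day year, the decimal definition of an exabyte, and the choice of three mirrors are assumptions made only for this example.

```python
EXABYTE = 10**18  # bytes, using the decimal (SI) definition


def yearly_volume_bytes(projects, bytes_per_day, days=365):
    """Total bytes produced in one year by all projects combined."""
    return projects * bytes_per_day * days


def mirrored_volume_bytes(volume, mirrors):
    """Bytes that must be stored if every byte is kept at each mirror."""
    return volume * mirrors


total = yearly_volume_bytes(projects=10, bytes_per_day=EXABYTE)
total_eb = total // EXABYTE

# Assumed mirror count of three, chosen only to show how storage scales.
three_mirrors_eb = mirrored_volume_bytes(total, 3) // EXABYTE

print(f"One year of data: {total_eb} EB")
print(f"Kept at three mirrors: {three_mirrors_eb} EB")
```

Storage grows linearly with the number of full copies, which is why wide mirroring becomes unlikely at this scale and why the choice narrows to federating the data or deciding what to forget.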

JT: Right. That’s almost like the guy who’s got a million boxes in his attic. And when you ask him if he’ll throw it out, “Well, I might need it some day.”

JW: Right. You never know. So if we’re going to do that, then we have to federate. We have to figure out how to deal with preservation and federation because our libraries have been able to hold books for hundreds and hundreds and hundreds of years. But persistence on the web is trivial. Right? The assumption is well, if it’s meaningful, it’ll be in the Google cache or the Internet Archive. But from a memory perspective, what do we need to keep in science? What matters? Is it the raw data? Is it the processed data? Is it the software used to process the data? Is it the normalized data? Is it the software used to normalize the data? Is it the interpretation of the normalized data? Is it the software we use to interpret the normalization of the data? Is it the operating systems on which all of those ran? What about genome data?

So we did a lot of serial analysis of gene expression, called SAGE. It was looking at one gene at a time and seeing what it did. We can now look at the entire human genome. Should we keep the old data? We have a new machine that gives us higher resolution, more accurate data. Should we keep the old stuff? Right? No one’s really dealing with these questions yet.

JT: Right. I mean you think about in astronomy data, you may have higher resolution data, but that old data may have the asteroid that was moving through at that particular time it was taken, and you’d really like to have that data.

JW: Right. So discipline by discipline, the norms over what to forget and what to keep are going to be very different. Right? Anything that measures time or moments in time, you know, geographic time, astronomic time, that’s probably going to be really valuable. But 50-year-old physics experiments, maybe not so much. Fifty-year-old genomics experiments, maybe not so much.

JT: Unless it’s drift in the genome, in which case, it would be.

JW: Right. I mean this is — you and I could talk about this for about two hours. It’s a really hard question. There’s typically a really good argument on every side.

JT: Right. It’s whoever has the best budget for disk drives.

JW: Right. And ideally, what we could say is, “Okay. Well, we can delete the old data if there’s a physical sample.” Right? Because we can go back and recreate the genome data from an even smaller piece of it to check genetic drift. Right? So suddenly, we’ve gone from the world of the digital back to the world of the physical samples as well.

JT: So is there some data that’s just too sensitive or too potentially dangerous to share generally? Where do issues like privacy or national security factor into a project like Science Commons?

JW: So privacy is very important. There’s a very strong privacy right in the United States on personal health information. And certainly, data about people that can be identified about their health, about their lives, that is stuff that really ought to be in the control of the individual. So from a legal perspective, from a copyright perspective, it would be in the public domain, but it would be subject to privacy regulation. And in my ideal world, people would be able to make informed choices about when and how that data got shared.

As it is — and the informed part of that tends to be pretty hard to achieve. For a long time, your health records were less protected than your video rental records. So the regulation of data is sort of scatter shot. But I really hope in particular that individuals are empowered to take charge of their own data. When it comes to national security, I tend to think that if the government collects data that’s national security data, then it’s sort of up to them whether or not they want to release that stuff. When it comes to things like infrastructure, it’s pretty easy to find all the data you want about power plants just using Google Earth. And the genie is sort of out of the bottle in terms of geographic information and a lot of other information.

It’s incredible what you can find on the web if you’re relatively good at web searching. And so my instinct is that we should use the power of the community and the crowd to try to mitigate risk rather than trying to suppress information to prevent the emergence of risk. I think the risk is there, I think, for things like someone could reconstruct the 1918 flu genome. So we need to share as much as we know about the flu genome so that if that happens, we can intervene.

JT: Right. I was, in fact, talking earlier tonight with someone who’s involved in synthetic biology and you get into this question if those sequences are available and now you have commercial services where you basically send a sequence and get back DNA, that you’re getting into the realm where it gets to be kind of easy to do that kind of stuff or easier.

JW: Yeah. And I mean I draw these statements from my conversations with Drew Endy, in particular, in the synthetic bio community which is that that information is available. You can get — it’s like 40 cents a letter at mrgene.com, I think, to sequence genes now, to synthesize them. You can get it done all over the world. You can get it done in Pakistan, you know? What we have to be is ready to deal with it. And that requires the information sharing being robust enough that if something happens, we can rapidly identify it, understand the risk, and intervene against it.

JT: All right. Back to your questions. What are the challenges ahead for Science Commons? And what do you think are the most significant short-term impacts the project will make?

JW: I mean the biggest challenge, I think — I mean there’s a constant challenge of funding because funding non-profits is always hard. In this economy, it’s almost lethally hard. But beyond that, I think the biggest challenge is what we started with which is that the existing systems for science are pretty robust against disruption. And working against that, trying to get to a point where scientists see the value of sharing and indeed, believe that they can out compete other scientists if they share. But that’s the biggest challenge because you’ve got to do so many things simultaneously. You’ve got to deal with legal problems, both contract and intellectual property problems. You’ve got to deal with incentive problems. You’ve got to deal with workload and labor problems. You’ve got to deal with the Guild culture and the Guild communication systems, all of that at once. And that’s really hard. So getting through this collective action problem is probably the hardest thing we’ve had to do from the beginning and will continue to be the hardest thing we have to do.

In terms of in the short-term, at ETech, we’re going to announce a pretty major partnership around some of our technology work with a major technology company. So we’re going to issue a big press release and all of that, so I won’t get into detail. But we’re going to be announcing the integration of our open source data integration project which we call the NeuroCommons, which we hope will become a major cornerstone of the Semantic Web and integration of that with one of the world’s largest software companies. And that’s going to be, I think, a big first step. We’re also looking to get a collection of major biological materials available for the first time under something that looks an awful lot like a one-click system. So that no matter who you are or where you work, you’ll be able to order the kinds of materials that were previously only available to members of sort of elite social networks. And that’s also going to be coming out in the coming weeks.

JT: So I’ll be able to get like Amazon Prime to deliver my restriction enzymes for me?

JW: Exactly. Exactly. Under a Science Commons contract. Under something that looks an awful lot like a Creative Commons contract. And then today, we announced a project with Best Buy and Nike around how to share patents and recreate the research exemption in the United States. And we expect that to be a pretty big deal this year as well. The last thing is that in the coming weeks, we expect — and I don’t know if this will happen before ETech or right around ETech — we expect one of the world’s largest pharmaceutical companies to make a dedication of hundreds of millions of dollars worth of data to the public domain.

Now, we can’t take credit for that. They made this decision before us. But we are going to help them do it. And I think that’s going to be something that is as big as IBM embracing Linux in the late 90s in terms of really building an open biology culture.

JT: So beyond the announcement you just mentioned, you’re going to be speaking at the Emerging Technology Conference on your work. Can you give us a feel for what you’re going to be talking about?

JW: Sure. I mean I’m going to go through a lot of the stuff that we’ve talked about here. I’m going to go through the ways we think the system is resistant to disruption, the reasons why a digital commons is a good tonic for that problem and some experience from the road. We’re going to talk about what it’s been like to go out and actually try to build the fundamental infrastructure for a Science Commons and give some examples of our work and then make that big announcement about the partnership with the technology company.

JT: Well, I’ve been speaking today with John Wilbanks who is the Vice President for Science at Creative Commons in charge of the Science Commons project. He’ll be speaking at the Emerging Technology Conference in March. Thank you for taking the time to talk to us.

JW: Thanks a lot.
