# Making Government Transparent Using R

## Danese Cooper thinks it will be an important tool in Open Gov

With Open Source now considered an accepted part of the software industry, some people are starting to wonder if we can’t bring the same degree of openness and innovation into government. Danese Cooper, who is actively involved in the open source community through her work with the Open Source Initiative and Apache, as well as working as an R wonk for Revolution Computing, would love to see the government become more open. Part of that openness is being able to access and interpret the mass of data that the government collects, something Cooper thinks R would be a great tool for. She’ll be talking about R and Open Government at OSCON, the O’Reilly Open Source Convention.

James Turner: Why don’t you start by describing where you came from, what you’re involved in, and what your interests are?

Danese Cooper: Okay. I’m Danese Cooper. I serve on the board of the Open Source Initiative. I have been serving for the last eight years. And I’m also currently employed by Revolution Computing, which is a start-up focusing on an open source language called R, as in the letter R, that is very useful for analytics and statistical analysis. I’m also an Apache member. And I also serve on an advisory board for Mozilla.

James Turner: One of the two panels you’re going to be speaking on at OSCON is on open source and open government. If you could talk a little bit about what interests you about open government and also what open government means to you.

Danese Cooper: Sure. Well, along with a lot of open source people, I got interested in the Obama campaign and in helping President Obama get elected. And part of why he was so compelling was that the vision of how Washington needed to change is pretty close to the way that we think about working collaboratively in open source. The night that he was elected, there was a great little clip on CNET of a Republican commentator actually explaining open source as exactly what I just said. It was a really brilliant little two-minute clip. He pointed at The Cathedral and the Bazaar, that canonical document about how open source works. And he said, “Microsoft is the cathedral. It’s their way or the highway. And the bazaar is a bunch of people working together grassroots to collaboratively build the things that they need. And so Obama’s basically asking for the government to become open source, and the problem is Washington isn’t really like that right now.”

So anyway, that’s the transformation that has to happen in order for government to really be transparent. To me, open source government is transparent government. There’s been an awful lot of shenanigans in recent political history, like the last decade has been pretty crazy in terms of things happening that couldn’t be traced back to any source. Even just the way we vote and the way that voting is managed, and the fact that the software that runs the machines that we vote on is not open source so it can’t be inspected. And nobody knows quite what it does. There are all of these stories of weird updates to the software that happened right before major elections in states where there are strange results. Transparency, in the same way that it helped the software industry transform, could really help the government transform. So that’s what I’m talking about. There’s a bunch of other people on that panel. My good friend, Brian Behlendorf, and I co-proposed it. And he’s actually taken the next step. He helped found Apache. And he’s run off to Washington to work on projects that are interesting to the Obama government to try to figure out how to help them to more open source solutions. And he’ll be talking about his progress on that panel. So I think it’s a pretty exciting panel.

James Turner: There seems to be this process when people go to Washington, that the most idealistic and pure motivated people in the world spend six months there, and they’ve turned into Gollum or something. It seems very hard to keep your ethics or if not your ethics, then your vision, and get anything done in that town.

Danese Cooper: Yeah. It’s been interesting to watch him [Brian] work hard to try to get something done that he can talk about. And as recently as last week, he wasn’t sure he was going to have anything that he could publicly talk about. And then they finally got through something. He called me up very excited and said, “Okay. Okay. Put my name back on the panel.” So I would say, in Brian’s case, I’d be very surprised if he compromises his actual vision. He’s a very persistent boy, chipping away at the big machine. And I think there are a lot of people in Washington right now trying to chip away at it. I think it’d be good if they all worked together. I’d like to see a little more community built around the people who are trying to chip away. I imagine that the status quo in Washington is feeling pretty embattled. And that’s probably making it very hard or making it as hard as they can for the change to happen. Which we saw in open source as well. I mean, people tried to fire me pretty much every year for the first five years I worked for Sun. And then in the sixth year, they all turned around and went, “Oh.” So it’s not a quick change. It happens slowly.

James Turner: There was a lot of hope and promise pinned on to President Obama’s shoulders when he took office. The open source community and the privacy community and the freedom community all had their vision of what he was going to do. Obviously, any presidency is going to be a mixed bag, partially for the pragmatics of the office and partially for factors we may not know all the story on. But there have been a number of decisions and policies over the last six months that haven’t seemed to really hold up to that vision of transparency. How do you see it six months in?

Danese Cooper: I agree there have been some decisions that were troubling. And particularly, the decision to retain the abrogation of our civil rights as envisioned by the Bush administration in the name of national security. I would like to see that decision rethought, but it is possible that the last time a president really did a lot of sweeping change of the type that he was talking about, the country was in such bad shape that people weren’t complaining too much about choices that that president made to override checks and balances. This is FDR I’m thinking of.

I always believed that Obama was going to be a centrist. Because he’s the first black president, he almost has to show up for everyone in America and not just the progressives. But I also think that his long-term vision is much more the kind of America that I’d like to live in than any of the other options that we’ve had recently. So I’m still pretty much behind him. There’s things that we can’t actually know the ripple-effect of, simple things like Michelle Obama planting an organic garden in the White House. That’s a big deal. I know people around the world that have started thinking about growing food on their land instead of ornamental things because of the highlight that that gives to gardening. And then she went the extra mile and said it was going to be an organic garden, not just a garden, right? So we won’t know what the overall effects of this first year or two years are going to be or even the first four years are going to be.

James Turner: I actually have a theory that the reason that the Democratic party is so anxious for Al Franken to get to Washington is not for the 60th vote in the Senate, but because anyone will then look less liberal compared to Al Franken.

[Laughter]

Danese Cooper: Yeah, he’ll be on the edge. That’s true. That’s true. Anyway, so we’re really happy that O’Reilly is so interested in seeing transparency come to government. When Tim and Jeff Bezos went out to Washington to look at the patent system, what was that eight years ago now? They were so discouraged that there wasn’t going to be change in our lifetimes because they didn’t see working computers on any decision-maker’s desk, right? I think that’s changing in this administration just because they’re bringing in a bunch of new people. And interestingly enough, the long-term effects of those kinds of changes in having people like Tim O’Reilly show up and do conferences like Gov 2.0 and also Micah Sifry’s Personal Democracy Forum, those are great, great ways to help people focus and keep track of what’s important because in this country, we get real excited in an election year, and then we kind of move on to the next thing until the next election cycle. So it’s important to keep that focus.

James Turner: One of the questions that you get into when you talk about transparency in government is the fact that at the end of the day, there are always going to be things that can’t be transparent. And that there are actually legitimate times when they say it’s a matter of national security and they mean it. How do you ride that rail and be able to trust them when they say, “We really can’t tell you”?

Danese Cooper: Yeah, that’s a tough one. I think that anybody who leaves this country, and if you want to do a favor for this country, go on vacation somewhere outside of it. Seriously. There was a great campaign here in San Francisco last year, Lonely Planet Guides are written here. And on all of the buses, there was this big banner campaign that said, “Do something for your country” in big letters. And then in little letters it said, “Leave.” Right? And I actually think that’s true.

If you leave this country, you experience how living in a fear-free media feels. There’s an awful lot of control through fear going on still in this country, even under the Obama government. Although the media’s not as bad as it was during Bush. But we’re not actually in the embattled Citadel situation that they want us to be in. That justifies the wars that they want us to fund. If you go to other countries, you don’t hear — I mean you hear about stuff that’s happening, but it doesn’t have that edge of “They’re coming to get us.” You know what I mean? So anyway, I’m not a big fan of the way that the current security administration, or any previous security administration in my recent history, has been run. I’m sure my security wonk friends would tell me I’m being naïve. But, oh my goodness, the other thing about leaving the country is going through airports, right? So any time you have to encounter the TSA, you can experience exactly how absurd security theater can become, right?

James Turner: So switching gears, the other thing you’re talking about and a big part of your professional life is the R language. Now I will confess that like Erlang, R is something that is on my radar and I see and I look at it and I say, “Okay. When am I ever going to use it?” I mean Erlang is used some places, but R I guess has a very nichey type of audience, doesn’t it?

Danese Cooper: You know, interestingly enough that’s changing. I think that’s been true. R has been in production or in development, let’s say, for the last 20 years. It is patterned after the S language, which was developed in the ’60s at Bell Labs around the same time that UNIX and C were being developed. And it was S for statistics, right? R is sort of a, “If we had known then what we know now” version of S. They’ve been working on it for 20 years in an academic setting. So it has been very slow to grow. But just in the last couple of years, it’s really gotten to a place where it’s ready for enterprise use. And just this year, the people that maintain S, a company called SAS, S-a-s, in South America, south of this country, have announced that they’re going to have to support R, like it’s that widely used now, particularly in schools. Every CS department, every statistics department uses R because it’s free and there are 200 books written about how to teach statistics in R. And so it’s become kind of a common lingua franca for people who are using analytic tools and computers.

We know that there’s little bits of R in Wolfram Alpha. We know that R gets used all the time like by the New York Times, by people who show quantitative data in the popular media, right? And I was quite surprised to discover recently in the predictive analytics world that Google, LinkedIn, and Facebook all use R to do really exotic things like predict user behavior. So I think Hal Varian said earlier this year, or maybe it was late last year, that statistician is actually the new sexy job in the tech world. Being able to deal with data and make quantitative analysis of it and come to reasonable conclusions that guide business decisions is actually going to be a big deal. And so anyway, we think that R is poised to be much more widely used. As the title of quant, the statistician title, becomes a more important job title in the tech world, of course, there will be more quants, and they will be using the tool more. But, also, we think the alpha geeks will start to use it.

We know how to run our product on the cloud in EC2 and other cloud-related things. We’re seeing mash-ups where there was recently a paper where somebody was mashing R with Hadoop. So we’re starting to see some interesting and unexpected consequences from the fact that it’s open source and freely available. And that’s pretty much why I got involved. Looking at, well, back to the question of open government, Vivek Kundra, who is kind of my current open source government hero — I don’t know if you know about him. But when he was the CIO of the District of Columbia, he opened 217 data streams. So just routine data that was collected in the District by various city services. So everything except where the cops are got opened up, including where the cops have just been; you know, the crime scene thing.

James Turner: Right. Right. We had an interview earlier this year with somebody who mashed up that data to show you the safe bars to go to.

Danese Cooper: Right, right, right. The StumbleHome thing? So just the fact that they were able to use that by opening that data and creating the mash-up opportunity, they got this amazing set of applications that were totally useful. And some of them, not that one, but there were a couple of others that they had earmarked real money to go have built. They were able to redirect that money back to education because they didn’t need to spend it on building the thing anymore. So we are starting to see services that offer analytics as a web service along with everything else that you can possibly apply to a data stream. And as more and more data becomes open, they’re going to need world class tools to analyze that data. And right now, R is the only open source tool.

James Turner: Just to put a little salt on the tail of this thing because I’m, again, trying to understand where it fits in, I’m familiar because, God help me, I have a grad school wife, with SPSS which is kind of the traditional statistical analysis package. And I’m also familiar with analytics packages that go up against OLAP cubes and data warehouses and those types of things. How is this different from those kinds of tools?

Danese Cooper: Well, what SPSS is is S with a graphic interface. But it was written 15 years ago. So it’s an application; it’s not a service that you get over the web. So it’s elderly in that way. It also costs $10,000 a seat, $10,000. A legal seat of SAS costs $35,000. So when open source comes into the picture, when there’s that much mark-up on software that was written a long time ago where it’s pretty much just a profit center, you discover that some of the money gets squeezed out of the market pretty quickly. It generally tends to level the pricing a little bit. And what has to happen is the open source tool has to be good enough, right? This is what happened with Linux. Linux was a toy until, all of a sudden, it was good enough to run financial services houses that didn’t want to pay the premium for Solaris anymore. And we’re right at that inflection point with R where it’s just finally gotten good enough that it can go up against some of these well established, expensive tools. Does that make sense?

James Turner: Sure. Do you have to be a statistician to know how to use it?

Danese Cooper: Well, you do right now, but that’s one of the things that we’re working on is how to make it more available to more people because it’s a pretty powerful tool. The tricky bit seems to be how to formulate your question in such a way that you can put the right data in the right part of the equation to get the right output, right? But I’m continually surprised by people who are using R — since I started caring about it, I have these casual conversations with my friends. My good friend, Ben Laurie, who’s a security worker at Apache and at Google, uses it for his hobby of creating fantasy knots. He’s into this “what if a protein could be bent the following ways to create an interesting shape.” And he uses a graph that he produced using R to look for anomalies in the shape equation that tell him where there might be an interesting knot to look at.

I mean, how cool is that? He said, “What are you working on now?” I said, “R.” He said, “Oh.” And he pulled down a book. I’m starting to see it in some very interesting places. So, for me, the beauty of open source, what open source did and what transparency will do for government, hopefully, is in the software industry, the producers of software had basically lost touch with their commercial customers. They started telling their customers what they wanted instead of asking them what they wanted. “No, no, no. You don’t want this. You want this other thing that we just wrote that we’re going to charge you a lot of money for.” And that disconnect is what allowed open source to come in the way it did. The developers who wrote open source software were writing to fit exactly the need that they wanted to fit. And they happened to be a customer base. So in the first approximation, it was a customer revolt involving people writing their own stuff, the customers writing their own stuff. But it got to a point, an inflection point, where it was good enough that people that weren’t capable of writing it could actually make use of it. And then we see companies like Canonical coming in and putting user interfaces on the top of Linux that make it usable by my mom, right?

So that’s going to happen to R as well. And probably to a lot of other open source stuff. And in the same way, we’re going to see interfaces, if we do it properly, for the general public in America to have better touch points with the government to give better feedback that isn’t money-driven. So that the real things that need to happen or the things that people really actually care about that aren’t necessarily tied to somebody’s pocket but are much more directly needs-based will emerge. Those are the kinds of things that need to happen. So anyway, I hope that tied it all together.

James Turner: It was deftly done.

Danese Cooper: Hey.

[Laughter]

James Turner: I have one other question just about the whole SPSS thing because, again, I’ve gotten more than my fill of it over the last year. We actually tried to swap out PSPP for it, which is the open source equivalent.

Danese Cooper: Yeah.

James Turner: And you find very quickly that unless it is an absolute feature-for-feature, bug-for-bug equivalent to SPSS, it can’t be used. They’ve got a real lock hold, don’t they?

Danese Cooper: Well, you don’t do query-for-query, right? I mean, people used to say this about Windows. When we first did Open Office, I don’t know if you know that I was involved in creating the Open Office Project at Sun. And we did it for competitive reasons, because we were suing Microsoft and we knew that it was going to hurt them. But I was also interested in doing it because it created an opportunity for the world to have access to productivity software in their own language. Because we made it localizable. And so a group of people in Romania, for instance, could get together and spend a weekend and produce a version of Open Office that was in Romanian so that their moms didn’t have to learn English in order to learn how to use a computer.

We thought that was really, really going to be a powerful thing. So the interesting thing is for a long time there, we would get notes from people who would say, “I really want to use Open Office, but it’s going to make me less productive because I know how to use Office.” And we’d get things from bosses who would say, “We want to switch the office over, but our secretaries are telling us that they won’t use it because they won’t be able to get their work done.” And we came up with — there was a forum about how to get your company to switch. And I always thought, “Gee, offer them the bonus of the cost of the legal license of Microsoft for one year and see if they aren’t interested in switching at that point.” And a surprising number of them were. And we found even some pretty sophisticated users like the French Foreign Ministry of Finance exclusively uses Open Office, even though there is actually Office in French, and they could use it. They chose to do the other thing. I just saw that the London Stock Exchange is switching off of Microsoft as well. So around the world, there have been people who did the analysis.

This is going to happen as well with SPSS and R. There’s a little bit of a switching cost, but for every person who has learned how to use SPSS in the last five years, there is a new pledge class of kids that have just been taught statistics using R. And it’s really surprising how overwhelmingly deeply taught it is. It depends on the department that you sit in. We’re finding, for instance, at Stanford, the poly sci guys use SPSS, but the CS people use R, right? So it’ll be interesting to see what happens. But you’re right. You can never do a feature-for-feature swap. And I don’t think we’d want to get in that business. It’s a tough business to be in. And it’s better to leap-frog past it to the next thing that everybody wants to do. You know what I’m saying?

James Turner: Right. We got as far with her professor as to get him to look at it, but then we hit ANCOVA or something like that and PSPP wouldn’t do it. And that was the end of the game.

Danese Cooper: Yeah. Well, try again because that’s a process, right? It took a really, really, really long time to make the switches that we’ve made in fundamental software companies. It took a long time for them to understand why they would want to make the switch. There’s still lots of companies where they talk about open source, but they actually don’t do the engineering in an open source way because they can’t quite get their heads around that process switch. And then we see little start-up companies that work completely agile, completely open and are sort of cleaning up in terms of user base I mean. So if user base is the new currency, then it’s a different race than it was before.

James Turner: So to finish up, you are going to be presenting two different panels at OSCON. I was wondering if there’s anything you’re particularly looking forward to seeing at OSCON?

Danese Cooper: Well, I always want to see Tim’s keynote, always, always. It’s always interesting to see what Tim’s thinking about. I saw him doing Ignite last month in Sebastopol and it was the best talk I’ve seen him do in a while. Let’s see. What else? I don’t care as much about the rest of the industry keynotes, the big industry keynotes. It’ll be interesting to see, I guess, if they have anything new to say. But I’ll be surprised. Deborah Bryant is also doing a panel about open source in government, and she has a different group of people than I do coming to her panel. She’s talking about people who are actually employed by the government now who are making small shifts or large shifts or incremental shifts. And I’ve got people who are trying to push the football further by taking a longer view, you know what I’m saying? So I’m going to be interested to go to her panel. There’s another panel about R or another talk about R that isn’t the one that we’re doing. That’ll be interesting.

I think Beautiful Data, the new book, is about to come out. And it’d be interesting to see stuff that relates to that. Along with an interest in transparency in government, the O’Reilly folks have a newfound interest in data, I noticed. Open data, the beauty of data, where the value is in data. Tim figured out a couple years ago that the value that open source squeezed out of the software stack just went to data, right? So he’s interested in the open data definition. We’re going to talk about that in the OSI meeting that we’ll have at OSCON. So I think there’s a lot of interesting stuff going on right now. I keep thinking open source is going to get boring and then another year rolls around and there’s still plenty of stuff to talk about.

James Turner: Well, thank you so much for taking the time to talk to us. And we look forward to seeing you at OSCON.

Danese Cooper: Hey, sure. Thanks. It was really fun.

• Falafulu Fisi

James, I think that R is an excellent tool, but I am not sure why there is a need to push R for government use when there are superior commercial tools that government agencies can use, for instance, Matlab. See, the National Institute of Standards & Technology (NIST) still custom-develops its software using Fortran, C, Java, etc., while Sandia National Labs are heavy users of Fortran, Matlab, and C. They have also made some of their software tools available to the public for free download.

So, while I would like R to be universally adopted by government agencies, R is still not broad enough (numerical-algorithm-wise) to meet the requirements of top researchers at various government agencies. This means that they will have to do some re-inventing, since certain established numerical libraries are still missing from R. E.g., R still has neither a robust DSP (digital signal processing) API nor a control systems API available, which are cornerstones for some government-agency researchers. Of course there are some bits of DSP in R here and there, but not robust, full functionality. In Matlab and other commercial tools, by comparison, the standard and advanced APIs in those domains have been available commercially for more than a decade, so they are stable and robust. Using Matlab or similar commercial tools will save researchers from having to re-invent certain numerical algorithms, since they already come pre-packaged in Matlab.

I think that R would be good for government agencies that just need it for data analysis, where little or no algorithm development is involved.

• wen

One government agency that could support R extensively might be the FDA. People used to choose SAS over R on the grounds that the FDA requires SAS in evaluating studies (myth?). Some standardized R packages or best practices would be helpful. On the other hand, do the big pharma companies care about using open source software?

• Tobias Verbeke

@wen: big pharma companies do care about open source software: some have their production ERP systems running on Linux. As far as R is concerned, I’ve seen 3 from the top 10 use it in validated and compliant environments (related to FDA submissions).

• Ishwor Gurung

@wen, @Tobias: Some of the core developers of R are from pharmaceutical companies (methinks Pfizer etc.). Now who says they aren’t already using R? ;)

- Started using R a couple of months ago. Have to admit, it is superb at what I want it to do: data extraction and modeling, bootstrapping, databases, all the usual stuff.
- Loads of packages are available through CRAN, and I bet the number is growing.

Just as the saying goes that one language cannot satisfy every requirement, the same goes for R. It takes more SNR to actually get things to evolve with R (just as with Python’s PEPs).

Peace out.

• Jay, Kenosha, WI

Danese Cooper’s highly partisan nature, which seems to be shared by all involved in the Open Government campaign, leads me to wonder if there is anything technically valuable in the movement or if it is just another way to spread Hope and Change propaganda. This article taught me very little about R but quite a bit about Danese Cooper’s political views. Was that the intent?

• Toby

@Jay, I don’t see anything “partisan” here at all. The concept under discussion here is transparency in government. What is partisan about that?

Re: Hope/Change/Politics trolling – well, as a non-American, I can tell you that Obama doesn’t Change as much as we Hoped. So get over yourself. And take Mr Cooper’s advice to travel; he is perfectly correct in the “paranoid echo chamber” analysis.

• Andrew

@Toby, without agreeing with Jay’s comment, I think you’re wrong to say that transparency in government is not partisan.

While politicians from many different parties may support it, for transparency to be effective, for it to morph into true openness, it will require a shift from representative democracy to participatory democracy. This is where you will start to see a break down in bi-partisan support for ‘transparency’, because then it starts falling into a contested space about what kind of democracy people ‘should’ live in. Discussions about normative values on one issue inevitably get mixed in with the values people have on other topics, and bi-partisanship is likely to diminish – at the very least.

You might want to look at Christopher Hood’s 2007 article in Public Management Review on ‘Transparency and Blame Avoidance’.