Where 2.0 Preview: Eric Gunderson of Development Seed on the Promise of Open Data

When we think about how government uses geographic information, we tend to think about USGS maps or census data, very centralized and preplanned projects meant to produce a very specific set of products. But Development Seed believes that there is a lot more that could be done if these types of data could be mashed up easily with each other as well as with alternate sources such as social networks. Eric Gunderson, President of Development Seed, will be speaking at the O’Reilly Where 2.0 Conference in June, and he recently took some time to speak to us about the potential benefits that open access to government data brings.

James Turner: Can you start by talking a bit about Development Seed and how you came to be involved with it?

EG: We’re a strategy organization in Washington, D.C., and what sets us apart from a lot of other strategy organizations in town is the fact that we do a lot of the building. And we build [it] all on open source tools. We particularly work with international development organizations, and the knowledge silos there are pretty fierce. For the last couple of years, we’ve worked on a lot of projects where you have really good data and bad technology’s slowing it down. So we work on a host of projects, whether they’re internal intranets or external mapping sites.

JT: If we focus, first of all, on our government, what are the problems with how the government manages data today?

EG: Right. Well, first, a lot of times it’s not even released. I mean people aren’t putting it out there in any kind of way where we can access it. But even when it is, for example, when there’s a mandate for an agency to report on food prices or a certain statistic, sometimes it’s baked into PDFs. And it’s put out in a way that you can’t really do much with it, you know, interact with it, parse it out, discover what’s there. That said, that’s starting to change. I mean there have been some folks saying, “Wait a minute. We’ve already collected this data, and if we spend a little extra time packaging it, we can put it out there. And it will essentially have a whole new lifecycle and start adding value back to the community, the taxpayers that paid for it.”

JT: Right. I know that traditionally one of the things that will happen is if you ask the government for a given data set, you’ll get like a stack of paper.

EG: No, I’m used to the FOIA [Freedom of Information Act]. Done right, hopefully this will reduce a lot of FOIAs. I mean a lot of this right now is really a discovery phase and a very expensive discovery phase for the government. If different agencies are saying, “Hey, listen, here’s the data we have on-hand,” and guide users to it on a spot online, I think they’ll actually reduce a lot of their workflow and workload and cost. We’ll get access to a lot more information, and if it’s put out in the right way, we’ll get access to a lot more information in a very timely manner. I mean, I certainly have some more dreamlike scenarios of how the data can come out through nice syndicated feeds and I can get updates of when new data’s posted. But hey, honestly, I’ll take CSVs. I can parse those. I mean that’s what we did with the D.C. government data.

JT: One of the projects Development Seed worked on was Stumblesafely, which mashed-up bar locations with crime data to help D.C. area bar goers pick the safest places to drink. Can you talk a little bit about the process of building that project and some of the lessons you learned?

EG: Yeah. That was kind of fun. So it’s stumblesafely.com. We were working late one night, and it was around the time Apps for Democracy was playing out in the city, which was back in the October/November timeframe. And the whole idea here, this is when Vivek Kundra was still CTO of D.C. And he was like, “Listen, we got this data.” He put the data out online, and nobody was really doing anything with it. And he was trying to make an assessment of what data to focus on opening up in 2009. So he and iStrategyLabs came up with an idea to run a competition around the data. It was kind of like this Iron Chef competition, like, “Hey, we have these datasets; what can you build with them and hopefully draw some attention to the good work that D.C. government put into structuring their data, and also the creativity in the D.C. tech community.” So we were confined by what data the D.C. government had put out. But looking through it, it had some really cool things, like every morning they update crime data. And we were able to grab the geo locations of the data and could filter and parse it based on what type of crime. And with stumblesafely, we looked at assaults and robberies; those are the Xs and Os on there. And we were even able to parse out when it happened. They would report based on what shift it was. So we could then filter based on the time of day that you preferred to drink, whether it’s daytime, evening or night. And so every morning, we’d go out with an aggregator and suck in the CSV data, unzip it and parse it out and put it live on the site.

What’s also really cool is how custom the mapping layers are. D.C. has also exposed a lot of shape files. I mean, we were able to grab building footprints, rivers, parks, and also the alcohol and beverage control license. We turned the shape file for the ABC license into a heat map. So we were able to make the city look like a city at night and show the glow happening in the bar district. So those were the datasets we had. But the way we were able to actually make this all work is that we took that data and built on open source tools, specifically Drupal and Mapnik. Drupal’s an open source content management framework, like PHP, MySQL. And Mapnik’s a C++ mapping server with good Python bindings. So we were able to pull together the site very quickly for the competition. And we were really just having fun. So by taking some ingredients like open data and some good open source tools, we were able to design a really dynamic site around that data, and that had some surprising results honestly. It turned out when we were down meeting with Mayor Fenty, we heard from the city administrator that the cops were actually looking at it, you know? Yeah. I mean of course they were, right? This is what data visualization’s about. I’m not sure the government ever would’ve paid to have a bar site made, but it’s nice to know it happened. And it’s a good example of the positive externalities from opening up the data.
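
[Editor’s sketch: the daily ingest described above (fetch the zipped CSV, unzip it, parse it, filter by crime type and shift) might look roughly like the following in Python. The column names (OFFENSE, SHIFT, LATITUDE, LONGITUDE), the offense and shift labels, and the sample rows are all assumptions made up for illustration; the interview does not describe the actual D.C. feed’s schema, and Stumblesafely itself was built on Drupal rather than this code.]

```python
import csv
import io
import zipfile

# Hypothetical labels: the real feed's offense names are not given in the interview.
OFFENSES_OF_INTEREST = {"ASSAULT", "ROBBERY"}


def read_incidents(zip_bytes, member_name):
    """Unzip one CSV member from an in-memory archive and yield its rows as dicts."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        with archive.open(member_name) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            yield from csv.DictReader(text)


def filter_incidents(rows, shift=None):
    """Keep assaults and robberies, optionally restricted to one reporting shift."""
    kept = []
    for row in rows:
        offense = row["OFFENSE"].strip().upper()
        row_shift = row["SHIFT"].strip().upper()
        if offense not in OFFENSES_OF_INTEREST:
            continue
        if shift is not None and row_shift != shift.upper():
            continue
        kept.append({
            "offense": offense,
            "shift": row_shift,
            "lat": float(row["LATITUDE"]),
            "lon": float(row["LONGITUDE"]),
        })
    return kept


def build_sample_zip():
    """Create a tiny made-up archive so the sketch runs without a network call."""
    sample = (
        "OFFENSE,SHIFT,LATITUDE,LONGITUDE\n"
        "ROBBERY,EVENING,38.9170,-77.0410\n"
        "THEFT,DAY,38.9050,-77.0360\n"
        "ASSAULT,MIDNIGHT,38.9215,-77.0425\n"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("crime.csv", sample)
    return buffer.getvalue()


if __name__ == "__main__":
    rows = read_incidents(build_sample_zip(), "crime.csv")
    for incident in filter_incidents(rows, shift="evening"):
        print(incident)
```

[In a real deployment the bytes passed to read_incidents would come from the city’s published download rather than build_sample_zip, and the filtered rows would be handed to whatever draws the map.]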

JT: I think the most depressing thing looking at that map is the realization that there is no safe place to drink in D.C.

EG: [Laughter] That’s a detail that the data certainly doesn’t hide.

JT: It’s a very even distribution.

EG: That said, I certainly do drink in that area. They’re actually the bars around the office. So it’s also interesting to see how crime can really, really be highlighted where, at the same time, I’ve never experienced it in that area over the time-frame that it’s being mapped. So does it give me a little more pause? I’m not sure.

JT: Now, Kundra has moved on to the federal government where he’s going to be the United States CTO in some sense. Does this give you hope that he’s going to bring those policies with him?

EG: I mean absolutely. I hope they’re — forget the Apps for Democracy stuff. I mean, these are just ways for local governments to engage local developers and spread the word around data. I think there’s a major national craving for data, and people know what kind of data they want from the government. And I hear rumors about a data.gov site that could be a pretty powerful index of what’s out there. That said, I think it’s going to be really hard to do. It’s more than technology; there’s a certain culture that’s going to take a little while. Like, the government’s going to need a phase zero of starting to think about what data should go online and the workflow process to get it there. So, I mean, he’s going to have an interesting first year here.

JT: Tools like Google Earth have seen heavy use as a way for civilians and users to display datasets in a geographic fashion, but I really haven’t seen many KML files that were government generated at all. Is there a reason they’ve shied away from it?

EG: Right, that’s a great question. I wonder how they’re making their KML files. I think the lack of tools is making the government’s KML file generation fairly time-intensive hand generation. I mean I’m not sure how they’re being made. But I agree with you; I’m not seeing a lot of them either. I would love to see a lot more datasets with KML, CSV, RDF exports. And once you build the process to open the data, the format part’s going to be easy. So I don’t think there needs to be a whole ton of meetings around, “Hey, let’s get this in RDF or JSON or CSV.” I think we can take anything. But I do think that if it is in KML, it really lowers the barrier to entry for end users. So you really could see a lot more folks that aren’t Drupal, Mapnik programmers per se but that can just pop up Google Earth, grab the KML, grab another KML and start doing their own mash-ups. That’s going to be exciting.
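
[Editor’s sketch: to illustrate why “the format part” can be easy once the data is open, here is one way a few already-parsed rows could be exported as KML using only Python’s standard library. The function name rows_to_kml and the sample point are hypothetical; this is not a description of how any agency actually produces its KML files.]

```python
import xml.etree.ElementTree as ET

KML_NS = "http://www.opengis.net/kml/2.2"  # namespace defined by the KML 2.2 standard


def rows_to_kml(rows, document_name):
    """Return a KML document string with one Placemark per (name, lat, lon) row."""
    # Registering an empty prefix makes ElementTree emit a default xmlns attribute.
    ET.register_namespace("", KML_NS)
    kml = ET.Element(f"{{{KML_NS}}}kml")
    document = ET.SubElement(kml, f"{{{KML_NS}}}Document")
    ET.SubElement(document, f"{{{KML_NS}}}name").text = document_name
    for name, lat, lon in rows:
        placemark = ET.SubElement(document, f"{{{KML_NS}}}Placemark")
        ET.SubElement(placemark, f"{{{KML_NS}}}name").text = name
        point = ET.SubElement(placemark, f"{{{KML_NS}}}Point")
        # KML orders coordinates as longitude,latitude (altitude is optional).
        ET.SubElement(point, f"{{{KML_NS}}}coordinates").text = f"{lon},{lat}"
    return ET.tostring(kml, encoding="unicode")


if __name__ == "__main__":
    # Made-up sample point, not real data.
    print(rows_to_kml([("Sample location", 38.917, -77.041)], "Demo export"))
```

[The resulting string can be saved with a .kml extension and opened in Google Earth alongside other KML layers.]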

JT: I have to say that, looking at the NEXRAD data that’s recently become available from the National Weather Service, watching weather in Google Earth is really a different way to look at it.

EG: Yes. It’s cool. I think with a lot of other data, it’s going to start changing that experience. We do a lot of — because we work in international development, some of the organizations we work with are working crisis scenarios. So, for example, look at what’s happening right now with some flooding in Fargo, right? I mean can you start putting news over weather and start seeing what’s happening. Some really creative discoveries can start emerging there. I think we’re going to start experiencing information in a new way once the government opens up the data.

JT: Part of the argument that government agencies make about making their data too freely available is that it can lead to the inadvertent release of security-sensitive information or the ability to infer information by correlating several different datasets. How legitimate do you think this concern is?

EG: I certainly think the government should be careful about releasing personal identifiable information. And that needs to happen whether they’re opening up data or not. Personally, I’ve also heard this whole conversation about watermarking data. A lot of this sounds like excuses to me. And if that’s the case, let’s let some of that data be some of the last data released then. But, honestly, there’s a tremendous amount of work to be done that has nothing to do with this. And it really seems like a conversation point that keeps coming up in meetings when, quite honestly, it doesn’t affect a lot of the data that we really want to open.

JT: This is my official GIS geek question here. As someone who’s tried to do simple mash-ups between different GIS datasets, I know that part of the problem, especially for the more casual user who may not do GIS on a day-to-day basis, is that everybody uses their own projection. Some use UTM. Some use WGS84. I’ve seen NAD83. Is there anybody doing anything about this?

EG: That’s a good question. I was just up at Harvard Humanitarian Action Summit like four days ago and folks were asking about that, also. Our clients actually ask for different projections. It’s very political. And that said, I think there’s been a de facto standard set by Google using a map projection. And I mean we’re using Mercator projection on everything. It does make for some weird distortions, but — yeah.

JT: Are they using — what is it? They use WGS84?

EG: Yeah.
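
[Editor’s sketch: the relationship between WGS84 latitude/longitude and a Mercator map, and the “weird distortions” mentioned above, can be shown with a short calculation. This assumes the spherical Mercator variant commonly used by web maps, with the WGS84 semi-major axis as the sphere’s radius; the interview itself says only “Mercator” and “WGS84,” so the specific variant is an assumption.]

```python
import math

EARTH_RADIUS_M = 6378137.0  # WGS84 semi-major axis, used here as the sphere radius


def wgs84_to_web_mercator(lat_deg, lon_deg):
    """Project WGS84 degrees to spherical Mercator metres (x east, y north)."""
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4 + math.radians(lat_deg) / 2))
    return x, y


def scale_factor(lat_deg):
    """Linear stretch of the projection at a latitude (1.0 at the equator)."""
    return 1.0 / math.cos(math.radians(lat_deg))


if __name__ == "__main__":
    # Approximate coordinates for Washington, D.C., used only as an example.
    print(wgs84_to_web_mercator(38.9, -77.04))
    for latitude in (0, 30, 60):
        print(latitude, scale_factor(latitude))
```

[The scale factor grows with latitude, which is why features far from the equator look larger on a Mercator map than they are.]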

JT: And that makes sense because that’s also what GPS uses. And given that’s what people on the ground are using to find position, it kind of makes sense. You not only work with government; you also, as you mentioned, work with organizations. Can you tell me what kind of benefits organizations are seeing using these kind of hybrid mash-ups?

EG: The big part’s actually, I think organizations are really excited about web mapping servers and their ability to actually control the shape files. I mean if you look at the main web-based map options like Google, Microsoft’s Earth or Yahoo, there’s a lot of crap on there. If you’re doing data visualization, you’re doing it to tell a particular story. You really want to be able to control some of the underlying metadata. There’s really a demand emerging for very custom maps, especially, I mean, if you’re trying to influence the world, what better way than to choose the projection through which people see the world?

JT: You’re going to be speaking at Where 2.0 in June. Can you give us a feeling for what your talk’s going to be about?

EG: I’m going to do a quick overview of how D.C. released the data and what it was like going after just raw data and where some of the pain points were; how other governments could better structure their data from a developer perspective and also, just share some fun stories about what open source tools are out there. My end goal is to have folks that have data walk out of the meeting with a better understanding of how to open their data up, and folks that want to be building sites with open data to know what tools are out there.