Patrick Collison Puts the Squeeze on Wikipedia

How to Cram the Wikipedia onto an 8GB iPhone

Think about Wikipedia, which some consider the most complete general survey of human knowledge we have at the moment. Now imagine squeezing it down to fit comfortably on an 8GB iPhone. Sound daunting? Well, that’s just what Patrick Collison’s Encyclopedia iPhone application does. App Store purchasers of Collison’s open source application can browse and search the full text of Wikipedia when stuck on a plane or trapped in the middle of nowhere (or what counts as nowhere by AT&T coverage). Collison will be presenting a talk on how he did it at OSCON, O’Reilly’s Open Source Convention, at the end of July, and he spent some time talking to me about it recently.

James Turner: Why don’t you start by talking about your background a bit and how you got involved with working with the Wikipedia?

Patrick Collison: I guess I’ve always been pretty interested in Wikipedia, and I ran my own MediaWiki installations back when I was in school in Ireland. We had our own personal ones and all of the rest. Then in November of 2007, I went to visit my friend in Japan for a month. And in Japan they have all of this incredibly advanced cellular technology and all of the rest. And so because of that, they had very few wireless networks, and my phone didn’t work. As a result, I actually had very little access to the Internet. I sort of realized without Wikipedia how little I really knew. And I had just got an iPhone, so I decided to try basically putting a copy of Wikipedia on the phone, so that I’d have it as I was walking around in Japan. Then basically, I spent a significant fraction of my time there in Japan, again, in 2007 writing those applications, say maybe two or three weeks, just firstly trying to decide if it was possible and putting it all together. And then it was released, I think, January of 2008.

James Turner: Now you’ve also worked on getting it onto the OLPC, I understand. How did that occur?

Patrick Collison: I actually didn’t do much of the work for this. It was actually a project led by Chris Ball who works both with FreeBSD and with the OLPC project. But I released the code to this application; it was open source from the very start. So it was pretty easy for them to take it and to port it to the OLPC. I mean there are already some applications that allowed you to put a copy of Wikipedia on your computer or something like that, but none had really been optimized for embedded or low power devices or anything like that, which obviously Wikipedia for the iPhone had to be. I think it took about two or three weeks to take the code that ran on the iPhone and then to bring it to the point where it’d run on the OLPC.

James Turner: There are obvious benefits to having Wikipedia on the OLPC, because connectivity is very important in some of those areas. So you’d want to have it local, but outside of the experience that you were just describing, isn’t the point of the iPhone that you can just access Wikipedia? What are kind of the advantages of having it locally?

Patrick Collison: I actually find that you spend, or I certainly spend, a surprising amount of my time without access to the internet, even with the iPhone. For a start, say you were abroad; I mean everyone knows the horror stories of the data charges AT&T will hit you with if you’re roaming. But also, personally, I find that on a plane or something you have eight hours to not do much. And so I actually end up doing a lot of my Wikipedia browsing there. But even aside from connectivity issues, it actually turns out to be quite a bit faster to use the built-in, cached Wikipedia application as opposed to the website. I mean you can search in real-time with the application. You just type a couple of characters and tap into your article, rather than firing up Safari or searching for the article in Google, then zooming in so you can tap in, et cetera, et cetera. I and most of the people I know who use the application actually end up using it even when they have internet connectivity. And maybe 20 percent of the time it’s pretty useful because it’s the only choice.
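As an aside for readers wondering how search-as-you-type can stay fast on a phone: one common approach is to keep the article titles sorted and binary-search for the typed prefix. The sketch below is only an illustration of that idea, not a description of how the application actually does it; the function name `prefix_search` and the sample titles are made up for the example.

```python
import bisect

def prefix_search(sorted_titles, prefix, limit=5):
    """Return up to `limit` titles that start with `prefix` (case-sensitive).

    `sorted_titles` must already be sorted; the binary search finds the first
    candidate in O(log n), so each keystroke stays cheap even for millions
    of titles.
    """
    start = bisect.bisect_left(sorted_titles, prefix)
    results = []
    for title in sorted_titles[start:start + limit]:
        if not title.startswith(prefix):
            break  # sorted order means no later title can match either
        results.append(title)
    return results

# Hypothetical sample data, for illustration only.
titles = sorted(["Ireland", "Iran", "iPhone", "Japan", "Java", "Irish Sea"])
print(prefix_search(titles, "Ir"))  # ['Iran', 'Ireland', 'Irish Sea']
```

A real title index would probably also normalize case and accents before comparing, which this sketch leaves out.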

James Turner: Now just as a point of interest, is this an App Store app or do you have to have a jail-broken phone for it?

Patrick Collison: It was released back when only the jail-broken SDK existed. It was in that initial sort of surge of early applications. I guess the first jail-broken iPhone app, I think, happened in August, and so this was released just under six months later. And then when Apple announced the SDK, I actually originally did not intend to port it to the App store, just because I was just working on other things at the time and my company had just been bought and so it seemed like a lot of work. But then over the summer, I started getting a huge amount of email from people who had upgraded to the new version of the iPhone OS, and were now missing Wikipedia. And I started getting 20 or so emails from people per day saying they love this application and they were really missing it. Or even people saying they were continuing to use the old version of the OS just for this application. And they really hoped that I would port it so they could eventually upgrade. After receiving these emails for a while, I eventually felt too bad about not porting it. So I spent a couple of days porting it and then released it in the App Store. I wrote it and finished the port in August. And then it took about three months to wade through Apple’s approval process. Around the end of October, it was released in the App Store.

James Turner: Now most apps that you see in the App Store are relatively small, in the range of a couple of megabytes to tens of megabytes. I understand this is about two gigabytes. Does that make it kind of unique or difficult as an App Store app?

Patrick Collison: Yeah. I mean when I first went to submit it to the store I had done quite a bit of work getting it down to just marginally under two gigabytes, because two gigabytes was Apple’s stated limit. But it actually turned out that Apple’s infrastructure and their software was not able to handle two gigabyte applications or anything even close to it. I don’t know, but a couple hundred megabytes was the cutoff. That three-month approval process included them having to fix bugs and me having to change how the application worked and all the rest just so I could physically get it into the store. And so the way it actually works today is the application itself is extremely small. I mean just a couple of hundred K. And then you download the application. And then when you first run it, it includes its own sort of embedded downloader thing that allows you to download the Wikipedia from within the application. And it allows you to pause and resume the download and all of the rest. So this actually ended up being the only reliable way of making the download work.
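For readers curious what a pause-and-resume downloader involves, here is a minimal sketch using the standard HTTP `Range` header. It assumes the server hosting the dump supports range requests; the helper names (`resume_offset`, `build_resume_request`, `download`) are hypothetical, and the application's own downloader is not described in this interview, so this is not its actual code.

```python
import os
import urllib.request

def resume_offset(path):
    """Bytes already on disk, which is where the download should pick up."""
    return os.path.getsize(path) if os.path.exists(path) else 0

def build_resume_request(url, path):
    """Build a request that asks only for the bytes we do not have yet."""
    request = urllib.request.Request(url)
    offset = resume_offset(path)
    if offset:
        request.add_header("Range", f"bytes={offset}-")
    return request

def download(url, path, chunk_size=64 * 1024):
    """Fetch `url` into `path`, continuing a partial file if one exists."""
    request = build_resume_request(url, path)
    with urllib.request.urlopen(request) as response:
        # 206 means the server honoured the Range header; any other success
        # status means it is sending the whole file, so start over.
        mode = "ab" if response.status == 206 else "wb"
        with open(path, mode) as out:
            while True:
                chunk = response.read(chunk_size)
                if not chunk:
                    break
                out.write(chunk)
```

Pausing is then just stopping the loop: whatever has been written stays on disk, and calling `download` again with the same arguments continues from that point. A production version would also need to handle the case where the file is already complete, since servers typically answer that with a 416 error.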

James Turner: And presumably you want to do that on a Wi-Fi network if you don’t want to eat up most of your monthly bandwidth from AT&T?

Patrick Collison: Right, right, right. Or I mean I guess if you want, you can really test how honest AT&T are being by saying they’ll give you unlimited data.

James Turner: So how big is the uncompressed original source data for the Wikipedia? And how do you cram it down to two gigabytes?

Patrick Collison: So my memory is that it’s around 12 or 13 gigabytes uncompressed. It comes in this very verbose XML format, so the very first thing we do is convert it to our own custom format that just includes the bare minimum amount of metadata and compacts it in this fairly space-efficient binary way. We manage to strip out 20 or 30 percent of the content just by doing that. And then we apply bzip2 compression, which gets a pretty good compression ratio because, obviously, it’s text. Then we also remove some of the content from the application, the kind of stuff that’s not particularly useful on the phone. We strip out the links, for example, to the article in other languages because those links don’t work in an offline application. We don’t have the other languages. We strip links to pictures because we’re not storing the pictures. We strip out references because I’m assuming you’re not too interested in analyzing the minutiae of the references when you’re using the phone, and that kind of stuff. And so that, again, saved us another 20 to 30 percent or so.

And really, that’s it. I mean what ends up being transferred to the phone is just this huge two gigabyte bzip2 encoded text file that we then index in various ways to allow it to selectively decompress various chunks when a user wants to load an article.
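To make the "index it so you can selectively decompress chunks" idea concrete, here is a small sketch. It compresses articles in independent bzip2 chunks and records, for each title, which chunk to decompress and where the article sits inside it. The chunk size, the NUL separator, the index layout, and the names `build_archive` and `load_article` are all assumptions made for the example; the application's real file format is not described here.

```python
import bz2

def build_archive(articles, articles_per_chunk=2):
    """Compress articles in independent bzip2 chunks and index them by title.

    Returns (blob, index), where index maps each title to
    (byte offset of its chunk, compressed chunk length, position in chunk).
    """
    blob = bytearray()
    index = {}
    titles = list(articles)
    for i in range(0, len(titles), articles_per_chunk):
        group = titles[i:i + articles_per_chunk]
        # Assumption: article text never contains a NUL byte, so it can
        # serve as the separator between articles inside a chunk.
        raw = "\0".join(articles[t] for t in group).encode("utf-8")
        compressed = bz2.compress(raw)
        for position, title in enumerate(group):
            index[title] = (len(blob), len(compressed), position)
        blob += compressed
    return bytes(blob), index

def load_article(blob, index, title):
    """Decompress only the chunk that holds `title` and return its text."""
    offset, length, position = index[title]
    raw = bz2.decompress(blob[offset:offset + length])
    return raw.decode("utf-8").split("\0")[position]

# Hypothetical sample data, for illustration only.
articles = {
    "Ireland": "Ireland is an island in the North Atlantic.",
    "Japan": "Japan is an island country in East Asia.",
    "iPhone": "The iPhone is a smartphone made by Apple.",
}
blob, index = build_archive(articles)
print(load_article(blob, index, "Japan"))
```

The trade-off is chunk size: larger chunks compress better but take longer to decompress for a single article, which matters on a slow device.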

James Turner: So one other question I have is: how about the little things at the top of the article telling you why it’s a bad article or things like that? Do you keep that in?

Patrick Collison: Oh, the info boxes? Right. No, we actually don’t keep those. And one thing we’re considering right now, we’re actually working on a fairly substantial update at the moment, is kind of recognizing those — okay, to backtrack one step, I mean one thing we do not do at the moment is load any templates because it’s too computationally expensive on the phone. The way the Wikipedia website works is that you have this article and it has links to all of these other articles which are kind of special articles, you know, templates. And so to load one article might actually load say 20 articles. But loading one individual article on the phone takes about — well, certainly on the iPhones before the 3GS, takes around five to ten seconds. If you were to do that for ten articles, it would become unusably slow. And so because of that, we don’t load any templates. And, therefore, we don’t show stuff like the stub message or something like that or the neutrality being contested or that kind of stuff. But what we’re working on at the moment is making it, when we’re creating that specially prepared dump for the iPhone, that we recognize some set of templates and then include a special flag with the article title or something that notes for the really important info boxes like say neutrality is contested or stub whatever, that we’d then be able to display something on the phone.
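The flagging idea Collison describes could look something like the sketch below: while preparing the dump on a desktop machine, scan each article's wikitext for a small set of known templates and store a compact bit flag alongside the title, so the phone never has to expand templates itself. The flag values, the function name `article_flags`, and the choice of `{{POV}}` and `-stub` templates as the ones to recognize are assumptions for illustration, not details from the project.

```python
import re

# Hypothetical flag bits, chosen for this example.
FLAG_STUB = 1
FLAG_NEUTRALITY = 2

# Captures the template name in "{{name}}" or "{{name|args...}}".
TEMPLATE_NAME = re.compile(r"\{\{\s*([^|{}]+?)\s*(?:\||\}\})")

def article_flags(wikitext):
    """Return a bitmask describing the notice templates found in an article."""
    flags = 0
    for match in TEMPLATE_NAME.finditer(wikitext):
        name = match.group(1).lower()
        if name.endswith("stub"):   # e.g. {{stub}} or {{Ireland-geo-stub}}
            flags |= FLAG_STUB
        elif name == "pov":         # assumed name of the neutrality notice
            flags |= FLAG_NEUTRALITY
    return flags

sample = "{{POV|date=May 2009}} Some text. {{Ireland-geo-stub}}"
flags = article_flags(sample)
print(bool(flags & FLAG_STUB), bool(flags & FLAG_NEUTRALITY))  # True True
```

On the device, checking a flag is a single bitwise test, so displaying a "stub" or "neutrality contested" banner would add essentially nothing to the article load time.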

James Turner: Wikipedia is obviously a very dynamic thing. It’s literally updated second-to-second. How do you deal with that?

Patrick Collison: Right. So, again, this was one of my concerns in the beginning. I mean I was unsure how useful it would be for that exact reason. But actually, it turns out, to be honest, in practice not to matter all that much. I mean the dump I currently have on my phone is actually from August of ’08, so it’s pretty old. And that causes obvious problems, like if I go to Barack Obama or take a look at something, there’s quite a lot that it does not have. But, at the same time, it’s pretty rare that I have the experience of wishing that my dump had something that it does not in fact have or something has happened since then. We considered doing updates in the form of deltas, where you could download say 100 megabytes per week to bring your dump totally up-to-date. But that actually ended up not being possible because Wikipedia’s own infrastructure can’t handle or at least was not able to handle weekly updates. And so they were releasing dumps every couple of months. And so there wasn’t really much point in us putting together a really advanced update schedule.

They seem to have kind of improved things over the last couple of months. And so I’m noticing that they are releasing dumps now it seems every week or two. And so we’re thinking again about the question of updates and what we could do. But we’re hesitant to put too much work into it just because it seems that the usefulness of the application isn’t affected that much by whether or not it’s six months old or six weeks old. I guess people are pretty used to the idea of any particular reference document not being totally up-to-date. And Wikipedia is really the exception by being up-to-date. But I mean the application for the iPhone, I guess, is conceptually more like a conventional encyclopedia, and that seems to sort of work out okay.

James Turner: How often do you take dumps? Oh, boy. That sounds like the wrong question.


James Turner: How often do you generate dumps?

Patrick Collison: Right, right. So it varies actually for the different languages. For English, just because it’s all so complex, quite infrequently. The latest one available right now is from, I believe, October. And we’re actually working on a new one at the moment. For other languages, it’s quite a bit more frequent. And I think most of the ones available at the moment are from March or thereabouts. And, like I say, we’re hoping to speed that up.

James Turner: You’re going to be speaking about this at OSCON in about three weeks. I was curious; is there anything else at OSCON that’s really caught your eye or you’re interested in?

Patrick Collison: Yeah. So I know that the Cloudera guys are going to be there. And so I’m really looking forward to that. I think they’re working on some pretty cool stuff with Hadoop. And so I’m really interested in having a look at that. And then, also, I’m a huge fan of GitHub and what those guys are doing with kind of changing the dynamic of open source software and interaction and all of the rest. For a long time, I feel like open source software generally was held back by SourceForge and its ilk, which is frequently just not a particularly good infrastructure. Like for a long time, we never really moved beyond SourceForge and mailing lists and CVS or, I guess, now Subversion. And it’s really interesting how Git has, through technology, started to change the sociology or social dynamics of open source software. And GitHub seems to be kind of continuing that with kind of the ecosystem of forks of different projects and allowing those to be later reconciled and all of that kind of stuff. And so I’m pretty excited about that. I’m looking forward to talking to those guys.

James Turner: All right. Well, Patrick Collison, thank you so much for talking to us. And we look forward to seeing you at OSCON.

Patrick Collison: Thank you.
