I spent Tuesday geocoding the addresses of the attendees of this year’s O’Reilly Emerging Technology conference (ETech) into latitude-longitude pairs. I learned how dirty address data is (very) as well as how far the attendees had come (on average, 2020km). Then I consoled myself by making pretty pictures with Excel.
A significant number of our attendees come from California, but not as many as we had thought. This is good news, because it gives us flexibility in where we hold the conference. The average distance travelled was 2020km, which reaches to Vancouver and Houston. 80% of our attendees came from no farther away than Baltimore. Very few of our attendees were from beyond the continental US, which could be an interesting opportunity for us. In total, attendees travelled slightly over one million miles to be at ETech 2005.
The Dirty Details
(Caution: explanation of how pretty graph sausage is made may turn the stomachs of the purists, or confirm the worst suspicions of those born distrustful of straight lines and clean edges)
I fed the attendee list through the geocoder.us web service (powered by Schuyler’s wonderful Geo::Coder::US module) to resolve addresses to lat-long. I estimate that I got about an 80% success rate. In other words, 20% of people either didn’t spell their street or city in a geocoder-approved fashion (e.g., “Suite 250” in the address line tended to bugger it up) or couldn’t spell their street or city to save themselves. Yes, Chris DiBona, I mean you. “Monutain View” indeed.
It was while manually processing the remaining 20% that I realized my first mistake: I was manually processing the remaining 20%. The correct thing to do at this point would be to have written a Perl filter to correct the addresses. Instead I was treating them as a long list of one-offs. Too late, but lesson learned.
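The filter I should have written would have looked something like the sketch below. Mine would have been Perl; this is a Python equivalent, and the specific cleanup rules are assumptions about the failure modes (only the “Monutain View” typo and the “Suite 250” problem are from my actual data):

```python
import re

# Known city-name typos, keyed typo -> correction. Only "Monutain View"
# appeared in my data; a real filter would accumulate more as it went.
CITY_FIXES = {
    "Monutain View": "Mountain View",
}

def clean_address(addr: str) -> str:
    """Normalize an address line before handing it to the geocoder."""
    # Drop suite/apartment designators, which tended to confuse the geocoder.
    addr = re.sub(r",?\s*(?:\b(?:Suite|Ste|Apt|Unit)\b\.?|#)\s*\w+", "",
                  addr, flags=re.IGNORECASE)
    # Tidy up whitespace and stray commas left behind by the removal.
    addr = re.sub(r"\s{2,}", " ", addr).strip(" ,")
    # Apply the known city-name corrections.
    for typo, fix in CITY_FIXES.items():
        addr = addr.replace(typo, fix)
    return addr
```

Run the whole attendee list through a filter like that once, rather than fixing each rejected address by hand, and the 20% failure pile shrinks with every rule you add.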
Another problem I ran into was P.O. Boxes. These don’t resolve to lat-long. Instead I ended up Googling for the city name and taking whatever center-of-city lat-long I could find. I suspect I ran into another problem here, where one data source uses one geodetic datum (one model of the earth’s shape) to locate points, and another data source uses an entirely different one. Less of an issue for me, I think, given how much of the data was, shall we say, extrabuttular.
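The fallback logic is simple enough to sketch (again in Python rather than the Perl I actually used). The centroid coordinates below are rough, hand-entered values of the kind I was Googling for; treat them as illustrative, not authoritative:

```python
import re
from typing import Optional, Tuple

# Approximate city-centre lat-longs, collected by hand. These particular
# values are illustrative examples, not survey-grade data.
CITY_CENTROIDS = {
    ("Sebastopol", "CA"): (38.40, -122.82),
    ("San Diego", "CA"): (32.72, -117.16),
}

# Matches "PO Box", "P.O. Box", "P. O. Box", etc.
PO_BOX = re.compile(r"\bP\.?\s*O\.?\s*Box\b", re.IGNORECASE)

def locate(street: str, city: str, state: str) -> Optional[Tuple[float, float]]:
    """Return a lat-long, falling back to the city centroid for P.O. Boxes."""
    if PO_BOX.search(street):
        # A P.O. Box has no street location; settle for the city centre.
        return CITY_CENTROIDS.get((city, state))
    # Otherwise a real geocoder (e.g. geocoder.us) would be queried here.
    return None
```

The centroid fudge costs you a few miles of accuracy per P.O. Box, which matters or doesn’t depending on what you’re doing with the points (see the lesson-learned below).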
Along the way of Googling for city names, I discovered that if you search for an address, get Google to map it, and view the source to the page, the lat-long is in there. Off to the interweb to find this lovely script which uses Google Maps to do geocoding. The best part of this is that Google Maps also geocodes Canada and the UK. Huzzah! Of course, this is rather hacky and not guaranteed to stay around (or even be legal). (We’ll be talking about other legal, free, working APIs for geocoding, mapping, etc. at Where 2.0, by the way.)
Then I realized I only had one attendee from London. This did not match reality. I distinctly remember drinking with what seemed to be the entire staff of the BBC, down to the man who polishes the Queen’s false teeth before her Christmas address. Aha! I had failed to account for speakers: they’re in a separate table from regular attendees. Back to the database, this time run through the snazzy Google geocoder, and out comes more lat-longs. Not as many as I’d hoped for, though: most of the speaker records were created by our speaker coordinator, not by the speaker, and so were lacking crucial parts of address data. For example, for one gentleman in the UK we had his street name and number but not the town in which he lived. There was a lot of “either city or country but not both”, which was relatively simple to hack around (provided I was able to live with the assumption that there’s only one “London” in the world). More noisy data.
I ran the resulting mass of data through some Perl to calculate distances (thank you, Math::Trig) and shunted it into Excel. I put the data into 25km intervals and plotted a histogram. I had also prepared a graph in which everyone from a single location showed up as a plateau, a longer plateau meaning more people from that location, but it wasn’t as easy to grasp as the histogram, so I decided not to go with it.
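The distance arithmetic is just a great-circle calculation (Math::Trig provides great_circle_distance in Perl; below is an equivalent Python sketch using the haversine formula) plus the 25km bucketing for the histogram:

```python
from math import radians, sin, cos, asin, sqrt
from collections import Counter

EARTH_RADIUS_KM = 6371.0  # mean earth radius; fine at the 10-20 mile fudge level

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat-long points, in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

def histogram_25km(distances):
    """Bucket distances into 25km intervals, keyed by each bin's lower edge."""
    return Counter(int(d // 25) * 25 for d in distances)
```

Compute one distance per attendee from the venue’s lat-long, feed the list through the binning function, and the resulting counts paste straight into Excel as the histogram data.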
Lesson learned: geocoding is a nightmare. It was fine for me to fudge and use city centroids as approximations, but those kinds of bad locations would hurt the validity of conclusions if I’d been trying to reconcile attendees against known demographic areas. I really understand why Laser-Scan and companies like it are making businesses out of data quality. I also figure, though, that so long as I’m not trying to drop non-nuclear warheads on the houses of attendees (or figuring out to whom I should market Plasma TVs vs Cheez-Doodles), fudging to within 10 or 20 miles is near enough.
Next step … find a mapping platform that’ll let me visualize attendees on a map. I’ve begun to experiment with worldKit, so hopefully I’ll have something to show soon.