May 10

Nat Torkington

ETech 2005: Where Did They All Come From?

I spent Tuesday geocoding the addresses of the attendees of this year's O'Reilly Emerging Technology conference (ETech) into latitude-longitude pairs. I learned how dirty address data is (very) as well as how far the attendees had come (on average, 2020km). Then I consoled myself by making pretty pictures with Excel.

A significant number of our attendees come from California, but not as many as we had thought. This is good news, because it gives us flexibility in where we hold the conference. The average distance travelled was 2020km, which reaches to Vancouver and Houston. 80% of our travellers came from Baltimore or closer. Very few of our attendees were from beyond the continental US, which could be an interesting opportunity for us. In total, attendees came slightly over one million miles to be at ETech 2005.

The Dirty Details

(Caution: explanation of how pretty graph sausage is made may turn the stomachs of the purists, or confirm the worst suspicions of those born distrustful of straight lines and clean edges)

I fed the attendee list through the geocoder.us web service (powered by Schuyler's wonderful Geo::Coder::US module) to resolve addresses to lat-long. I estimate that I got about an 80% success rate. In other words, 20% of people either didn't spell their street or city in a geocoder-approved fashion (e.g., "Suite 250" in the address line tended to bugger it up) or couldn't spell their street or city to save themselves. Yes, Chris DiBona, I mean you. "Monutain View" indeed.

It was while manually processing the remaining 20% that I realized my first mistake: I was manually processing the remaining 20%. The correct thing to do at this point would be to have written a Perl filter to correct the addresses. Instead I was treating them as a long list of one-offs. Too late, but lesson learned.
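A minimal sketch of the kind of filter described above, written in Python rather than Perl. It is an illustration under stated assumptions, not the script that should have been written: the function names, the typo table, and the list of unit designators are all hypothetical, and only the "Suite 250" and "Monutain View" cases come from the post itself.

```python
import re

# Hypothetical typo table; only "Monutain View" is taken from the post.
CITY_FIXES = {"monutain view": "Mountain View"}

# Trailing unit designator such as ", Suite 250" or " Ste. 4B".
# The list of designators is an assumption, not a geocoder.us rule.
UNIT_RE = re.compile(
    r",?\s*\b(?:suite|ste|unit|apt)\b\.?\s*#?\s*\d+[A-Za-z]?\s*$",
    re.IGNORECASE,
)


def clean_street(street):
    """Strip a trailing unit designator and collapse runs of whitespace."""
    street = UNIT_RE.sub("", street)
    return " ".join(street.split())


def clean_city(city):
    """Collapse whitespace and apply known spelling corrections."""
    normalized = " ".join(city.split())
    return CITY_FIXES.get(normalized.lower(), normalized)
```

Each new failure becomes one more entry in the table or one more pattern, so the fix is applied to every later record instead of being retyped by hand.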

Another problem I ran into was P.O. Boxes. These don't resolve to lat-long. Instead I ended up Googing for the city name and taking whatever center-of-city lat-long I could find. I suspect I ran into another problem here, where one data source uses one model of the earth's sphericity to locate points, and another data source uses an entirely different model. Less of an issue for me, I think, given how much of the data was, shall we say, extrabuttular.
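The P.O. Box fallback can be sketched roughly as follows. This is a hedged illustration in Python: `street_geocoder` stands in for whatever street-level service is used, `city_centroids` is a hypothetical lookup table of city-center coordinates, and neither reflects a real API. Returning a precision label alongside the coordinates is an added design choice, so that approximate points can be told apart from exact ones later.

```python
import re

# Matches "P.O. Box", "PO Box", "P O Box", and similar spellings.
PO_BOX_RE = re.compile(r"\bP\.?\s*O\.?\s*Box\b", re.IGNORECASE)


def geocode_with_fallback(street, city, street_geocoder, city_centroids):
    """Return (lat, lon, precision), or None if nothing resolves.

    street_geocoder: hypothetical callable (street, city) -> (lat, lon) or None.
    city_centroids: hypothetical dict mapping lowercase city name -> (lat, lon).
    """
    # Only try street-level geocoding for real street addresses.
    if street and not PO_BOX_RE.search(street):
        hit = street_geocoder(street, city)
        if hit is not None:
            return (hit[0], hit[1], "street")

    # Fall back to the city center for P.O. Boxes and failed lookups.
    centroid = city_centroids.get(city.strip().lower())
    if centroid is not None:
        return (centroid[0], centroid[1], "city")
    return None
```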

Along the way of Googling for city names, I discovered that if you search for an address, get Google to map it, and view the source to the page, the lat-long is in there. Off to the interweb to find this lovely script which uses Google Maps to do geocoding. The best part of this is that Google Maps also geocodes Canada and the UK. Huzzah! Of course, this is rather hacky and not guaranteed to stay around (or even be legal). (We'll be talking about other legal free working APIs for geocoding, mapping, etc. at Where 2.0, by the way.)

Then I realized I only had one attendee from London. This did not match with reality. I distinctly remember drinking with what seemed to be the entire staff of the BBC, down to the man who polishes the Queen's false teeth before her Christmas address. Aha! I had failed to account for speakers--they're in a separate table from regular attendees. Back to the database, this time run through the snazzy Google geocoder, and out comes more lat-longs. Not as many as I'd hoped for, though: most of the speaker records were created by our speaker coordinator, not by the speaker, and so were lacking crucial parts of address data. For example, for one gentleman in the UK we had his street name and number but not the town in which he lived. There was a lot of "either city or country but not both" which was relatively simple to hack around (provided I was able to live with the assumption that there's only one "London" in the world). More noisy data.

I ran the resulting mass of data through some Perl to calculate distances (thank you, Math::Trig) and shunted it into Excel. I ended up putting the data into 25km intervals and plotting a histogram. I had prepared a graph where a lot of people from a single location showed up as a plateau on the graph with a longer plateau meaning more people from that location. It wasn't as easy to grasp as the histogram, so I decided not to go with it.
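The distance-and-binning step can be sketched in Python as below. This is not the Perl that was actually run: it swaps in the haversine formula on a spherical earth with a mean radius of 6371 km (an assumption), and the function names are hypothetical. The 25km bin width is the one figure taken from the post.

```python
import math
from collections import Counter

EARTH_RADIUS_KM = 6371.0  # mean radius; assumes a perfectly spherical earth


def great_circle_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    # Clamp to guard against rounding pushing a just above 1.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def histogram(distances_km, bin_width_km=25):
    """Count distances per bin, keyed by each bin's lower edge in km."""
    counts = Counter(int(d // bin_width_km) * bin_width_km
                     for d in distances_km)
    return dict(sorted(counts.items()))
```

Feeding every attendee's distance from the venue through `histogram` gives one row per 25km interval, which is the shape a spreadsheet column chart needs.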

Lesson learned: geocoding is a nightmare. It was fine for me to fudge and use city centroids as approximations, but those kinds of bad locations would hurt the validity of conclusions if I'd been trying to reconcile attendees against known demographic areas. I really understand why Laser-Scan and companies like it are making businesses out of data quality. I also figure, though, that so long as I'm not trying to drop non-nuclear warheads on the houses of attendees (or figuring out to whom I should market Plasma TVs vs Cheez-Doodles), fudging to within 10 or 20 miles is near enough.

Next step ... find a mapping platform that'll let me visualize attendees on a map. I've begun to experiment with worldKit, so hopefully I'll have something to show soon.

Comments: 12

  Marc Hedlund [05.10.05 11:23 PM]

Yeah, Chris DiBona could have avoided spelling mistakes if he'd done some of that "Googing for the city name" you talk about. The Googe spellchecker is second to none. :)

  Chris [05.10.05 11:54 PM]

Ummm... where is the rest of Europe? I distinctly remember quite a headcount from Scandinavia. And Leeds.

Was there really only one paying customer from London?

The very very small hippy part of me says - 1 million miles of air travel is a lot. FutureForests say that's over 700 trees that need to be planted to balance CO2 emissions.

  Chris DiBona [05.11.05 12:02 AM]

Moutnin Vue? :-) At least I get airport codes right, it's all I need, man. I'm just the data packet, the cab/plane/train/car service/hertz neverlost/nav system is the router.

  Julian Bond [05.11.05 12:22 AM]

Europe and UK do seem to be under-represented in your data. There were times during the conference when it seemed like the Brits had taken over. Maybe it's just they were all giggling a lot, or that they were more active on IRC, or had taken over the bar. And what about the "Virtual SuW"? Does a person from Wales attending via iChat count?

More seriously, the contact detail for attendees and especially presenters seemed surprisingly thin. For instance, I'm sure there were a lot more people with a Skype address than mentioned this. Perhaps next year the data entry form should have a FOAF import filter. And to make geocoding easier, make Zip/Postcode a required field. There are some good Zip/Postcode to Lat/Long databases freely available.

And lastly, it always seems like conferences are overly protective of all this data. I'd really like to see the contact list online in the run up to the conference and afterwards. It shouldn't be too hard to have an online registration form that has a privacy switch and where access to the data is only available if you've registered.

  gnat [05.11.05 07:44 AM]

"Ummm... where is the rest of Europe? I distinctly remember quite a headcount from Scandinavia. And Leeds."

The registration database also had a single Swede, who so distorted the graphs that I left him off as an outlier. I also remember a lot more paying Britons and plenty more Scandinavians. I think their address data was incomplete, which is why they don't appear on the graph.

"The Googe spellchecker is second to none."

Ok, Marc, you've now gotten me back for my "glad you dressed up" line in front of Donna Dubinsky and Jeff Hawkins.

"More seriously, the contact detail for attendees and especially presenters seemed surprisingly thin."

Let's face it, when you're registering for a conference, you're not interested in typing in a dozen different ways you can be contacted. Perhaps for future conferences we should add a field for "username on social networking system of choice" and let everyone else figure out contact details from there?

"And lastly, it always seems like conferences are overly protective of all this data."

I've been noodling lately on the idea of a privacy-free conference. Think of it as David Brin's Transparent Society in conference form: you get RFID tagged badges, we have readers everywhere. There are ubiquitous web cams and microphones. All conversations are recorded and available online after the fact. No encrypted traffic allowed on the network, so everyone would create a disposable Yahoo! mail account for the duration of the conference. I've not decided what the sessions would be about--mining the digital data streams for information?

  David Sklar [05.11.05 09:03 AM]

"I've been noodling lately on the idea of a privacy-free conference. Think of it as David Brin's Transparent Society in conference form"

Yes, please do this as a wonderful demonstration of how the asymmetry of social relationships makes the privatopia explored in "The Transparent Society" mostly useless. Also to see how many conference attendees would agree with Brin that encryption is an "ornate and unproved technology".

  Ed [05.11.05 11:57 AM]

Congratulations on entering the wonderful world of mapping! The first, and biggest problem people often face when trying to create what should be a simple map is....crappy data. In your case, crappy address data.

That hit rate for unaided geocoding is relatively high...typically, there are a lot more spelling errors, local aliases for street names, mis-entered zipcodes, etc. That's why geocoding is a particularly interesting problem in mapping...it has relatively little to do with actual spatial analysis and calculation, and a lot to do with: soundex-type spelling error correction, catching mismatches between street addresses and zip codes, providing different levels of fall-back geocoding, providing appropriate error codes and batch error correction, etc.

Also, most mapping software SHOULD have a relatively simple option to calculate distances (at least a straight line distance) from one point to all others. The use of trigonometry should be purely at one's discretion.

Actually, the other problem you mentioned - using similar Earth projections and datums - is very important when you're trying to create highly accurate maps...say, of the taxable area of your property, or of that fiber cable bringing you broadband....cases where the difference of feet matters. In other cases though - your generalized map view....that accuracy is neither necessary nor warranted. The REAL problem with different projections and datums comes in when you want to use one layer of map data with another, created in a different projection. Then, you need to start converting data formats, manipulating other attributes of the underlying spatial data....that's when it REALLY starts getting messy.

Just some comments from an industry insider...

PS: The Location Intelligence conf. in Philly was "same-old, same-old" if you ask me...

  Steven Citron-Pousty [05.11.05 01:47 PM]

I couldn't resist the sausage and gave some feedback on your analysis in my own posting.

  Jeff Carroll [05.11.05 03:35 PM]

If you were to develop the deep pockets that you need to buy Ed's software, you'd discover another thing about address data - it changes. A lot. I used to be development lead for a customer service mapping application at one of his large customers, and I got MapInfo address data updates far more often than I could imaginably make use of them.

I used to get frequent user complaints demanding that the application geocode military APO postal addresses for customers stationed overseas. When I asked what location I should provide, the customer indicated that I should return lat/long, and thus cellular coverage, for the post office that forwarded the mail. Years later, I've yet to imagine a situation in which that could have been useful.

I'm also frequently amused by Anglophones laboring under the impression that CEDEX is a suburb of Paris.

  Christian [05.14.05 04:14 PM]

Great entry. However, we were three people from Denmark. And since it's such a small country, we're always very keen to have it mentioned ;-)!
