Joe Stump on data, APIs, and why location is up for grabs

The SimpleGeo CTO and former Digg architect discusses NoSQL and location's future

I recently had a long conversation with Joe Stump, CTO of SimpleGeo, about location, geodata, and the NoSQL movement. Stump, who was formerly lead architect at Digg, had a lot to say. Highlights are posted below. You can find a transcript of the full interview here.

Competition in the geodata industry:

I personally haven’t seen anybody that has come out and said, “We’re actively indexing millions of points of data. We’re also offering storage and we’re giving tools to leverage that.” I’ve seen a lot of fragmentation. Where SimpleGeo fits, I really think, is at the crossroads or the nexus of a lot of people that are trying to figure out this space. ESRI is a perfect example. They have a lot of data. Their stack is enormous. They answer everything from logistics down to POI things, but they haven’t figured out the whole cloud, web, infrastructure, turn-key approach. They definitely haven’t had to worry about real time: how do you index every single tweet and every single Twitter photo without blowing up? Among the data providers, there are a couple of people coming out with APIs and such.

I think largely, things are up for grabs. One of the issues I see is, as people come out with their location APIs — NAVTEQ is coming out with an API, for example — as a developer, in order to do location queries and whatnot, especially on a mobile device, I don’t want to have to do five different queries, right? Those round trips can add up a lot when you’re on high-latency, slow networks. So while I think there are a lot of people orbiting the space, and I know a lot of people are going to be coming out with location and geo offerings, a lot of people are still figuring out how they’re going to use location and how they’re going to expose it.

How SimpleGeo stores location data:


The way that we’ve gone about doing that is we actually have two clusters of databases. We’ve essentially taken Cassandra, the non-relational NoSQL store that Facebook made, and we’ve built geospatial features into it. The way that we do that is we actually have an index cluster and then we have a records cluster. The records cluster is very mundane; it’s used for getting the actual data. The index cluster, however, we have tackled in a number of different ways. But the general idea is that those difficult computations that you’re talking about are done on write rather than read. So we precompute a number of scenarios and add those to different indexes. And then based on the query, we use the most appropriate precomputed index to answer those questions.

We do that in a number of different ways. We do that through very careful key construction based on a number of different algorithms that are developed publicly and a couple that we developed internally. One of the big problems with partitioning location data is that location data has natural density problems. If you partition based on UTM or zip code or country or whatever, servers that are handling New York City are going to be overloaded. And servers that are handling South Dakota are going to be bored out of their minds. We basically have answered that in a couple of different layers. And then we also have some in-process stuff that we do before serving up the request.
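One publicly developed algorithm of the kind Stump alludes to is the geohash; whether SimpleGeo's key construction actually uses geohashes is an assumption, but the sketch below shows the general idea: interleaving latitude and longitude bits produces string keys where nearby points share prefixes, so a prefix range scan over an ordered store pulls back spatially close records.

```python
# Minimal geohash encoder (illustrative; not SimpleGeo's actual scheme).
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash(lat, lon, precision=6):
    """Interleave longitude/latitude bisection bits into a base-32 string."""
    lat_range, lon_range = [-90.0, 90.0], [-180.0, 180.0]
    hashed, bits, ch, even = [], 0, 0, True
    while len(hashed) < precision:
        rng, val = (lon_range, lon) if even else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if val >= mid:
            ch = (ch << 1) | 1  # point is in the upper half
            rng[0] = mid
        else:
            ch = ch << 1        # point is in the lower half
            rng[1] = mid
        even = not even
        bits += 1
        if bits == 5:           # 5 bits per base-32 character
            hashed.append(BASE32[ch])
            bits, ch = 0, 0
    return "".join(hashed)

# Two points a few hundred meters apart share a long common prefix,
# so they land near each other in key order.
print(geohash(40.7580, -73.9855))  # near Times Square
print(geohash(40.7587, -73.9847))
```

Because the keys sort spatially, a "points near X" query becomes a range scan over one key prefix rather than a full scan, which is exactly the kind of work that can be pushed to write time.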

The virtues of keeping your data API simple:

We started out with what we considered to be the most basic, widely needed use case for developers, which is simply “my user is here; tell me about points of data within a certain radius of where my user is sitting.” And we’ve been slowly growing indexes from there.

Our most basic index is: my user is at this lat/long; tell me about any point that I’ve told you about previously within one kilometer. That’s a very basic query. We do some precomputing on the write to help with those types of queries. The next level up is: my user was here a week ago and would like to see stuff that was produced or put into the system a week ago. So we’ve, again, done some precomputing on the indexes to do temporal-spatial queries.

A lot of users, once they see a storage thing in the cloud, they inevitably want to be able to do very open-ended abstract queries like “show me all pieces of data that are in this zip code where height is this and age is greater than this, ordered by some other thing.” We push back on that. Basically, we’re not a generic database in the cloud. We’re there specifically to help you manage your geodata.

Why NoSQL is gaining in popularity:

I think that this is in direct response to the relational database tool chain failing, and failing catastrophically, at performing in large-scale, real-time environments. The simple fact is that a relational database spread across 50 servers is not an easy thing to build or manage. There’s always some sort of app logic you have to build on top of it. And you have to manage the data and whatnot.

If you go into almost any high-performance shop that’s working at a pretty decent scale, you’ll find that they avoid foreign key constraints because they slow down writes. They’re partitioning their data across multiple servers, which drastically increases the overhead of managing their data. So I think the reason you’re seeing an uptick in NoSQL is that people have tried that tool chain and it’s not evolving. It basically falls over in real-time environments.

The role of social networking in the demise of SQL:

I bet if you track the popularity of social, you’ll almost identically track the popularity of NoSQL. Web 1.0 was extremely easy to scale because there was a finite amount of content, highly cacheable, highly static. There was just this rush to put all of the brick-and-mortar stuff online. So scaling, for instance, an online catalog is not that difficult. You’ve got maybe 50,000 products. Put it in memcached. MySQL can probably handle those queries. You’re doing maybe an order every second or something. It’s high value and not a lot of volume. And then, of course, during Web 2.0, we had this bright idea to hand content creation over to the masses. If you draw a diagram and one circle is users and one circle is images, scaling out a whole bunch of users and scaling out a whole bunch of pictures is not too difficult, because it’s the same pattern: I do a primary key look-up and then cache it in memcached, because people don’t edit their user data that often. And once they upload a photo, they pretty much never change it.
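The "put it in memcached" approach Stump mentions is the cache-aside pattern. A minimal sketch, with a plain dict standing in for memcached and a stub function standing in for the MySQL primary-key lookup (both are stand-ins, not real client code):

```python
cache = {}  # stand-in for a memcached client

def fetch_user_from_db(user_id):
    # Stand-in for: SELECT * FROM users WHERE id = %s
    return {"id": user_id, "name": f"user-{user_id}"}

def get_user(user_id):
    key = f"user:{user_id}"
    if key in cache:                      # cache hit: no database round trip
        return cache[key]
    row = fetch_user_from_db(user_id)     # cache miss: one PK lookup
    cache[key] = row                      # profiles change rarely, so they cache well
    return row

print(get_user(42))
print(get_user(42))  # second call served from the cache
```

This works precisely because Web 1.0-style objects are read-heavy and rarely edited; the cache invalidation burden stays low.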

The problem comes when you intersect those social objects. The intersection in that Venn diagram, which is a join in SQL, falls over pretty quickly. You don’t need a very big dataset. Even for a fairly small website, MySQL tends to fall over pretty quickly on those big joins. And most of the NoSQL stuff that people are using, they’ve found that if you’re just doing a primary key look-up 99 percent of the time on some table, you don’t need to use MySQL and worry about managing that, when you can stuff it into Cassandra or some other NoSQL DB because it’s basically just a key-value store.
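The alternative to that join is to materialize it at write time. A minimal sketch with a hypothetical user/photo schema and plain dicts standing in for key-value column families (the schema and names are invented for illustration):

```python
# Denormalize the user<->photo "join" at write time so reads are pure
# key lookups -- the access pattern a key-value store is built for.
users = {}        # user_id -> profile row
photos = {}       # photo_id -> photo row
user_photos = {}  # user_id -> list of photo_ids (the materialized join)

def add_photo(user_id, photo_id, url):
    photos[photo_id] = {"url": url, "owner": user_id}
    user_photos.setdefault(user_id, []).append(photo_id)  # the extra write

def photos_for(user_id):
    # Read side: two key lookups, no join.
    return [photos[pid] for pid in user_photos.get(user_id, [])]

users["u1"] = {"name": "alice"}
add_photo("u1", "p1", "http://example.com/p1.jpg")
print(photos_for("u1"))
```

The trade is explicit: one more write per photo in exchange for reads that never touch a join, which is the 99-percent case Stump describes.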

How NoSQL simplifies data administration:

Essentially, there are a lot of people out there that are “using MySQL,” but they’re using it in a very, very NoSQL manner. At Digg, for instance, joins were verboten: no foreign key constraints, only primary key look-ups. If you had to do ranges, you kept them highly optimized and basically did the joins in memory. And it was really amazing. For instance, we rewrote comments about a year and a half ago, and we switched from doing the sorting in MySQL to doing it in PHP. We saw a 4,000 percent increase in performance on that operation.
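That rewrite pattern — fetch rows by primary key, then sort in application memory — can be sketched as follows (in Python rather than Digg's PHP, with an invented comment schema):

```python
# Primary-key store: comment_id -> row (stand-in for a MySQL table
# accessed only by PK, per the "no joins, no ORDER BY" discipline).
comments = {
    1: {"story": 7, "score": 12, "body": "first"},
    2: {"story": 7, "score": 40, "body": "second"},
    3: {"story": 7, "score": 3,  "body": "third"},
}
story_comments = {7: [1, 2, 3]}  # story_id -> comment ids (the "range")

def top_comments(story_id):
    # One cheap id-list fetch, then N primary-key look-ups...
    rows = [comments[cid] for cid in story_comments[story_id]]
    # ...and the sort happens in the application, not the database.
    return sorted(rows, key=lambda r: r["score"], reverse=True)

print([c["body"] for c in top_comments(7)])  # ordered by score, highest first
```

The database is left doing only what it is fastest at (PK fetches), while the CPU-bound sort moves to app servers, which scale horizontally far more easily than a write-master database.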

There’s just so many things where you have to basically use MySQL as a NoSQL store. What people have found is, rather than manage these cumbersome MySQL clusters, with NoSQL they get those basics down. With MySQL, if I wanted to spread data across five servers, I’d have to create an application layer that knows that every user that starts with “J” goes to this server; every user that starts with “V” goes over here. Whereas with Cassandra, I just give it token ranges and it manages all of that stuff internally. And with MySQL, I’d have to handle something like, “Oh, crap. What if that server goes down?” Well, now I’m going to have three copies across those five servers. So I build more application logic. And then I build application logic to migrate things. Like what happens if I have three Facebook users that use up a disproportionate amount of resources? Well, I’ve got to move this whale off this MySQL server because the load’s really high. And if I moved him off of it, it would be nothing. And then you have to make management tools for migrating data and whatnot.
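The contrast Stump draws — per-letter routing logic in the application versus Cassandra's token ranges — can be sketched with a toy hash ring. Each node owns a contiguous slice of the token space and a key is routed by hashing it onto the ring; this is a simplified model of what Cassandra manages internally, and the node names and ring layout are invented:

```python
import bisect
import hashlib

RING_SIZE = 2 ** 32
# Each node owns the token range starting at its entry in `tokens`.
tokens = [0, RING_SIZE // 4, RING_SIZE // 2, 3 * RING_SIZE // 4]
nodes = ["node-a", "node-b", "node-c", "node-d"]

def node_for(key):
    """Hash the key onto the ring and find the node owning that range."""
    token = int(hashlib.md5(key.encode()).hexdigest(), 16) % RING_SIZE
    return nodes[bisect.bisect_right(tokens, token) - 1]

# Hashing spreads alphabetical hot spots (all the J-users, the whales)
# evenly across nodes, with no routing logic in the application.
for user in ["jsmith", "jdoe", "vlad"]:
    print(user, "->", node_for(user))
```

Rebalancing in this model means adjusting token ranges, not rewriting application code, which is exactly the operational burden Stump says Cassandra lifts.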

Cassandra has all of that stuff built in. All of the annoying things that we had to deal with at the beginning, NoSQL handles really, really well. Coming from a shop that has partitioned MySQL, Cassandra is a godsend.



  • ian

    Thanks for posting, James, and I suspect Brady feels you needling into his turf ;)

    What SimpleGeo has done is great all around, but let’s be clear that they aren’t the only ones thinking about the challenge of creating highly performant geodata stores — obviously Facebook, Twitter, the mobile/social crowd and many others have an immediate (or near-term) need for scalable geoinfrastructure. It’s clear the big guys will want to develop/own this stack, and there’s plenty of opportunity for SimpleGeo and others to support the needs of small, fast-growing orgs who need the read/write. However, it’s important to understand what market segments play to NoSQL.

    At the risk of sounding like a Luddite, there are companies/industries that may not require this scale — today. Certainly in mobile/social and consumer-facing web publishing there’s a compelling need to have write on steroids, but across enterprise segments and others, I will argue to my grave that an expressive API is more valuable than a simple one.

    Data, the new iceberg of IT, also needs to be heard. At Urban Mapping, we’ve sourced tens of thousands of variables, from obesity rates to noise levels around airport runways, voter registration, floodplains, per capita income and a boatload more. Our hosted geoservices platform (Mapfluence) makes these data sets available via a common query language to support visualization and data queries. The cost of in-licensing data (researching, sourcing, ETLing, maintaining) requires time on task. It’s a curated process, partially editorial and partially technical.

    The good news is that Urban Mapping, SimpleGeo and many others are biting away at traditional GIS from the outside (i.e., the web) and moving in. This bodes well for startups, with lots to gain, and poses challenges for entrenched players, who have much at risk.

  • K.S. Bhaskar

    How does SimpleGeo relate to Open Street Map (http://www.openstreetmap.org) and the OSM XAPI (http://wiki.openstreetmap.org/wiki/Xapi)?

  • Kevin Bedell

    Having struggled with MySQL and TB+ databases, this post really struck a chord with me. The limitations of SQL-based databases as they scale to be very large (no joins, only primary key fetches, no ranged queries) are very real to me — I spent hours discovering and working around these limitations.

    So this approach looks very good to me for *very large* data applications.

    I’m not sure yet how to think about applying these tools to smaller problems — I’d be interested to know where/why these tools fall down for smaller problems (if they do).

    I’d also be interested in understanding more about what this means for data/object modeling. If I’m not doing traditional database design under my object models, what *am* I doing?

    Nice article — well done!

  • Anand Venkataraman

    Loved this article and resonated strongly with it. At my previous company, we had the good fortune of extrapolating to foresee a potential scalability issue with a sharded MySQL database exactly like Joe talks about. We migrated all our critical queries to an in-memory distributed hash solution and immediately noticed a performance lift not unlike the one that Joe found at Digg.

    Clearly Cassandra is awesome, and your article equally so!


  • Ben Engber

    Great article that really pinpoints the critical limitations relational databases have in serving modern internet applications. I would argue that “using MySQL in a NoSQL manner” is actually a very good thing.

    At Thumbtack, when we advise clients on how to scale out their systems we always run into the reality that it is very hard to move a company away from a trusted and proven component (i.e. RDBMS) towards what is often viewed as the latest fad. If we instead take the approach of moving applications to use a traditional datastore like MySQL in a NoSQL compatible way, we create an architecture that scales well and also opens the door for a more transparent migration to a full NoSQL solution later.

    You’re absolutely right in that this puts more burden on the application. In essence, you provide the lowest common denominator of NoSQL services and force the application to work around that limited functionality. On the other hand, it allows you to tackle development in a safe, phased approach, and is a very practical way to get organizations started down the right path.
