This is a full, unedited transcript. For a shorter version of the same interview, please read "Joe Stump talks location and NoSQL"

by James Turner

The NoSQL movement makes the argument that for a lot of data collections, relational storage is not only wasteful, but actually worse than doing it non-relationally. Joe Stump, CTO of SimpleGeo, has certainly found that to be the case for location data. Stump, who was formerly the lead architect at Digg, will be speaking at the Where 2.0 conference at the end of the month, and recently talked to me about his current feelings, location-wise. He started by talking about how he became a location guru.

Joe Stump: I originally read a Slashdot posting. It was a comment, probably almost ten years ago, where someone had mentioned that Apple was working on laptops, internally, that had GPS chips in them, and that a future version of their HFS file system would actually take lat/long as a journal component to a file. That was my light bulb moment for location, the idea of being able to say, "Show me all of the Word documents I created at home." I think location is a really interesting contextual piece of metadata that makes other data much more interesting.

What we're currently doing at SimpleGeo actually sprang forth because Matt and I originally got together to do location gaming. Our very first game, that we had mocked up and started coding was called Moguls, and it's almost exactly what MyTown is. As a result of that, we started building this fairly comprehensive location infrastructure. And for various reasons, Matt and I never made games. Investors don't generally invest in gaming companies until later stages, et cetera, et cetera. We said, "This probably isn't going to work. So what can we do that will work? And do we have anything that we can salvage?"

We had the storage engine that we had built for the back-end for these games. I told Matt, "I think that's some really interesting technology there." We started talking that out, and one thing led to another. We asked, "Well, what if we became the S3 of location, this HTTP interface in the cloud where people could store and query and manage their POI and other location data?"

James Turner: I've done things like dealer locator code, and just the math to answer the question, how close are two things, is not trivial; it's got some trig functions in it. When you start to do large scale stuff like that, it really doesn't scale very well. Is there a way to approach this where you can store location in a way that makes it easy to do these relational things?
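The trig in question is typically the haversine formula for great-circle distance. The following is a minimal sketch, assuming a spherical Earth with a mean radius of 6371 km; the function names and the store dictionary are illustrative and do not come from the interview. It shows why the naive approach scales poorly: every query pays for one distance computation per stored point.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius; the real Earth is not a perfect sphere


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/long points (degrees)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def stores_within(origin, stores, radius_km):
    """Naive locator: one trig-heavy distance computation per store, on every query."""
    lat, lon = origin
    return [name for name, (slat, slon) in stores.items()
            if haversine_km(lat, lon, slat, slon) <= radius_km]
```

With N stored points, each lookup costs N distance computations at read time, which is the work that the next answer describes moving to write time.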

Joe Stump: The way that we've gone about doing that is we actually have two clusters of databases. We've essentially taken Cassandra, the NoSQL non-relational store that Facebook made, and we've built geospatial features into it. The way that we do that is we actually have an index cluster and then we have a records cluster. The records cluster is very mundane; it's used for getting the actual data. The index cluster, however, we have tackled in a number of different ways. But the general idea is that those difficult computations that you're talking about are done on write rather than read. So we precompute a number of scenarios and add those to different indexes. And then based on the query, we use the most appropriate precomputed index to answer those questions.
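The compute-on-write idea can be illustrated with a toy grid index. This is a sketch under stated assumptions, not SimpleGeo's actual design: the two dictionaries stand in for the records and index clusters, and the fixed 0.01-degree cell size and all names are hypothetical.

```python
import math
from collections import defaultdict

CELL_DEG = 0.01  # assumed cell size in degrees (roughly 1.1 km of latitude)

records = {}              # stand-in for the "records cluster": id -> stored data
index = defaultdict(set)  # stand-in for the "index cluster": cell key -> ids


def cell_key(lat, lon):
    """Quantize a coordinate to a grid cell; this is the precomputed part."""
    return (math.floor(lat / CELL_DEG), math.floor(lon / CELL_DEG))


def write_point(point_id, lat, lon, payload):
    """Do the indexing work once, at write time."""
    records[point_id] = {"lat": lat, "lon": lon, **payload}
    index[cell_key(lat, lon)].add(point_id)


def nearby_candidates(lat, lon):
    """Read path: fetch the query cell and its 8 neighbors by key, no trig needed."""
    cx, cy = cell_key(lat, lon)
    ids = set()
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            ids |= index.get((cx + dx, cy + dy), set())
    return [records[i] for i in sorted(ids)]
```

The read path becomes nine key lookups regardless of how many points are stored. A real system would still filter the candidates by exact distance and would need to handle the poles and the antimeridian, which this sketch ignores.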

We do that in a number of different ways. We do that through very careful key construction based on a number of different algorithms that are developed publicly and a couple that we developed internally. One of the big problems with partitioning location data is that location data has natural density problems. If you partition based on UTM or zip code or country or whatever, servers that are handling New York City are going to be overloaded. And servers that are handling South Dakota are going to be bored out of their mind. We basically have answered that in a couple of different layers. And then we also have some basic in-process stuff that we do before serving up the request.

James Turner: It almost sounds like you're applying a cell philosophy where when you're out in the middle of South Dakota, you have 50 mile cells. And when you're in New York City, you have two-foot cells.

Joe Stump: That's more of a typical R-tree approach, and we actually didn't use that approach. There are some other things that we are looking at to potentially work with that. But instead, the way that Cassandra works, the internal data management within Cassandra is just baked in. You can plug in your own partitioner, for instance, and we built custom partitioners that use various algorithms to help us avoid having to do those other things that you're talking about.

James Turner: Obviously if you precomputed everything that very quickly becomes an N squared problem. So I assume that you're doing some degree of granularity where you reduce the number of total points there?

Joe Stump: We started out with what we considered to be the most basic widely-needed use case for developers, which is simply "my user's here; tell me about points of data that are within a certain radius of where my user's sitting." And we've been slowly growing indexes from there.

Our most basic index is that my user is at this lat/long, tell me about any point that I've told you about previously, within one kilometer. That's a very basic query. We do some precomputing on the write to help with those types of queries. The next level up is that my user was here a week ago and would like to see stuff that was produced or put into the system a week ago. So we've, again, done some precomputing on the indexes to do temporal spatial queries.
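One way to picture a temporal spatial index is a key that combines a spatial bucket with a time bucket, computed when the point is written. The sketch below assumes a 0.01-degree grid and one-day buckets; both values and every name are hypothetical, and the interview does not describe how SimpleGeo actually builds these keys.

```python
import math
from collections import defaultdict
from datetime import datetime, timezone

CELL_DEG = 0.01         # assumed spatial bucket size in degrees
BUCKET_SECONDS = 86400  # assumed temporal bucket size: one UTC day

st_index = defaultdict(list)  # (cell_x, cell_y, day_bucket) -> point ids


def st_key(lat, lon, when):
    """Compose a space-time key from a coordinate and a timezone-aware datetime."""
    day = int(when.timestamp()) // BUCKET_SECONDS
    return (math.floor(lat / CELL_DEG), math.floor(lon / CELL_DEG), day)


def write_event(point_id, lat, lon, when):
    """Precompute the space-time key on write."""
    st_index[st_key(lat, lon, when)].append(point_id)


def events_at(lat, lon, when):
    """Points written in the same cell on the same UTC day as `when`."""
    return list(st_index.get(st_key(lat, lon, when), []))
```

A query such as "my user was here a week ago" then reduces to computing one key for that place and day and reading a single index entry.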

A lot of users, once they see a storage thing in the cloud, they inevitably want to be able to do very open-ended abstract queries like "show me all pieces of data that are in this zip code where height is this and age is greater than this, ordered by some other thing." We push back on that. Basically, we're not a generic database in the cloud. We're there specifically to help you manage your geodata.

We've been adding other indexes where things make sense. One that we're working on now that's going to be coming out soon deals with people who say, "This is great, but without a reference point of lat/long, I'd like to know where all of the things that Joe has put into the universe, where those points live." So we're going to be doing a user ID index, and stuff like that.

James Turner: You mentioned GPS earlier, but location data comes in from a lot of different sources now. You have cell tower data. You have WIFI database data. You have GPS data of various resolutions. How much do you have to factor in the quality of the data when you're trying to figure these types of things out?

Joe Stump: It's actually interesting that you mention that, because we're working with Skyhook. Our philosophy is we will work with whatever quality you give us. So if you want to give us resolution that isn't very good, we will happily index that and return that to you without many issues. Our other thing, too, is that we're not in the business of doing triangulation and telling you where your users are on the device. That's a business that Skyhook is excelling at. So we're under the assumption that you already have lat/long and that you trust it to a certain degree already when you're storing it with us.

James Turner: It strikes me that, for instance, with Skyhook, they give you back a latitude and longitude as a final result but, for example, if they also had the IP data, they could probably tell you something like, "Oh yeah. He's in a Starbucks." Which could help you a lot more with the problem that if you're in an office building, you could be on any one of 100 floors. In terms of giving a contextual answer that this guy is in Starbucks, knowing that he's using a Starbucks IP tells you a lot. Are you thinking about that at all?

Joe Stump: Yep, absolutely. One of the phone calls I got off today was with a provider that gives business listings for the entire country. That's definitely something that we already have, a reverse geocoder for the US. We're actively pursuing other types, because, again, our entire business is exactly what you're saying: adding context to that lat/long. For instance, with the business listing database, they would, with a degree of certainty, be able to tell how close the users are, based on the lat/long we get from the phone to businesses, and make decent inferences from there. We already have census and zip code databases. The reverse geocoding can tell you that with a high degree of certainty, we can say that you're in this zip code; you're in this city and state. And then with a fairly decent degree of certainty, we can tell you what address you're at in the US.

With the business listing data, we return all of our result sets in order of proximity to the point that you've given us. So if you say I'm at this lat/long, we return points in ascending order from how close they are to that point. So if I tell you I'm at this lat/long and then you say well, show me business listings around there and the first response you get back is basically .1 meters from where you're at is a Starbucks, you can say with a pretty high degree of certainty that that person's sitting in Starbucks. The reverse geocoder works that way as well. You can say where is this lat/long and within a high degree of certainty, we can say that lat/long is in Boulder, Colorado. It's in 80302. And we think it's on the 1300 block of Walnut Street.

James Turner: I've talked to people from a number of different outfits that deal with place information, and one of the challenges is the colloquial ways that people want to refer to place, neighborhood names and also the hierarchy of place. How much are you thinking about that?

Joe Stump: A lot. That's actually one of our next major products. A lot of times, people refer to that as the cone. You start out with this little thing and you move out and get into big things. As a service, we'll be able to say you're in Dolores Park. Dolores Park is in the Mission District. The Mission District's in San Francisco. San Francisco's in California, and California's in the United States. That's absolutely on the horizon and something that we have on the roadmap and plan on working on within the next probably month.
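The containment chain in that example can be modeled as a simple parent lookup. This is only an illustrative sketch: the `PARENT` table is hand-written from the places named in the answer, and a real service would derive containment from geometry rather than from a hard-coded dictionary.

```python
# Hypothetical parent links, taken from the example in the interview.
PARENT = {
    "Dolores Park": "Mission District",
    "Mission District": "San Francisco",
    "San Francisco": "California",
    "California": "United States",
}


def containment_chain(place):
    """Walk parent links from the most specific place to the most general."""
    chain = [place]
    seen = {place}
    while chain[-1] in PARENT:
        parent = PARENT[chain[-1]]
        if parent in seen:  # guard against a cycle in bad data
            break
        chain.append(parent)
        seen.add(parent)
    return chain
```

Calling `containment_chain("Dolores Park")` walks outward from the park to the country, which is the "start small and move out" behavior described above.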

James Turner: I would imagine that that introduces another layer of complexity, just in data storage, because it's getting, again, into a permuted problem at that point.

Joe Stump: We've been finding it's interesting. I spent the last two-and-a-half years at Digg, two of which I spent as their lead architect. Before that as well, going all the way back to my first job, I focused mostly on scalability and large web infrastructure stuff. I often say that scaling equals, basically, specialization. What we've been finding is we have some base tech that works for a lot of great use cases. We're going to continue augmenting that and building that out. But we're also finding certain things where that may not work and what we're working on is building out other services and other back-end solutions that basically answer that. At this point, everything that we've built has had to have been highly custom, so yeah, it's been interesting.

James Turner: It seems like there are a couple of different people who are trying to vie to be the guy who gives you data about geography. Is there a little bit of a Tower of Babel effect here now? How much do you think duplication of effort is going on?

Joe Stump: You know, it's interesting. I personally haven't seen anybody that has come out and said, "We're actively indexing millions of points of data. We're also offering storage and we're giving tools to leverage that." I've seen a lot of fragmentation. Where SimpleGeo fits is, I really think, at the crossroads or the nexus of a lot of people that are trying to figure out this space. So ESRI's a perfect example. They have a lot of data. Their stack is enormous. They answer everything from logistics down to POI things, but they haven't figured out the whole cloud, web, infrastructure, turnkey approach. They definitely haven't had to worry about real time. How do you index every single tweet and every single Twitter photo without blowing up? With the data providers, there's been a couple of people that are coming out with APIs and stuff.

I think largely, things are up for grabs. I think one of the issues that I see is as people come out with their location APIs here, like NAVTEQ is coming out with an API, as a developer, in order to do location queries and whatnot, especially on the mobile device, I don't want to have to do five different queries, right? Those round trips could add up a lot when you're on high latency slow networks. So while I think that there's a lot of people that are orbiting the space and I know that there's a lot of people that are going to be coming out with location and geo offerings, I don't know. I think there are a lot of people that are still figuring out and trying to work their way into how they're going to use location and how they're going to expose it.

James Turner: You're doing a fair amount of work with non-relational databases. It seems that that became the hot girl at the dance over the last couple of years. I almost feel at this point like there's starting to be some pushback with people saying, "Wait. There are places where it makes sense to be relational, and there are places where it doesn't." Do you think that it's almost like an ideology now against relational databases?

Joe Stump: No, I think that this is in direct response to the relational database tool chain failing and failing catastrophically at performing in large scale, real-time environments. The simple fact is that creating a relational database that's spread across 50 servers is not an easy thing to build or manage. There's always some sort of app logic you have to build on top of it. And you have to manage the data and whatnot.

If you go into almost any high performance shop that's working at a pretty decent scale, you'll find that they avoid doing foreign key constraints because they'll slow down writes. They're partitioning their data across multiple servers, which drastically increases their overhead in managing data. So I think really why you're seeing an uptake in NoSQL is because people have tried that tool chain; it's not evolving. It basically falls over in real-time environments. So when you get into that situation where it becomes difficult to manage large datasets, it becomes difficult to partition large datasets, and then on top of it, the datasets are highly dynamic, I think that's exactly why you see things like Cassandra and HBase becoming used more and more.

That being said, if you have a highly static dataset that you're querying at a fairly low volume, relational databases are fine. Or if your query time isn't very sensitive, like, for instance, reporting on the back-end. A lot of what I see is people doing reports and stuff like that just because they don't really know their data model ahead of time. They're going ahead and using MySQL for that, because they don't care if it takes 15 seconds to return a report.

I bet if you track the popularity of social, you'll almost identically track the popularity of NoSQL. Web 1.0 was extremely easy to scale because there was a finite amount of content, highly cacheable, highly static. There was just this rush to put all of the brick and mortar stuff online. So scaling, for instance, an online catalog is not that difficult. You've got maybe 50,000 products. Put it in memcached. MySQL can probably handle those queries. You're doing maybe an order every second or something. It's high value and not a lot of volume. And then, of course, during Web 2.0, we had this bright idea to hand content creation over to the masses. If you draw a diagram and one circle is users and one circle is images, scaling out a whole bunch of users and scaling out a whole bunch of pictures is not too difficult, because you run into the same thing where I need to do a primary key look-up and then I need to cache in memcached because people don't edit their user's data that often. And once they upload a photo, they pretty much never change that.

The problem comes when you intersect those social objects. The intersection in that Venn diagram, which is a join in SQL, falls over pretty quickly. You don't need a very big dataset. Even for a fairly small website, MySQL tends to fall over pretty quickly on those big joins. And most of the NoSQL stuff that people are using, they've found that if you're just doing a primary key look up 99 percent of the time on some table, you don't need to use MySQL and worry about managing that, when you can stuff it into Cassandra or some other NoSQL DB because it's basically just a key value store.

James Turner: A lot of people have just been going into data warehousing and using big cubes, and essentially at that point, you've become non-relational anyway?

Joe Stump: Essentially, there are a lot of people out there that are "using MySQL", but they're using it in a very, very NoSQL manner. Like at Digg, for instance, joins were verboten, no foreign key constraints, primary key look-ups. If you had to do ranges, keep them highly optimized and basically do the joins in memory. And it was really amazing. For instance, we rewrote comments about a year-and-a-half ago, I guess, and we switched from doing the sorting, such as sort by least dugg or most dugg, from doing that on a MySQL front to doing it in PHP. We saw a 4,000 percent increase in performance on that operation.
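Doing the join and the sort in application memory, after plain primary key look-ups, might look like the sketch below. It is written in Python for illustration even though the interview describes PHP, and the field names (`user_id`, `diggs`, `name`) are assumptions rather than Digg's real schema.

```python
def comments_with_authors(comments, users_by_id, most_dugg_first=True):
    """Join comments to users in memory, then sort by digg count in the app.

    `comments` is a list of rows fetched by primary key or a simple range;
    `users_by_id` is a dict of user rows fetched by primary key.
    """
    joined = [
        {**c, "author": users_by_id[c["user_id"]]["name"]}
        for c in comments
        if c["user_id"] in users_by_id  # skip comments whose author row is missing
    ]
    return sorted(joined, key=lambda c: c["diggs"], reverse=most_dugg_first)
```

The database only ever serves key look-ups here; the join becomes a dictionary access and the ordering is a single in-process sort.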

There's just so many things where you have to basically use MySQL as a NoSQL store. What people have found is, rather than manage these cumbersome MySQL clusters, with NoSQL they get those basics down. With MySQL, if I wanted to spread data across five servers, I'd have to create an application layer that knows that every user that starts with J goes to this server; every user that starts with V goes over here. Whereas with Cassandra, I just give it token ranges and it manages all of that stuff internally. And with MySQL, I'd have to handle something like, "Oh, crap. What if that server goes down?" Well, now I'm going to have three copies across those five servers. So I build more application logic. And then I build application logic to migrate things. Like what happens if I have three Facebook users that use up a disproportionate amount of resources? "Well, I've got to move this whale off this MySQL server because the load's really high. And if I moved him off of it, it would be nothing." And then you have to make management tools for migrating data and whatnot.
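The hand-rolled application layer described here can be sketched in a few lines. The server names, the first-letter routing rule, and the replica placement are all hypothetical; the point is only that every one of these decisions lives in application code that someone has to write and maintain.

```python
SHARDS = ["db1", "db2", "db3", "db4", "db5"]  # hypothetical MySQL server names
REPLICAS = 3  # "three copies across those five servers"


def shard_for(username):
    """Route by the first letter of the username (assumes a non-empty name)."""
    return ord(username[0].lower()) % len(SHARDS)


def replica_servers(username):
    """Primary shard plus the next servers in list order, wrapping around."""
    start = shard_for(username)
    return [SHARDS[(start + i) % len(SHARDS)] for i in range(REPLICAS)]


def live_servers(username, down=frozenset()):
    """Failover logic the application must also own: skip servers that are down."""
    return [s for s in replica_servers(username) if s not in down]
```

Moving a single heavy user to a different server would require yet another override table and a migration tool, none of which this sketch attempts.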

Cassandra has all of that stuff built in. All of the annoying things that we had at the beginning, NoSQL has gotten those things down really, really well. The whole data management thing, for instance: coming from having been in a shop that has partitioned MySQL, Cassandra is a godsend, because I just insert stuff and it handles all of that stuff on the back-end. With Cassandra, you give token ranges and it hashes the key to a token and then you say, "I want three copies of this data." And Cassandra handles all of that internally and it has this gossip protocol. So if you push a new server into the ring, that new server comes in and everybody's like, "Hey, look, there's a new kid on the block. Well, here. Here's some tokens." And it all rebalances itself.
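The token ring idea can be shown with a small consistent-hashing sketch. This is a simplified model, not Cassandra's implementation: each node gets one token derived from its name, SHA-256 is an arbitrary hash choice, the `TokenRing` class and its methods are hypothetical, and the gossip protocol and data streaming are not modeled.

```python
import bisect
import hashlib


class TokenRing:
    """Toy token ring: keys hash to tokens, and replicas are the next nodes clockwise."""

    def __init__(self, replicas=3):
        self.replicas = replicas
        self.tokens = []  # sorted list of node tokens
        self.owners = {}  # token -> node name

    @staticmethod
    def _token(value):
        return int(hashlib.sha256(value.encode("utf-8")).hexdigest(), 16)

    def add_node(self, name):
        token = self._token(name)
        bisect.insort(self.tokens, token)
        self.owners[token] = name

    def remove_node(self, name):
        token = self._token(name)
        self.tokens.remove(token)
        del self.owners[token]

    def nodes_for(self, key):
        """Walk clockwise from the key's token and collect the replica nodes."""
        if not self.tokens:
            return []
        start = bisect.bisect_left(self.tokens, self._token(key))
        count = min(self.replicas, len(self.tokens))
        return [self.owners[self.tokens[(start + i) % len(self.tokens)]]
                for i in range(count)]
```

When a node is removed, keys it held fall through to the next nodes on the ring, so only that node's neighbors pick up new data instead of every key being reshuffled.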

So from a data management side, when you're dealing with billions of photos and millions of users, and even if you're not, just the simple fact like, "Well, crap, did I bring this MySQL server down? What do I do?" With Cassandra, you don't do anything. You just shut it off. And Cassandra does all of that remapping and rerouting internally. With MySQL, you have to build tools or buy tools which is, for startups, even worse. Most people don't need complex joins and sorts and all of that other stuff. If they do a little bit of thinking beforehand and do that computation on the write rather than the read, this stuff actually, I think, is a much better approach.