Joe Stump on data, APIs, and why location is up for grabs

The SimpleGeo CTO and former Digg architect discusses NoSQL and location's future

I recently had a long conversation with Joe Stump, CTO of SimpleGeo, about location, geodata, and the NoSQL movement. Stump, who was formerly lead architect at Digg, had a lot to say. Highlights are posted below. You can find a transcript of the full interview here.

Competition in the geodata industry:

I personally haven’t seen anybody that has come out and said, “We’re actively indexing millions of points of data. We’re also offering storage and we’re giving tools to leverage that.” I’ve seen a lot of fragmentation. Where SimpleGeo fits, I really think, is at the crossroads or the nexus of a lot of people that are trying to figure out this space. ESRI is a perfect example. They have a lot of data. Their stack is enormous. They answer everything from logistics down to POI things, but they haven’t figured out the whole cloud, web, infrastructure, turn-key approach. They definitely haven’t had to worry about real time: how do you index every single tweet and every single Twitter photo without blowing up? Among the data providers, there are a couple of people coming out with APIs and such.

I think largely, things are up for grabs. One of the issues I see is, as people come out with their location APIs — NAVTEQ is coming out with an API, for example — as a developer, in order to do location queries and whatnot, especially on a mobile device, I don’t want to have to do five different queries, right? Those round trips can add up a lot when you’re on high-latency, slow networks. So while I think there are a lot of people orbiting the space, and I know a lot of people are going to be coming out with location and geo offerings, a lot of people are still figuring out how they’re going to use location and how they’re going to expose it.

How SimpleGeo stores location data:


The way that we’ve gone about doing that is we actually have two clusters of databases. We’ve essentially taken Cassandra, the non-relational NoSQL store that Facebook made, and we’ve built geospatial features into it. The way that we do that is we actually have an index cluster and then we have a records cluster. The records cluster is very mundane; it’s used for getting the actual data. The index cluster, however, we have tackled in a number of different ways. But the general idea is that those difficult computations that you’re talking about are done on write rather than read. So we precompute a number of scenarios and add those to different indexes. And then based on the query, we use the most appropriate precomputed index to answer those questions.

We do that in a number of different ways. We do that through very careful key construction based on a number of different algorithms that are developed publicly and a couple that we developed internally. One of the big problems with partitioning location data is that location data has natural density problems. If you partition based on UTM or zip code or country or whatever, servers that are handling New York City are going to be overloaded. And servers that are handling South Dakota are going to be bored out of their minds. We basically have answered that in a couple of different layers. And then we also have some in-process stuff that we do before serving up the request.
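One publicly developed algorithm of the kind Stump alludes to is the geohash; whether SimpleGeo's key construction actually uses geohashes is an assumption, but the sketch below shows the general idea: interleaving latitude and longitude bits produces string keys where nearby points share prefixes, so a prefix range scan over an ordered store pulls back spatially close records.

```python
# Minimal geohash encoder (illustrative; not SimpleGeo's actual scheme).
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash(lat, lon, precision=6):
    """Interleave longitude/latitude bisection bits into a base-32 string."""
    lat_range, lon_range = [-90.0, 90.0], [-180.0, 180.0]
    hashed, bits, ch, even = [], 0, 0, True
    while len(hashed) < precision:
        rng, val = (lon_range, lon) if even else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if val >= mid:
            ch = (ch << 1) | 1  # point is in the upper half
            rng[0] = mid
        else:
            ch = ch << 1        # point is in the lower half
            rng[1] = mid
        even = not even
        bits += 1
        if bits == 5:           # 5 bits per base-32 character
            hashed.append(BASE32[ch])
            bits, ch = 0, 0
    return "".join(hashed)

# Two points a few hundred meters apart share a long common prefix,
# so they land near each other in key order.
print(geohash(40.7580, -73.9855))  # near Times Square
print(geohash(40.7587, -73.9847))
```

Because the keys sort spatially, a "points near X" query becomes a range scan over one key prefix rather than a full scan, which is exactly the kind of work that can be pushed to write time.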

The virtues of keeping your data API simple:

We started out with what we considered to be the most basic, widely needed use case for developers, which is simply “my user is here; tell me about points of data within a certain radius of where my user is sitting.” And we’ve been slowly growing indexes from there.

Our most basic index is: my user is at this lat/long; tell me about any point that I’ve told you about previously within one kilometer. That’s a very basic query. We do some precomputing on the write to help with those types of queries. The next level up is: my user was here a week ago and would like to see stuff that was produced or put into the system a week ago. So we’ve, again, done some precomputing on the indexes to do temporal-spatial queries.

A lot of users, once they see a storage thing in the cloud, they inevitably want to be able to do very open-ended abstract queries like “show me all pieces of data that are in this zip code where height is this and age is greater than this, ordered by some other thing.” We push back on that. Basically, we’re not a generic database in the cloud. We’re there specifically to help you manage your geodata.

Why NoSQL is gaining in popularity:

I think that this is in direct response to the relational database tool chain failing, and failing catastrophically, at performing in large-scale, real-time environments. The simple fact is that a relational database spread across 50 servers is not an easy thing to build or manage. There’s always some sort of app logic you have to build on top of it. And you have to manage the data and whatnot.

If you go into almost any high-performance shop that’s working at a pretty decent scale, you’ll find that they avoid foreign key constraints because they slow down writes. They’re partitioning their data across multiple servers, which drastically increases the overhead of managing their data. So I think the reason you’re seeing an uptick in NoSQL is that people have tried that tool chain and it’s not evolving. It basically falls over in real-time environments.

The role of social networking in the demise of SQL:

I bet if you track the popularity of social, you’ll almost identically track the popularity of NoSQL. Web 1.0 was extremely easy to scale because there was a finite amount of content, highly cacheable, highly static. There was just this rush to put all of the brick-and-mortar stuff online. So scaling, for instance, an online catalog is not that difficult. You’ve got maybe 50,000 products. Put it in memcached. MySQL can probably handle those queries. You’re doing maybe an order every second or something. It’s high value and not a lot of volume. And then, of course, during Web 2.0, we had this bright idea to hand content creation over to the masses. If you draw a diagram and one circle is users and one circle is images, scaling out a whole bunch of users and scaling out a whole bunch of pictures is not too difficult, because it’s the same pattern: I do a primary key look-up and then cache it in memcached, because people don’t edit their user data that often. And once they upload a photo, they pretty much never change it.
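The "put it in memcached" approach Stump mentions is the cache-aside pattern. A minimal sketch, with a plain dict standing in for memcached and a stub function standing in for the MySQL primary-key lookup (both are stand-ins, not real client code):

```python
cache = {}  # stand-in for a memcached client

def fetch_user_from_db(user_id):
    # Stand-in for: SELECT * FROM users WHERE id = %s
    return {"id": user_id, "name": f"user-{user_id}"}

def get_user(user_id):
    key = f"user:{user_id}"
    if key in cache:                      # cache hit: no database round trip
        return cache[key]
    row = fetch_user_from_db(user_id)     # cache miss: one PK lookup
    cache[key] = row                      # profiles change rarely, so they cache well
    return row

print(get_user(42))
print(get_user(42))  # second call served from the cache
```

This works precisely because Web 1.0-style objects are read-heavy and rarely edited; the cache invalidation burden stays low.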

The problem comes when you intersect those social objects. The intersection in that Venn diagram, which is a join in SQL, falls over pretty quickly. You don’t need a very big dataset. Even for a fairly small website, MySQL tends to fall over pretty quickly on those big joins. And most of the NoSQL stuff that people are using, they’ve found that if you’re just doing a primary key look-up 99 percent of the time on some table, you don’t need to use MySQL and worry about managing that, when you can stuff it into Cassandra or some other NoSQL DB because it’s basically just a key-value store.
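The alternative to that join is to materialize it at write time. A minimal sketch with a hypothetical user/photo schema and plain dicts standing in for key-value column families (the schema and names are invented for illustration):

```python
# Denormalize the user<->photo "join" at write time so reads are pure
# key lookups -- the access pattern a key-value store is built for.
users = {}        # user_id -> profile row
photos = {}       # photo_id -> photo row
user_photos = {}  # user_id -> list of photo_ids (the materialized join)

def add_photo(user_id, photo_id, url):
    photos[photo_id] = {"url": url, "owner": user_id}
    user_photos.setdefault(user_id, []).append(photo_id)  # the extra write

def photos_for(user_id):
    # Read side: two key lookups, no join.
    return [photos[pid] for pid in user_photos.get(user_id, [])]

users["u1"] = {"name": "alice"}
add_photo("u1", "p1", "http://example.com/p1.jpg")
print(photos_for("u1"))
```

The trade is explicit: one more write per photo in exchange for reads that never touch a join, which is the 99-percent case Stump describes.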

How NoSQL simplifies data administration:

Essentially, there are a lot of people out there that are “using MySQL,” but they’re using it in a very, very NoSQL manner. At Digg, for instance, joins were verboten: no foreign key constraints, only primary key look-ups. If you had to do ranges, you kept them highly optimized and basically did the joins in memory. And it was really amazing. For instance, we rewrote comments about a year and a half ago, and we switched from doing the sorting in MySQL to doing it in PHP. We saw a 4,000 percent increase in performance on that operation.
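That rewrite pattern — fetch rows by primary key, then sort in application memory — can be sketched as follows (in Python rather than Digg's PHP, with an invented comment schema):

```python
# Primary-key store: comment_id -> row (stand-in for a MySQL table
# accessed only by PK, per the "no joins, no ORDER BY" discipline).
comments = {
    1: {"story": 7, "score": 12, "body": "first"},
    2: {"story": 7, "score": 40, "body": "second"},
    3: {"story": 7, "score": 3,  "body": "third"},
}
story_comments = {7: [1, 2, 3]}  # story_id -> comment ids (the "range")

def top_comments(story_id):
    # One cheap id-list fetch, then N primary-key look-ups...
    rows = [comments[cid] for cid in story_comments[story_id]]
    # ...and the sort happens in the application, not the database.
    return sorted(rows, key=lambda r: r["score"], reverse=True)

print([c["body"] for c in top_comments(7)])  # ordered by score, highest first
```

The database is left doing only what it is fastest at (PK fetches), while the CPU-bound sort moves to app servers, which scale horizontally far more easily than a write-master database.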

There’s just so many things where you have to basically use MySQL as a NoSQL store. What people have found is, rather than manage these cumbersome MySQL clusters, with NoSQL they get those basics down. With MySQL, if I wanted to spread data across five servers, I’d have to create an application layer that knows that every user that starts with “J” goes to this server; every user that starts with “V” goes over here. Whereas with Cassandra, I just give it token ranges and it manages all of that stuff internally. And with MySQL, I’d have to handle something like, “Oh, crap. What if that server goes down?” Well, now I’m going to have three copies across those five servers. So I build more application logic. And then I build application logic to migrate things. Like what happens if I have three Facebook users that use up a disproportionate amount of resources? Well, I’ve got to move this whale off this MySQL server because the load’s really high. And if I moved him off of it, it would be nothing. And then you have to make management tools for migrating data and whatnot.
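The contrast Stump draws — per-letter routing logic in the application versus Cassandra's token ranges — can be sketched with a toy hash ring. Each node owns a contiguous slice of the token space and a key is routed by hashing it onto the ring; this is a simplified model of what Cassandra manages internally, and the node names and ring layout are invented:

```python
import bisect
import hashlib

RING_SIZE = 2 ** 32
# Each node owns the token range starting at its entry in `tokens`.
tokens = [0, RING_SIZE // 4, RING_SIZE // 2, 3 * RING_SIZE // 4]
nodes = ["node-a", "node-b", "node-c", "node-d"]

def node_for(key):
    """Hash the key onto the ring and find the node owning that range."""
    token = int(hashlib.md5(key.encode()).hexdigest(), 16) % RING_SIZE
    return nodes[bisect.bisect_right(tokens, token) - 1]

# Hashing spreads alphabetical hot spots (all the J-users, the whales)
# evenly across nodes, with no routing logic in the application.
for user in ["jsmith", "jdoe", "vlad"]:
    print(user, "->", node_for(user))
```

Rebalancing in this model means adjusting token ranges, not rewriting application code, which is exactly the operational burden Stump says Cassandra lifts.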

Cassandra has all of that stuff built in. All of the annoying things that we had to deal with at the beginning, NoSQL handles really, really well. Coming from a shop that has partitioned MySQL, Cassandra is a godsend.



  • ian

    Thanks for posting, James, and I suspect Brady feels you needling into his turf ;)

    What SimpleGeo has done is great all around, but let’s be clear that they aren’t the only ones thinking about the challenge of creating highly performant geodata stores — obviously Facebook, Twitter, the mobile/social crowd and many others have an immediate (or near-term) need for scalable geoinfrastructure. It’s clear the big guys will want to develop/own this stack, and there’s plenty of opportunity for SimpleGeo and others to support the needs of small, fast-growing orgs who need the read/write. However, it’s important to understand what market segments play to NoSQL.

    At the risk of sounding like a Luddite, there are companies/industries that may not require this scale — today. Certainly in mobile/social and consumer-facing web publishing there’s a compelling need to have write on steroids, but across enterprise segments and others, I will argue to my grave that an expressive API is more valuable than a simple one.

    Data, the new iceberg of IT, also needs to be heard. At Urban Mapping, we’ve sourced tens of thousands of variables, from obesity rates to noise levels around airport runways, voter registration, floodplains, per capita income and a boatload more. Our hosted geoservices platform (Mapfluence) makes these data sets available via a common query language to support visualization and data queries. The cost of in-licensing data (researching, sourcing, ETLing, maintaining) requires time on task. It’s a curated process, partially editorial and partially technical.

    The good news is that Urban Mapping, SimpleGeo and many others are biting away at traditional GIS from the outside (i.e., the web) and moving in. This bodes well for startups, with lots to gain, and poses challenges for entrenched players, who have much at risk.

  • K.S. Bhaskar

    How does SimpleGeo relate to Open Street Map (http://www.openstreetmap.org) and the OSM XAPI (http://wiki.openstreetmap.org/wiki/Xapi)?

  • Kevin Bedell

    Having struggled with MySQL and TB+ databases, this post really struck a chord with me. The limitations of SQL-based databases as they scale to be very large (no joins, only primary key fetches, no ranged queries) are very real to me — I spent hours discovering and working around these limitations.

    So this approach looks very good to me for *very large* data applications.

    I’m not sure yet how to think about applying these tools to smaller problems — I’d be interested to know where/why these tools fall down for smaller problems (if they do).

    I’d also be interested in understanding more about what this means for data/object modeling. If I’m not doing traditional database design under my object models, what *am* I doing?

    Nice article — well done!

  • Anand Venkataraman

    Loved this article and resonated strongly with it. At my previous company, we had the good fortune of extrapolating to foresee a potential scalability issue with a sharded MySQL database exactly like Joe talks about. We migrated all our critical queries to an in-memory distributed hash solution and immediately noticed a performance lift not unlike the one that Joe found at Digg.

    Clearly Cassandra is awesome, and your article equally so!


  • Ben Engber

    Great article that really pinpoints the critical limitations relational databases have in serving modern internet applications. I would argue that “using MySQL in a NoSQL manner” is actually a very good thing.

    At Thumbtack, when we advise clients on how to scale out their systems we always run into the reality that it is very hard to move a company away from a trusted and proven component (i.e. RDBMS) towards what is often viewed as the latest fad. If we instead take the approach of moving applications to use a traditional datastore like MySQL in a NoSQL compatible way, we create an architecture that scales well and also opens the door for a more transparent migration to a full NoSQL solution later.

    You’re absolutely right in that this puts more burden on the application. In essence, you provide the lowest common denominator of NoSQL services and force the application to work around that limited functionality. On the other hand, it allows you to tackle development in a safe, phased approach, and is a very practical way to get organizations started down the right path.
