Joe Stump on data, APIs, and why location is up for grabs

The SimpleGeo CTO and former Digg architect discusses NoSQL and location's future

I recently had a long conversation with Joe Stump, CTO of SimpleGeo, about location, geodata, and the NoSQL movement. Stump, who was formerly lead architect at Digg, had a lot to say. Highlights are posted below. You can find a transcript of the full interview here.

Competition in the geodata industry:

I personally haven't seen anybody that has come out and said, "We're actively indexing millions of points of data. We're also offering storage and we're giving tools to leverage that." I've seen a lot of fragmentation.

Where SimpleGeo fits is, I really think, at the crossroads or the nexus of a lot of people that are trying to figure out this space. So ESRI is a perfect example. They have a lot of data. Their stack is enormous. They answer everything from logistics down to POI things, but they haven't figured out the whole cloud, web, infrastructure, turn-key approach. They definitely haven't had to worry about real time. How do you index every single tweet and every single Twitter photo without blowing up? With the data providers, there's been a couple of people that are coming out with APIs and stuff.

How SimpleGeo stores location data:

The way that we've gone about doing that is we actually have two clusters of databases. We've essentially taken Cassandra, the NoSQL data store that Facebook made, and we've built geospatial features into it. The way that we do that is we actually have an index cluster and then we have a records cluster. The records cluster is very mundane; it's used for getting the actual data. The index cluster, however, we have tackled that in a number of different ways. But the general idea is that those difficult computations that you're talking about are done on write rather than read. So we precompute a number of scenarios and add those to different indexes.
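To make the "compute on write rather than read" idea concrete, here is a minimal sketch. It is not SimpleGeo's implementation and does not use Cassandra: plain Python dicts stand in for the records cluster and the index cluster, and the grid-cell scheme, the cell sizes, and the names `write`, `nearby`, and `CELL_SIZES` are all assumptions made for illustration. Each write files the record under several precomputed grids, and a radius query picks whichever grid best fits the radius.

```python
import math
from collections import defaultdict

# Hypothetical cell sizes in degrees; each one is a separate precomputed index.
CELL_SIZES = (1.0, 0.1, 0.01)

records = {}  # stand-in for the records cluster: id -> (lat, lon, payload)
indexes = {size: defaultdict(set) for size in CELL_SIZES}  # stand-in for the index cluster


def cell(lat, lon, size):
    """Map a coordinate to the integer grid cell that contains it."""
    return (math.floor(lat / size), math.floor(lon / size))


def write(record_id, lat, lon, payload):
    """Store the record and do the indexing work up front, at write time."""
    records[record_id] = (lat, lon, payload)
    for size in CELL_SIZES:
        indexes[size][cell(lat, lon, size)].add(record_id)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres on a sphere of radius 6371 km."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def nearby(lat, lon, radius_km):
    """Return sorted ids within radius_km, reading only a precomputed index.

    Rough bounding box only: ignores the poles and the antimeridian.
    """
    lat_deg = radius_km / 111.0  # one degree of latitude is roughly 111 km
    edge_lat = min(abs(lat) + lat_deg, 89.0)
    lon_deg = lat_deg / math.cos(math.radians(edge_lat))
    # Pick the finest grid whose cells are at least as tall as the radius,
    # falling back to the coarsest grid for very large radii.
    size = next((s for s in sorted(CELL_SIZES) if s >= lat_deg), max(CELL_SIZES))
    lo_row, lo_col = cell(lat - lat_deg, lon - lon_deg, size)
    hi_row, hi_col = cell(lat + lat_deg, lon + lon_deg, size)
    hits = []
    for row in range(lo_row, hi_row + 1):
        for col in range(lo_col, hi_col + 1):
            for rid in indexes[size].get((row, col), ()):
                rlat, rlon, _ = records[rid]
                if haversine_km(lat, lon, rlat, rlon) <= radius_km:
                    hits.append(rid)
    return sorted(hits)


# Illustrative data only.
write("ferry", 37.7955, -122.3937, "Ferry Building")
write("oakland", 37.8044, -122.2712, "Oakland")
write("la", 34.0522, -118.2437, "Los Angeles")
print(nearby(37.7749, -122.4194, 5))
```

The trade-off in this sketch is that every write touches several indexes, so writes cost more and storage grows, in exchange for reads that only scan a handful of cells.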
And then, based on the query, we use the most appropriate precomputed index to answer those questions.

The virtues of keeping your data API simple:

We started out with what we considered to be the most basic, widely needed use case for developers, which is simply, "My user's here. Tell me about points of data that are within a certain radius of where my user's sitting." And we've been slowly growing indexes from there.

Why NoSQL is gaining in popularity:

I think that this is in direct response to the relational database tool chain failing, and failing catastrophically, at performing in large-scale, real-time environments. The simple fact is that creating a relational database that's spread across 50 servers is not an easy thing to build or manage. There's always some sort of app logic you have to build on top of it. And you have to manage the data and whatnot.

The role of social networking in the demise of SQL:

I bet if you track the popularity of social, you'll almost identically track the popularity of NoSQL. Web 1.0 was extremely easy to scale because there was a finite amount of content, highly cacheable, highly static. There was just this rush to put all of the brick and mortar stuff online. So scaling, for instance, an online catalog is not that difficult. You've got maybe 50,000 products. Put it in memcached. MySQL can probably handle those queries. You're doing maybe an order every second or something. It's high value and not a lot of volume.

And then, of course, during Web 2.0, we had this bright idea to hand content creation over to the masses. If you draw a diagram and one circle is users and one circle is images, scaling out a whole bunch of users and scaling out a whole bunch of pictures is not too difficult, because you run into the same thing where I need to do a primary key look-up and then I need to cache in memcached because people don't edit their user data that often. And once they upload a photo, they pretty much never change that.
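The "primary key look-up, then cache" pattern described above can be sketched as follows. This is an illustration under stated assumptions, not code from Digg or SimpleGeo: an in-memory SQLite database stands in for MySQL, a plain dict stands in for memcached, and the `users` table and the helpers `get_user` and `rename_user` are hypothetical.

```python
import sqlite3

db = sqlite3.connect(":memory:")  # stand-in for MySQL
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
db.execute("INSERT INTO users VALUES (1, 'alice')")

cache = {}                # stand-in for memcached
stats = {"db_reads": 0}   # counts how often we actually hit the database


def get_user(user_id):
    """Read through the cache; fall back to a primary-key lookup on a miss."""
    key = f"user:{user_id}"
    if key in cache:
        return cache[key]
    stats["db_reads"] += 1
    row = db.execute(
        "SELECT id, name FROM users WHERE id = ?", (user_id,)
    ).fetchone()
    if row is not None:
        cache[key] = row  # rarely edited data stays cached for a long time
    return row


def rename_user(user_id, name):
    """Write to the database, then drop the stale cache entry."""
    db.execute("UPDATE users SET name = ? WHERE id = ?", (name, user_id))
    cache.pop(f"user:{user_id}", None)


get_user(1)   # miss: one database read
get_user(1)   # hit: served from the cache
print(stats["db_reads"])
```

Because user rows and uploaded photos change so rarely, almost every read after the first is a cache hit, which is why this kind of data is comparatively easy to scale.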
How NoSQL simplifies data administration:

Essentially, there are a lot of people out there that are "using MySQL," but they're using it in a very, very NoSQL manner. Like at Digg, for instance, joins were verboten, no foreign key constraints, primary key look-ups. If you had to do ranges, keep them highly optimized and basically do the joins in memory. And it was really amazing. For instance, we rewrote comments about a year and a half ago, and we switched from doing the sorting on the MySQL front to doing it in PHP. We saw a 4,000 percent increase in performance on that operation.
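As a rough illustration of "do the joins in memory" and sorting in application code, here is a small sketch. Digg's real code was PHP against MySQL and is not shown in the interview; this version uses Python with an in-memory SQLite database as a stand-in, and the schema and the `comments_for_story` helper are hypothetical. It makes no claim about the performance numbers quoted above.

```python
import sqlite3

db = sqlite3.connect(":memory:")  # stand-in for MySQL
db.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE comments (
        id INTEGER PRIMARY KEY,
        story_id INTEGER,
        user_id INTEGER,
        score INTEGER,
        body TEXT
    );
    CREATE INDEX comments_by_story ON comments (story_id);
    INSERT INTO users VALUES (1, 'ann'), (2, 'bo');
    INSERT INTO comments VALUES
        (10, 7, 1, 3, 'first'),
        (11, 7, 2, 9, 'second'),
        (12, 7, 1, 5, 'third');
""")


def comments_for_story(story_id):
    """Fetch comments and authors separately, then join and sort in memory."""
    # Step 1: one indexed fetch, with no JOIN and no ORDER BY.
    rows = db.execute(
        "SELECT id, user_id, score, body FROM comments WHERE story_id = ?",
        (story_id,),
    ).fetchall()
    if not rows:
        return []
    # Step 2: primary-key lookups for just the authors we need.
    user_ids = sorted({user_id for _, user_id, _, _ in rows})
    marks = ",".join("?" * len(user_ids))
    users = dict(
        db.execute(
            f"SELECT id, name FROM users WHERE id IN ({marks})", user_ids
        ).fetchall()
    )
    # Step 3: the join happens in application memory.
    joined = [
        {"id": cid, "author": users.get(uid), "score": score, "body": body}
        for cid, uid, score, body in rows
    ]
    # Step 4: the sort also happens in the application, not the database.
    joined.sort(key=lambda c: c["score"], reverse=True)
    return joined


for comment in comments_for_story(7):
    print(comment["score"], comment["author"], comment["body"])
```

The database is reduced to simple key-based fetches that are easy to shard and cache, while the application takes on the join and sort logic.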
Comments: 6
ian [23 March 2010 11:04 AM]
Thanks for posting, James, and I suspect Brady feels you needling into his turf ;)
What SimpleGeo has done is great all around, but let's be clear that they aren't the only ones thinking about the challenge of creating highly performant geodata stores. Obviously Facebook, Twitter, the mobile/social crowd and many others have an immediate (or near-term) need for scalable geoinfrastructure. It's clear the big guys will want to develop/own this stack, and there's plenty of opportunity for SimpleGeo and others to support the needs of small/fast-growing orgs who need the read/write. However, it's important to understand what market segments play to NoSQL.
At the risk of sounding like a Luddite, there are companies/industries that may not require this scale today. Certainly in mobile/social and consumer-facing web publishers there's a compelling need to have write on steroids, but across enterprise segments and others, I will argue to my grave that an expressive API is more valuable than a simple one.
Data, the new iceberg of IT, also needs to be heard. At Urban Mapping, we've sourced tens of thousands of variables, from obesity rates to noise levels around airport runways, voter registration, floodplains, per capita income and a boatload more. Our hosted geoservices platform (Mapfluence) makes these data sets available via a common query language to support visualization and data queries. The cost of in-licensing data (researching, sourcing, ETLing, maintaining) requires time on task. It's a curated process, partially editorial and partially technical.
The good news is Urban Mapping, SimpleGeo and many others are biting away at traditional GIS from the outside (i.e., the web) and moving in. This bodes well for startups, with lots to gain, and challenges for entrenched players, who have much at risk.
K.S. Bhaskar [23 March 2010 12:11 PM]
How does SimpleGeo relate to Open Street Map (http://www.openstreetmap.org) and the OSM XAPI (http://wiki.openstreetmap.org/wiki/Xapi)?
Kevin Bedell [23 March 2010 02:24 PM]
Having struggled with MySQL and TB+ databases, this post really struck a chord with me. The limitations of SQL-based databases as they scale to be very large (no joins, only primary key fetches, no ranged queries) are very real to me -- I spent hours discovering and working around these limitations.
So this approach looks very good to me for *very large* data applications.
I'm not sure yet how to think about applying these tools to smaller problems -- I'd be interested to know where/why these tools fall down for smaller problems (if they do).
I'd also be interested in understanding more about what this means for data/object modeling. If I'm not doing traditional database design under my object models, what *am* I doing?
Nice article -- well done!
Anand Venkataraman [23 March 2010 07:49 PM]
Loved this article and resonated strongly with it. At my previous company, we had the good fortune of extrapolating to foresee a potential scalability issue with a sharded MySQL database exactly like Joe talks about. We migrated all our critical queries to an in-memory distributed Hash solution and immediately noticed a performance lift not unlike the one that Joe found at Digg.
Clearly Cassandra is awesome, and your article equally so!
Ben Engber [24 March 2010 10:45 AM]
Great article that really pinpoints the critical limitations relational databases have in serving modern internet applications. I would argue that "using MySQL in a NoSQL manner" is actually a very good thing.
At Thumbtack, when we advise clients on how to scale out their systems we always run into the reality that it is very hard to move a company away from a trusted and proven component (i.e. RDBMS) towards what is often viewed as the latest fad. If we instead take the approach of moving applications to use a traditional datastore like MySQL in a NoSQL compatible way, we create an architecture that scales well and also opens the door for a more transparent migration to a full NoSQL solution later.
You're absolutely right in that this puts more burden on the application. In essence, you provide the lowest common denominator of NoSQL services, and force the application to work around that limited functionality. On the other hand, it allows you to tackle development in a safe, phased approach, and is a very practical way to get organizations started down the right path.
epicsystems [28 March 2010 07:31 AM]
Dear Sir,
I have the pleasure of briefing you on our Data Visualization software
"Trend Compass".
TC is a new concept in viewing statistics and trends in an animated
way by displaying 5 axes (X, Y, Time, Bubble size & Bubble color)
instead of just the traditional X and Y axes. It could be used in
analysis, research, presentation etc. In the banking sector, we have
Deutsche Bank New York as our client.
This is a link to weather data:
http://www.epicsyst.com/test/v2/aims/
This is a bank link to compare Deposits, Withdrawals and numbers of
Customers for different branches over time ( all in 1 Chart) :
http://www.epicsyst.com/test/v2/bank-trx/
Misc Examples :
http://www.epicsyst.com/test/v2/airline/
http://www.epicsyst.com/test/v2/stockmarket1/
http://www.epicsyst.com/test/v2/tax/
http://www.epicsyst.com/test/v2/football/
http://www.epicsyst.com/test/v2/swinefludaily/
http://www.epicsyst.com/test/v2/flu/
http://www.epicsyst.com/test/v2/babyboomers/
http://www.epicsyst.com/test/v2/advertising/
This is a project we did with Princeton University on US unemployment :
http://www.epicsyst.com/main3.swf
A 3-minute video presentation of the above by Professor Alan Krueger,
Bendheim Professor of Economics and Public Affairs at Princeton
University and currently Chief Economist at the US Treasury using
Trend Compass :
http://epicsyst.com/trendcompass/princeton.aspx?home=1
Latest financial links on the Central Bank of Egypt:
http://www.epicsyst.com/trendcompass/samples/Aggregate-balance-sheet/
http://www.epicsyst.com/trendcompass/samples/balance-sheet
http://www.epicsyst.com/trendcompass/samples/banks-deposits-by-maturity/
http://www.epicsyst.com/trendcompass/samples/egyptian-banks/
http://www.epicsyst.com/trendcompass/samples/currency-by-denomination/
I hope you could evaluate it and give me your comments. So many ideas
are there.
You can download a trial version. It has a feature to export
EXE,PPS,HTML and AVI files. The most impressive is the AVI since you
can record Audio/Video for the charts you create.
http://epicsyst.com/trendcompass/FreeVersion/TrendCompassv1.2_DotNet.zip
All the best.
Epic Systems
www.epicsyst.com