NoSQL conference coming to Boston
On March 11 Boston will join several other cities that have hosted conferences on the movement broadly known as NoSQL. Cassandra, CouchDB, HBase, HyperGraphDB, Hypertable, Memcached, MongoDB, Neo4j, Riak, SimpleDB, Voldemort, and probably other projects as well will be represented at the one-day affair.
It's generally understood that characterizing a movement by what it's not is awkward, and it's hard to find an elevator speech to encompass all the topics of NoSQL Boston. Are these tools for "big data" problems? Usually, but sometimes even small web sites can find them useful. Are the tools meant for processing streams such as log files? Sometimes, but they can be useful for other text and data processing as well. And do they reject relational principles? Well, so you'd think--but different ones reject different principles, so even there it's hard to find commonality. (I compared them to relational databases in a blog last year.)
The interviews I had with various project leaders for this article turned up a recurring usage pattern for NoSQL. I was seeking particular domains or types of data where the tools would be useful, but couldn't see much commonality. What connects the users is that they carry out web-related data crunching, searching, and other Web 2.0 related work. I think these companies use NoSQL tools because they're the companies who understand leading-edge technologies and are willing to take risks in those areas. As the field gets better known, usage will spread.
I had a talk last week with conference organizer Eliot Horowitz, who is the founder and CTO of 10gen, the company that makes MongoDB. He let me know that the conference plans to bypass the head-scratching and launch into practical applications. The day will contain a coding session and a schema design session along with keynotes.
The resilience of open source
One question that intrigues me is why all the offerings in the NoSQL area are open source. Some have commercial add-ons, but the core technology is provided as free software. The few proprietary products and services in the market (such as Citrusleaf) get far less attention. Reasons seem to include:
The projects in this conference therefore demonstrate the innovative power of free software. CouchDB and Cassandra are particularly interesting in this regard because they are community efforts more than corporate efforts. Both are Apache top-level projects. (Cassandra was just moved from the incubator to a top-level project on February 17.)
CouchDB committer J. Chris Anderson tells me that the Apache community process ensures a wide range of voices are heard, leading to (of course) occasional public wrangling but a superior outcome. The BBC and (according to Anderson) SXSW are among the users of CouchDB; CouchDB has been integrated into Ubuntu; Mozilla Messaging is basing Raindrop (their next-generation messaging platform) on CouchDB; and even mobile handset manufacturers are looking at it. (O'Reilly Media also uses CouchDB.) I also talked to Alan Hoffman of Cloudant, which offers a CouchDB cloud service that fills in some of the gaps left by bare CouchDB (consistent hashing, partitioning, quorum, etc.).
Although a couple of companies offer commercial support, no single company takes responsibility for CouchDB. Its community is highly distributed. Anderson listed 10 Apache committers working for 8 different companies, and nearly 40 other people who contribute patches. Support takes place on mailing lists (roughly one thousand messages a month) and IRC channels.
Jonathan Ellis, project chair of Cassandra, calls it an "open source success story" because it went from a state of near petrification to vibrant regrowth through open sourcing. Facebook invented it and brought it to a state where it satisfied their needs. They made it open and moved it into the Apache Incubator in 2008 but declared that they would not be doing further development. It could easily have receded into obscurity. Ellis says that he was hired at Rackspace and asked to find a distributed data store that was fast and scaled easily; he decided on Cassandra.
Soon after he became a public and enthusiastic advocate, Digg and Twitter joined Rackspace as users and developers. Having multiple QA teams test each release--particularly in very different environments--helps quality immensely. Ellis finds that Eric Raymond's "many eyes" characterization of open source bug fixing applies. Although Cassandra is found mostly as a backing store for web sites with a lot of users, Ellis thinks it would meet the needs of many academic and commercial sites, and looks forward to someone offering a cloud service based on it.
Justin Sheehy, CTO of Basho, maker of the Riak data store, told me they can confirm the typical advantages cited for open source. Developers at potential customer sites can try out the software without going through a bureaucratic procurement process, and then become internal advocates who function much more effectively than outside salespeople.
He also says that companies such as Basho offer the best of both worlds to tentative customers. The backing of a corporation means that professional services and added tools are available to go along with the product those customers buy. But because the source is open and has a community around it, those customers can feel secure that development and support will continue regardless of the fate of the originating company. 10gen, of course, plays a similar role for MongoDB, and Anderson's company Couchio offers support for CouchDB. For projects that are not closely associated with the backing of one company, the Apache Software Foundation's sponsorship helps to ensure continuity.
What are the fault lines in the NoSQL landscape?
Naturally, the projects I've mentioned in this blog borrow ideas from each other and show tiny variations on common solutions regarding such things as B-tree storage, replication, solutions to locality of reference, etc. Experience will eventually lead to a shake-out and a convergence among surviving projects. In the meantime, how can you get your head around them?
We'll pause here for a word from our sponsors, letting you know that O'Reilly has published books on CouchDB and Hadoop and is developing one about MongoDB.
Horowitz offers an initial subdivision of projects based on data model (document, key-value, or tabular), a theme he explored in another interview. Roger Magoulas, a research director with O'Reilly, further subdivides projects into those that crunch large data sets in a batch manner--such as Hadoop--and those that retrieve views of data to fulfill visitor search requests on web pages or similar tasks. He goes on to say that you can compare them on the basis of particular features, such as automatic replication, auto-sharding or partitioning, and in-memory caches.
The most comprehensive attempts I've seen to make sense of this gangly crew of projects from a feature standpoint come in a blog by Ellis and one by Vineet Gupta. (Gupta's blog is labeled "Part 1" and I'd love to see more parts.) But Sheehy says the various features of the offerings interact too strongly and have too many subtle variations to fit into an easy taxonomy. "Many people try to classify the projects, everyone does it differently, and nobody gets it quite right."
Community features
So who uses these things? To take Horowitz's MongoDB again as an example, many web sites gravitate toward it because the document structure makes some things--adding fields to rows, mapping objects to fields--easier than a relational database does. A few scientific sites also use MongoDB.
Riak also has a large following among web sites and startups, but Basho's customers also include media companies, ad networks, SMS gateways, analytics firms, and many other types of organizations.
Magoulas finds that an organization's bent is determined by the background and expertise of its developers.
Programmers with lots of traditional relational database experience tend to be wary of the recent upstarts, a position reinforced by legacy investments in tools that depend on their relational database and are sometimes very expensive. On the other hand, web programmers look for tools that conform more closely to the data structures and programming techniques they're used to, and can actually be "flummoxed" by relational database logic or abstraction layers on top of the databases. These programmers may think it intuitive to do the kinds of filtering and sorting that seem like reinventing the wheel to a traditional RDBMS programmer.
Anderson likes to quote Jacob Kaplan-Moss, the creator of Django, as saying, "Django may be built for the Web, but CouchDB is built of the Web. I've never seen software that so completely embraces the philosophies behind HTTP."
10gen's consultation with MongoDB users includes asking for votes on new features. They also see a great many code contributions in the driver layer and adapters (sessions, logging, etc.) but not much in the core. Sheehy said the same is true of Riak: although contributions to the core are rare, half the client libraries are developed by outsiders, as are many of the tools.
Rapid change is part of life for NoSQL developers. Anderson says of CouchDB, "The ancillary APIs have been evolving rapidly in preparation for our 1.0 release, which should come out in the next few months and won't differ much from today's trunk. The new APIs include authentication, authorization, details of Map/Reduce, and functions for transforming and serving JSON documents as other datatypes such as HTML or CSV." Horowitz stressed that MongoDB will roll out a lot of new features over the upcoming year.
One hundred people have signed up for NoSQL Boston so far, and more than 150 are expected. I'll be there to take it in and try to reduce it to some high-level insights for this blog.
Comments: 7
Bradford [24 February 2010 02:08 PM]
Cool article, and a very good overview of the "State of Affairs." I'm actually hosting the scalability panel at NoSQL Live (Boston).
You mentioned "Whatever problem an organization is trying to solve, each NoSQL offering tends to be piece of the solution."-- that's exactly one of the problems our startup, Drawn To Scale, has solved.
We make all these problems "go away". We make it easy and scalable for companies to process, query, serve, store, and search their data in real-time. And it's seamlessly scalable. All companies have to do is put data in an API, we handle the rest :)
Andy Oram [24 February 2010 02:31 PM]
It's fair to mention your service, Bradford--and it actually complements the blog well because I was asking who could offer a cloud service for the kinds of work these tools are doing--but I think potential clients are going to want to know a lot more about the service before they fill out the form on your site. I saw a question on your blog, and if I had you in my own blog I would have asked you a lot more questions. Knowing how big some sites' data sets get, for instance--how much effort goes into managing dedicated servers for data--I'm curious how you're confident you can store everybody's data.
Emil Eifrem [24 February 2010 02:59 PM]
Good overview. One of the things a lot of people seem to forget is that NOSQL is not only about scaling to size, but also scaling to complexity. While petabytes of data is a worthy goal, I believe the more common use case (outside of a few giants like Facebook and Google) is to cope with complex data. That's where a graph database like Neo4j shines.
-EE
Emil Eifrem [24 February 2010 03:05 PM]
Hmm, I messed up that link. Here's the proper link about scaling to complexity.
-EE
Adam Crabtree [25 February 2010 08:32 AM]
I'm currently building a new social networking site (ya, ya, ya...) and while my architecture will tap into new technologies like Node.js, Sammy.js, (big on JavaScript), etc... I've considered using solutions like CouchDB and MongoDB as they seem to have heavy followings within the "Web 2.0" communities.
That being said, I still don't understand how the NoSQL solutions will help solve heavily interrelated data issues such as user information and recommendations like events. I see them as fast solutions for some form of front-loading for real-time and Comet, but they don't seem much use to me beyond that?? Maybe I'm having trouble breaking out of the RDBMS paradigm, but I'm trying hard and I just don't see these as viable for REAL web apps with complex interrelated data.
How would a social network like Facebook utilize a NoSQL solution when so much of their data must be heavily joined and interrelated?
Borislav Iordanov [25 February 2010 10:29 AM]
Adam,
Look at HyperGraphDB (http://www.kobrix.com/hgdb.jsp), it's precisely this kind of problem that it was built to solve. It's an embedded OO hypergraph db (think db4o + general hypergraphs).
Boris
Adam Saltiel [12 March 2010 05:19 PM]
Adam,
I'm interested, how would you use Node.js with a graph db? How would Node.js interface with the db, is there an existing means in Node.js api?
I think you are wrong about these tools not being suitable for real web apps with complicated interrelated data. I understand that graph dbs specifically are for such data, and that a problem with RDBMSs is that they do not have a flexible schema, and therefore lack flexible relationships between the parts of that schema. I don't think this is too hard to understand, in that the db access layer in an app must model the underlying db schema, and data manipulation is done with objects that are determined by the schema. In this situation it is particularly difficult to join the data in ways outside of those dictated by the schema. Think of the difficulty of extracting hierarchical data from an RDBMS. These issues are explained well in Beautiful Data, O'Reilly, 2009, esp. in chapter 20, as well as the background to various of the NoSQL projects.