Database War Stories #2: bloglines and memeorandum

In Monday’s installment, Cory Ondrejka of Second Life said “flat files don’t cut it”, but Mark Fletcher of bloglines and Gabe Rivera of memeorandum.com apparently don’t agree.

Gabe wrote: “I didn’t bother with databases because I didn’t need the added complexity… I maintain the full text and metadata for thousands of articles and blog posts in core. Tech.memeorandum occupies about 600M of core. Not huge.”

Mark wrote: “The 1.4 billion blog posts we’ve archived since we went on-line are stored in a data storage system that we wrote ourselves. This system is based on flat files that are replicated across multiple machines, somewhat like the system outlined in the Google File System paper.”

Here’s what Mark had to say in full:

The subject of databases is either a favorite topic of mine or something I want nothing to do with. Obviously my mood is dependent upon the state of Bloglines’ various databases that particular day. In either case, I’ve done a lot of thinking about them…

Bloglines has several data stores, only a couple of which are managed by “traditional” database tools (which in our case is Sleepycat). User information, including email address, password, and subscription data, is stored in one database. Feed information, including the name of the feed, description of the feed, and the various URLs associated with feed, are stored in another database. The vast majority of data within Bloglines however, the 1.4 billion blog posts we’ve archived since we went on-line, are stored in a data storage system that we wrote ourselves. This system is based on flat files that are replicated across multiple machines, somewhat like the system outlined in the Google File System paper, but much more specific to just our application. To round things out, we make extensive use of memcached to try to keep as much data in memory as possible to keep performance as snappy as possible.

As evidenced by our design, traditional database systems were not appropriate (or at least the best fit) for large parts of our system. There’s no trace of SQL anywhere (by definition we never do an ad hoc query, so why take the performance hit of a SQL front-end?), we resort to using external (to the databases at least) caches, and a majority of our data is stored in flat files. Sure, we could have just gone with Oracle running on a big SAN, but that would have been very expensive overkill, both on the hardware and on the software licenses (and features, for that matter). And relational databases oftentimes are not the most efficient mechanism to store data, so we’d still most likely have to resort to using memcacheds.

Here’s Gabe:

I didn’t bother with databases because I didn’t need the added complexity… I maintain the full text and metadata for thousands of articles and blog posts in core. Tech.memeorandum occupies about 600M of core. Not huge.

About the flat files: Only if I’m doing a cold start (usually because of a new version) do I need to load the recent history. So I just maintain a flat file with the new data for each hour the system runs and eval the most recent few weeks of hourly files.

eval and Data::Dumper (a sort of “reverse eval” for data) are a handy way to read / write data certain kinds of data when you’re not using a database. I do wish eval ran a little faster though. I wonder how much optimization effort has been put into that.

More entries in the database war stories series: Second Life, Flickr, NASA World Wind, Craigslist, O’Reilly Research, Google File System and BigTable, Findory and Amazon, Brian Aker of MySQL Responds.