Database War Stories #8: Findory and Amazon

Once Greg Linden had pinged me about BigTable (leading to yesterday’s entry), it occurred to me to ask Greg for his own war stories, both at Findory and Amazon. We hear a recurrent theme: the use of flat files and custom file systems. Despite the furor that ensued in the comments when Mark Fletcher and Gabe Rivera noted that they didn’t use traditional SQL databases, Web 2.0 applications really do seem to have different demands, especially once they get to scale. But history shows us that custom, home-grown solutions created by alpha geeks point the way to new entrepreneurial opportunities… Greg wrote:

There are a couple of stories I could tell, one about the early systems at Amazon, one about the database setup at Findory.

On Findory, our traffic and crawl are much smaller than sites like Bloglines, but, even at our size, the system needs to be carefully architected to rapidly serve up fully personalized pages for each user that change immediately after each new article is read.

Our read-only databases are flat files — Berkeley DB to be specific — and are replicated out to our webservers using our own replication management tools. This strategy gives us extremely fast access from the local filesystem. We make thousands of random accesses to this read-only data on each page serve; Berkeley DB offers the performance necessary to still serve our personalized pages rapidly under that load.
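The access pattern Greg describes — many random key lookups against a local read-only file, with no network round trips — can be sketched roughly as follows. Berkeley DB’s C API isn’t in Python’s standard library, so this sketch uses the stdlib `dbm` module as a stand-in; the file path, keys, and values are invented for illustration.

```python
import dbm
import os
import tempfile

# Hypothetical store; in the setup described above, this file would be
# built offline and replicated to each webserver's local filesystem.
path = os.path.join(tempfile.mkdtemp(), "articles.db")

# Build the read-only data once.
with dbm.open(path, "n") as db:
    db[b"article:1"] = b"Personalization and the Web"
    db[b"article:2"] = b"Scaling Findory"

# Page serving: repeated random lookups against the local file.
with dbm.open(path, "r") as db:
    title = db[b"article:1"]
    print(title.decode())  # prints "Personalization and the Web"
```

Because every lookup hits a local file (and usually the OS page cache), thousands of such accesses per page render stay cheap, which is the point of pushing the data out to each webserver rather than querying a central database.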

Our much smaller read-write data set, which includes information like each user’s reading history, is stored in MySQL. MySQL MyISAM works very well for this type of non-critical data, since speed is the primary concern and sophisticated transactional support is not important. While it has not been necessary yet, our intention is to scale our MySQL database with horizontal partitioning, though we may also experiment with MySQL Cluster.
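Horizontal partitioning of per-user data typically means routing each user to a fixed shard so that all of a user’s reading history lives on one database. A minimal sketch of that routing, with a hypothetical shard count and naming scheme (neither is from the original post):

```python
NUM_SHARDS = 4  # hypothetical; chosen when the data is split

def shard_for_user(user_id: int) -> str:
    # Deterministic mapping from user id to a shard name, so the same
    # user's queries always land on the same MySQL instance and
    # per-user reads never cross shards.
    return f"mysql-shard-{user_id % NUM_SHARDS}"

print(shard_for_user(12345))  # prints "mysql-shard-1"
```

The catch with simple modulo sharding is resharding: changing `NUM_SHARDS` remaps most users, which is one reason schemes like consistent hashing or lookup tables are often preferred as data grows.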

For both read-only and read-write data, we have been careful to keep the data formats compact so that the total active data set easily fits in main memory. We avoid disk access as much as we can.
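To make the compactness point concrete, here is one way a fixed-width binary record keeps per-user data small; the field layout is invented for illustration, not Findory’s actual format.

```python
import struct

# Hypothetical history entry: user id, article id, read timestamp,
# each an unsigned 32-bit int -- 12 bytes per entry, versus dozens of
# bytes for a delimited-text row, so millions of entries fit in RAM.
RECORD = struct.Struct("<III")

packed = RECORD.pack(42, 1001, 1145000000)
print(len(packed))  # prints 12

user_id, article_id, read_ts = RECORD.unpack(packed)
```

Fixed-width records also allow direct indexing into an in-memory array (entry *n* starts at byte `n * 12`), avoiding any parsing on the read path.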

After all of this, Findory is able to serve fully personalized pages, different for each reader, that change immediately when each person reads a new article, all in well under 100ms. People don’t like to wait. We believe getting what people need quickly and reliably is an important part of the user experience.

[Regarding Amazon], as it turns out, I have already written up a fairly detailed description of Amazon’s early (1997) systems on my weblog. An excerpt from that post:

[We] were going to take Amazon from one webserver to multiple webservers.

There would be two staging servers, development and master, and then a fleet of online webservers. The staging servers were largely designed for backward compatibility. Developers would share data with development when creating new website features. Customer service, QA, and tools would share data with master. This had the added advantage of making master a last wall of defense where new code and data would be tested before it hit online.

Read-only data would be pushed out through this pipeline. Logs would be pulled off the online servers. For backward compatibility with log processing tools, logs would be merged so they looked like they came from one webserver and then put on a fileserver.
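The log-merging step above — combining per-server logs into a single stream that looks like it came from one webserver — is a classic k-way merge when each server’s log is already in timestamp order. A small sketch (the log entries are invented):

```python
import heapq

# Hypothetical per-server logs, each already sorted by timestamp.
server_a = [(1, "GET /"), (4, "GET /books")]
server_b = [(2, "GET /cart"), (3, "GET /")]

# heapq.merge lazily interleaves the sorted streams into one
# timestamp-ordered log, so downstream tools built for a single
# webserver keep working unchanged.
merged = list(heapq.merge(server_a, server_b))
```

This runs in O(n log k) for n total entries across k servers, and streams rather than loading all logs into memory, which matters once the fleet and the logs grow.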

Stepping out for a second, this is a point where we really would have liked to have a robust, clustered, replicated, distributed file system. That would have been perfect for read-only data used by the webservers.

NFS isn’t even close to this. It isn’t clustered or replicated. It freezes all clients when the NFS server goes down. Ugly. Options that are closer to what we wanted, like Coda, were (and still are) in the research stage.

Without a reliable distributed file system, we were down to manually giving each webserver a local copy of the read-only data.

More entries in the database war stories series: Second Life, Bloglines and Memeorandum, Flickr, NASA World Wind, Craigslist, O’Reilly Research, Google File System and BigTable, Brian Aker of MySQL Responds.

  • Excellent as always! On the ‘distributed filesystem’ point, anyone who’s looking for one might want to check out Brad Fitzpatrick’s (of LiveJournal fame) MogileFS. It’s not a bad solution although I’m not sure of the latency.

  • Further to my previous comment, there’s a small discussion about MogileFS (and its pitfalls) in the Amazon post on Greg’s site :)

  • Perrin Harkins

Regarding “flat-files” mentioned here and in other entries, people seem to be applying the term to many different things. My understanding has always been that a flat file is just data and delimiters, with no index. An XML file or a CSV file would be examples. BerkeleyDB is a highly tuned disk-based btree database (we used to call them dbm files), similar to the ISAM tables that MySQL was originally built on. It has an index, manages its own cache, supports multi-process page-level locking, transactions, etc. It’s not a relational database, but I wouldn’t put it in the same league as a CSV file, and I wouldn’t call it a flat-file.

  • Perrin, I think another element of flat files is that they are extremely easy to parse — that there’s a simple delimiter and regular structure. So I’d say tab-delimited and CSV files are flat, but XML files are not.

    I worked for Amazon subsidiary Alexa Internet and also the Internet Archive, and we were big users of flat files for analysis and permanent storage. The Internet Archive’s published standard for storing archived snapshots of web sites — the “arc” file — is a flat file.


  • Great point, Perrin. BDBs are not really flat files. Sorry, I was using the term sloppily.

    It is also true that BerkeleyDB has added many features including transactional support. Since Findory is using these databases read-only, we have little need for many of those features.

    I have wondered if we should consider a simpler, more compact data format to see if we could get further performance gains. At the moment, looking for that kind of optimization is not pressing, but it might be something to explore and test in the future.

  • Perrin Harkins

Greg, if you find anything faster than BDB, I’d be interested to hear about it. I test all the hash-like storage systems I can lay hands on, and so far none of them has been significantly faster than BDB. It does seem like a read-only format should allow for more optimization.

  • Greg, Perrin, if you find anything faster than Berkeley DB, we’d like to hear about it! Thanks for the kind words.

  • Stephen Deasey

    Faster than BDB: