Database War Stories #3: Flickr

Continuing my series of queries about how “Web 2.0” companies used databases, I asked Cal Henderson of Flickr to tell me “how the folksonomy model intersects with the traditional database. How do you manage a tag cloud?”

He replied: “lots of the ‘web 2.0’ feature set doesn’t fit well with traditional normalised db schema design. denormalization (or heavy caching) is the only way to generate a tag cloud in milliseconds for hundreds of millions of tags. you can cache stuff that’s slow to generate, but if it’s so expensive to generate that you can’t ever regenerate that view without pegging a whole database server then it’s not going to work.”

Here’s the full text of my exchange with Cal:

The first question I asked was “what’s your database architecture?” Here’s what Cal had to say about that:

“started as a big vertically scaled classic replication tree (well, started as a single box) and soon hit against performance walls in terms of writes (repl gives you more reads, same writes as slowest machine in your tree). when we started, we barely knew anything about how mysql works with indexes (the documentation is fairly non-beginner), so months of index wrangling to get better performance ensued.

eventually there was only so much we could do, so we federated the main dataset into shards, organised by primary object (mainly users), splitting the data up into chunks. we do this in such a way that we can easily move data between shards as we add new ones or start hitting performance problems on old ones. our shards are master-master replication pairs.”

Next, I asked about lessons learned in managing the data store, and any particular war stories that would make great illustrations of those lessons learned. Cal replied:

“be careful with big backup files – writing or deleting several huge backup files at once to a replication filestore can wreck performance on that filestore for the next few hours as it replicates the backup files. doing this to an in-production photo storage filer is a bad idea.

however much it costs to keep multiple days of backups of all of your data, it’s worth it. keeping staggered backups is good for when you discover something gone wrong a few days later. something like 1, 2, 10 and 30 day backups.”
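One way to read the “1, 2, 10 and 30 day” suggestion is as a retention rule: for each target age, keep the newest backup that is at least that old. The sketch below implements that reading; the function name `backups_to_keep` and the exact selection rule are assumptions, since Cal doesn’t say how the backups are chosen.

```python
from datetime import date, timedelta

# Target ages taken from the text: 1, 2, 10 and 30 days.
KEEP_AGES_DAYS = (1, 2, 10, 30)


def backups_to_keep(backup_dates, today, ages=KEEP_AGES_DAYS):
    """For each target age, keep the newest backup that is at least that old."""
    kept = set()
    for age in ages:
        cutoff = today - timedelta(days=age)
        candidates = [d for d in backup_dates if d <= cutoff]
        if candidates:
            kept.add(max(candidates))
    return sorted(kept)


today = date(2006, 4, 30)
nightly = [today - timedelta(days=n) for n in range(40)]
keep = backups_to_keep(nightly, today)
```

With nightly backups available, `keep` holds four dates (30, 10, 2 and 1 days before `today`), and everything else is a candidate for deletion.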

I also asked for any information on the scale of data Flickr manages and its growth rates. Cal answered:

total stored unique data : 935 GB

total stored duplicated data : ~3TB

I also asked Cal: “I’m particularly interested in how the folksonomy model intersects with the traditional database. How do you manage a tag cloud? A lot of ideas about how databases are supposed to look start to go by the wayside…” He replied:

“tags are an interesting one. lots of the ‘web 2.0’ feature set doesn’t fit well with traditional normalised db schema design. denormalization (or heavy caching) is the only way to generate a tag cloud in milliseconds for hundreds of millions of tags. you can cache stuff that’s slow to generate, but if it’s so expensive to generate that you can’t ever regenerate that view without pegging a whole database server then it’s not going to work (or you need dedicated servers to generate those views – some of our data views are calculated offline by dedicated processing clusters which save the results into mysql).
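(An aside, not part of Cal’s reply: the sketch below shows one common form of the denormalization he is talking about. Alongside a normalised `photo_tags` table it maintains a redundant `tag_counts` table on every write, so the tag cloud becomes a cheap read instead of a `GROUP BY` over every tag row. The schema and function names are hypothetical, and SQLite stands in for MySQL only so the example is self-contained.)

```python
import sqlite3


def open_db():
    """Create an in-memory database with a normalised and a denormalised table."""
    db = sqlite3.connect(":memory:")
    db.executescript(
        """
        CREATE TABLE photo_tags (
            photo_id INTEGER NOT NULL,
            tag      TEXT    NOT NULL,
            PRIMARY KEY (photo_id, tag)
        );
        CREATE TABLE tag_counts (
            tag  TEXT PRIMARY KEY,
            uses INTEGER NOT NULL
        );
        """
    )
    return db


def add_tag(db, photo_id, tag):
    """Record a tag and keep the redundant counter in step, in one transaction."""
    with db:
        inserted = db.execute(
            "INSERT OR IGNORE INTO photo_tags (photo_id, tag) VALUES (?, ?)",
            (photo_id, tag),
        )
        if inserted.rowcount == 1:  # skip the counter if the tag was already there
            updated = db.execute(
                "UPDATE tag_counts SET uses = uses + 1 WHERE tag = ?", (tag,)
            )
            if updated.rowcount == 0:
                db.execute("INSERT INTO tag_counts (tag, uses) VALUES (?, 1)", (tag,))


def tag_cloud(db, limit=10):
    """Read the most-used tags straight from the denormalised table."""
    return db.execute(
        "SELECT tag, uses FROM tag_counts ORDER BY uses DESC, tag LIMIT ?", (limit,)
    ).fetchall()
```

The trade-off is the one the reply goes on to describe: the counter is a second copy of information already implied by `photo_tags`, so it can drift if a write updates one table and not the other.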

federating data also means denormalization is necessary – if we cut up data by user, where do we store data which relates to two users (such as a comment by one user on another user’s photo). if we want to fetch it in the context of both users, then we need to store it in both shards, or scan every shard for one of the views (which doesn’t scale). we store a lot of data twice, but then there’s the issue of it going out of sync. we can avoid this to some extent with two-step transactions (open transaction 1, write commands, open transaction 2, write commands, commit 1st transaction if all is well, commit 2nd transaction if 1st committed) but there’s still a chance for failure when a box goes down during the 1st commit.
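(Another aside: the two-step sequence in parentheses above can be sketched as follows, with two in-memory SQLite databases standing in for two shards. The `comments` table, `open_shard`, and `write_to_both` are illustrative names, not Flickr’s code, and this is not a full two-phase commit protocol.)

```python
import sqlite3

INSERT_SQL = "INSERT INTO comments (photo_owner, commenter, body) VALUES (?, ?, ?)"


def open_shard():
    """One in-memory database per 'shard'; BEGIN/COMMIT are issued by hand."""
    shard = sqlite3.connect(":memory:", isolation_level=None)
    shard.execute(
        "CREATE TABLE comments (photo_owner INTEGER, commenter INTEGER, body TEXT)"
    )
    return shard


def write_to_both(first, second, row):
    """Write the same row to two shards using the two-step sequence."""
    first.execute("BEGIN")  # open transaction 1
    second_open = False
    try:
        first.execute(INSERT_SQL, row)  # write commands
        second.execute("BEGIN")  # open transaction 2
        second_open = True
        second.execute(INSERT_SQL, row)  # write commands
    except sqlite3.Error:
        # Nothing has been committed yet, so both sides can still back out.
        first.execute("ROLLBACK")
        if second_open:
            second.execute("ROLLBACK")
        raise
    try:
        first.execute("COMMIT")  # commit 1st transaction if all is well
    except sqlite3.Error:
        second.execute("ROLLBACK")
        raise
    # A crash at this point leaves the row on the first shard only.
    second.execute("COMMIT")  # commit 2nd transaction if 1st committed
```

The sequence narrows the window for inconsistency to the gap between the two commits, but it cannot close it, which is the residual failure case described above.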

we need new tools to check data consistency across multiple shards, move data around shards and so on – a lot of the flickr code infrastructure deals with ensuring data is consistent and well balanced and finding and repairing it when it’s not.”
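As a rough illustration of the consistency checking Cal mentions, the sketch below compares two copies of the same rows (say, a comment stored on both the commenter’s shard and the photo owner’s shard) and reports the ones that need repair. The function name and the dictionary-based representation are assumptions; a real tool would read the rows from each shard’s database.

```python
def find_inconsistencies(copy_a, copy_b):
    """Compare two copies of the same rows, keyed by row id.

    Returns a dict mapping each problematic row id to a short description.
    """
    problems = {}
    for row_id in sorted(copy_a.keys() | copy_b.keys()):
        if row_id not in copy_a:
            problems[row_id] = "missing from first copy"
        elif row_id not in copy_b:
            problems[row_id] = "missing from second copy"
        elif copy_a[row_id] != copy_b[row_id]:
            problems[row_id] = "copies differ"
    return problems


shard_a = {1: "nice shot", 2: "great colours"}
shard_b = {1: "nice shot", 2: "great colors", 3: "wow"}
report = find_inconsistencies(shard_a, shard_b)
```

Here `report` flags row 2 as differing and row 3 as missing from the first copy; deciding which copy is authoritative when repairing is a separate policy question that the sketch does not address.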

More entries in the database war stories series: Second Life, Bloglines and Memeorandum, NASA World Wind, Craigslist, O’Reilly Research, Google File System and BigTable, Findory and Amazon, Brian Aker of MySQL Responds.

  • http://www.stroeck.com Michael Ströck

    3TB?! Does that include the image files? Seems rather low to me.

  • Joe Zobkiw

    Forget the 3TB – the 935 GB of UNIQUE data seems low! I’m presuming the TB figure is backups and the GB figure is actual “live” data.

  • Jonathan

    They’re talking about database in this article. 935 GB in the database. That’s tags, comments, users, and other meta-data.

    Photo storage is another matter.

  • cal

    correct – those figures don’t include photos; they include only database rows and indices. photos files themselves are not stored in a database.

  • http://blog.doublegrande.com/espresso doublegrande

    I don’t think the disparity from unique is from live/backup, but shows the scale of duplication involved in using the shards. It’s a pretty amazing accomplishment that flickr has been able to keep its ‘massages’ down to a minimum while keeping up with such phenomenal growth. I’d imagine it’s one of the more rewarding data projects to have been a part of, along with the likes of Amazon.com.

  • http://www.nationalgazette.org/ Croaky

    Where are the photos stored? Simply on web servers?

  • http://gridrunner.blogspot.com gridrunner

    All photos (including thumbnails) download from the subdomain static.flickr.com. I assume this points at a completely different set of web servers. The last thing they want is image download traffic interfering with database communications.

  • anonymoustroll

    > they include only database rows and indices.
    > photos files themselves are not stored in a
    > database.

    Holy shit… almost a terabyte of text?

    I don’t even want to think about how much storage is required for the actual pictures.

  • http://www.sprinj.com heri

    is it just me but it seems that their architecture is going to break up one day? isn’t there a way to build websites that scale from day 1, and you just need to add web servers afterwards?

  • Anonymous

    Sure there’s a way to build websites that scale from day 1. Problem is, that costs money. Something that constrains most startups. Hence, the usual model is you start with something small and affordable, and by the time your architecture doesn’t scale anymore, you hope to have made enough money to build the next level. And this may iterate a few times.

  • http://poliertes-leder.blogspot.com spongetoad

    I have had lots of positive experience with normalised databases and web 2.0. I think, they both work very well together

  • http://theprogrammersparadox.blogspot.com Paul W. Homer

    If you essentially have to ‘break’ the relational model in order to ‘scale’ the relational model to work in these bigger systems, doesn’t that say something significant about the relational model?

    All this pain, agony and work just to keep a lot of relatively shallow data around a bit longer and get at it quickly. Maybe starting with a generalized database isn’t such a great idea?

    Paul.

    http://theprogrammersparadox.blogspot.com

  • http://theprogrammersparadox.blogspot.com Paul W. Homer

    Eek, sorry about the extra spaces. I guess this is another one of those sites where the ‘preview’ display doesn’t match the final display:

    http://theprogrammersparadox.blogspot.com/2007/12/silly-little-storm.html

    If it wasn’t for inconsistency there’d be no consistency at all… :-)

  • http://www.fatherstorm.com FatherStorm

    This points to what is likely the largest (weakness/strength) of databases today. Virtually all programming is done using some sort of OOP (Object Oriented Programming) language, where you define an object, pass it values and variables, make it do all kinds of stuff internally, and get back something you can work with. Databases are not designed to be treated as objects since everything’s tied to a table, a row, a column, a cell. The entire concept of tables is counter to OOP, and until someone a hell of a lot smarter than anyone in this (virtual) room redesigns the way databases work at a fundamental level, we’re probably stuck with what we have.

  • http://www.inverudio.com lazar

    1TB may seem ‘not much’ as it can fit in two larger hard drives, but having some experience with the netflixprize dataset, which is about 1GB (1000 times less), I can definitely say that operating a 1TB database is quite an impressive thing.

    More information on Flickr’s use of MySQL is here:

    http://www.mysql.com/customers/customer.php?id=246

  • Dan

    It’s too bad that writes couldn’t be done to a normalized “authoritative” schema, and then copied into a denormalized schema to the various shards that need the data. Reading would be done on the denormalized copies only. That would solve the problem of data getting out of sync. Although not having worked with data-sets this large before I’m sure I’m simplifying things and there’s a reason why this wouldn’t work.

  • http://natishalom.typepad.com Nati Shalom

    Another alternative for scaling out MySQL is provided here
    The idea is to leave your database untouched and use an in-memory cluster in front of the database to deal with high traffic volume of reads and writes while keeping the database synchronized as a background process.