Thu

Apr 27
2006

Tim O'Reilly

Tim O'Reilly

Database War Stories #2: bloglines and memeorandum

In Monday's installment, Cory Ondrejka of Second Life said "flat files don't cut it", but Mark Fletcher of bloglines and Gabe Rivera of memeorandum.com apparently don't agree.

Gabe wrote: "I didn't bother with databases because I didn't need the added complexity... I maintain the full text and metadata for thousands of articles and blog posts in core. Tech.memeorandum occupies about 600M of core. Not huge."

Mark wrote: "The 1.4 billion blog posts we've archived since we went on-line are stored in a data storage system that we wrote ourselves. This system is based on flat files that are replicated across multiple machines, somewhat like the system outlined in the Google File System paper."

Here's what Mark had to say in full:

The subject of databases is either a favorite topic of mine or something I want nothing to do with. Obviously my mood is dependent upon the state of Bloglines' various databases that particular day. In either case, I've done a lot of thinking about them...

Bloglines has several data stores, only a couple of which are managed by "traditional" database tools (which in our case is Sleepycat). User information, including email address, password, and subscription data, is stored in one database. Feed information, including the name of the feed, description of the feed, and the various URLs associated with feed, are stored in another database. The vast majority of data within Bloglines however, the 1.4 billion blog posts we've archived since we went on-line, are stored in a data storage system that we wrote ourselves. This system is based on flat files that are replicated across multiple machines, somewhat like the system outlined in the Google File System paper, but much more specific to just our application. To round things out, we make extensive use of memcached to try to keep as much data in memory as possible to keep performance as snappy as possible.

As evidenced by our design, traditional database systems were not appropriate (or at least the best fit) for large parts of our system. There's no trace of SQL anywhere (by definition we never do an ad hoc query, so why take the performance hit of a SQL front-end?), we resort to using external (to the databases at least) caches, and a majority of our data is stored in flat files. Sure, we could have just gone with Oracle running on a big SAN, but that would have been very expensive overkill, both on the hardware and on the software licenses (and features, for that matter). And relational databases oftentimes are not the most efficient mechanism to store data, so we'd still most likely have to resort to using memcacheds.

Here's Gabe:

I didn't bother with databases because I didn't need the added complexity... I maintain the full text and metadata for thousands of articles and blog posts in core. Tech.memeorandum occupies about 600M of core. Not huge.

About the flat files: Only if I'm doing a cold start (usually because of a new version) do I need to load the recent history. So I just maintain a flat file with the new data for each hour the system runs and eval the most recent few weeks of hourly files.

eval and Data::Dumper (a sort of "reverse eval" for data) are a handy way to read / write data certain kinds of data when you're not using a database. I do wish eval ran a little faster though. I wonder how much optimization effort has been put into that.

More entries in the database war stories series: Second Life, Flickr, NASA World Wind, Craigslist, O'Reilly Research, Google File System and BigTable, Findory and Amazon, Brian Aker of MySQL Responds.


tags: web 2.0  | comments: 23   | Sphere It
submit:

 
Previous  |  Next

0 TrackBacks

TrackBack URL for this entry: http://blogs.oreilly.com/cgi-bin/mt/mt-t.cgi/4611

Comments: 23

  Nicholas Clark [04.27.06 07:04 AM]

I do wish eval ran a little faster though. I wonder how much optimization effort has been put into that.

eval is basically a wrapper around Perl's parser. I don't think that there is that much slack to be shaved from the wrapper, so any speed gains would have to be found from the parser itself. What does profiling it show while feeding it your data? Alternatively can you use Storable to save/restore your data? It is written with speed in mind, and should outperform Data::Dumper/eval for larger structures.

  Paul Mison [04.27.06 09:15 AM]

The only advantage I can think of for eval/Data::Dumper over Storable is that the data is stored in a somewhat human readable format. If you really care about that, I'd be considering YAML (either the original, or YAML::Syck, which I gather is better in various ways).

Mind you, Nicholas and I both have a certain perspective on data persistence using Storable that may be skewing my thoughts on what's important.

  phil swenson [04.27.06 12:25 PM]

Using flat files for storing data is a terrible decision.

with flat files whenever you want to change the "schema" of text files you're in for some serious pain. if you ever wants to extract/analyze data it will be painful.

It's not like you have to run Oracle.... just use MySQL. MySQL is extremely easy to setup, admin, and backup.

I've run across this mentaility before... I inherited a flat file e-commerce system. What a disaster. Schema changed every few days, to do reports they had written 100s of "if thens" to handle all the variations. When I was done doing a revenue report was as simple as "select sum(total) from order" To change a schema it was a simple "alter table" statement.

There's a reason databases were created.

  Gabe [04.27.06 12:42 PM]

Nicholas: yep, that's basically what I mean: I wonder how much the parser itself could speed up. Since parsing is a small part of the running time for most apps, I wouldn't be surprised if parser optimizations were low priority, meaning lots of headroom might remain. I could be wrong about the priority thing -- haven't really investigated. All I know is a faster eval would help me.

Paul: human readable is important. There are other advantages too. A single file can contain multiple definitions, and even conditional assignments. Strictly speaking, my data files aren't just data.

  John Hart [04.27.06 12:46 PM]

Phil, I disagree with your categorical statement that "using flat files for storing data is a terrible decision". I would think that, for a blog system, storing the postings in flat files is a great decision.



Databases are useful for ad-hoc queries, schema flexibility, transactional semantics, and flexible indexing.



If you don't need these things, you can get a _lot_ of wins by not using a database. If you've been developing within a database for a long time, seeing the performance you can get from flat files will blow you away. You realize, "oh, wow, computers are a lot faster than I realized because the database has been in my way for so long."



As a funny example, what's the quickest way to get a sorted 1TB dataset from Oracle? No matter how you define your indices, it's still quicker to export unsorted data via SQL*Loader and then sort it yourself. Crazy, but (in my experience) true.

  Josh Peters [04.27.06 01:54 PM]

Phil's likely gonna take a lot of slack for his categorical statement :)

It's got a lot of truth to it though, but it all depends on what you store and how.

If you've got a site that's been setup in a Dreamweaver-style template system, it rather pales in comparison to a nice "real" template based MySQL application.

That doesn't have to be the only way though, you can just as easily store flat files of XML data and utilize something like XSLT on the server side to transform it into whatever format you desire.

Flat files have their place, for darn sure. It's not nearly as robust as a database for integrity's sake, but there are a lot of schemas that suck in databases too :)

Regardless of the way you go, the performance is likely gained or lost in the schema.

  Mike Herrick [04.27.06 06:23 PM]

Depends on what you are doing. I posted the other day on some lessons I learned on this topic with an event driven system.

Here is the gist:
Separate transient (i.e., part of event workflow), terminal state (i.e., completed event workflow), and reporting data bases (ODS, Data ware house, OLAP)

  Francois Orsini [04.27.06 07:06 PM]

Easy solution - use Apache Derby - free, open source, fast, lightweight and zero admin

http://db.apache.org/derby/

  Scott Lewis [04.27.06 09:22 PM]

It does seem to go somewhat beyond issues of speed. Part of Phil's point may be that databases provide a consistent, highly-referenced system for storing and retrieving data, whereas flat-file systems tend to be idiosyncratic. This affects not only ongoing support issues, but the actual value of a product, potentially. New purchasers of an online site, for example, would want surety that the data can be easily and reliably managed in the future. Databases tend to be better at providing that than flat-file systems, no matter how robust or fast.

  phil swenson [04.28.06 08:43 AM]

Let me amend my statement "Using flat files for storing data is a terrible decision." to "Using flat files for storing data is a terrible decision in 99% of cases." In software there are always exceptions.

I was thinking more towards web apps which is what the posting was about.

I've seen a lot of shops who think DBs are slow so they end up writing their own systems for managing the data. In every case I've witnessed this was a mistake, they were spending precious resources trying to be clever instead of solving business problems. A properly designed DB is fast as snot. There are rare cases when a DB isn't appropriate (like maybe a desktop app where you are only saving small amounts of data, search engines, stuff like that)...but for almost every case in the rest of the world a DB is the right move.

JMO of course :)

  Stu [04.28.06 09:54 AM]

Wouldnt it make more sense to store them in a DB and flatfile export them out as text, give you both sides of the coin rather than one.


Flat files make me think of a) nasty csv stuff or b) ye olde pc-file+ / dbase type stuff where your 'db' is just a table where every field is N characters wide.


I can understand the speed thing about using flatfiles. I worked in a large datawarehouse where we had multiple as400's with many TB's of data where everything was fully denormalised (demoralised data!) for sake of query speed.

  Patrick Tufts [04.28.06 10:06 AM]

"if you ever wants to extract/analyze data it will be painful."

In my experience, doing very large-scale data analysis at Amazon and IBM, this is not the case. I prefer to work with flat files (either logfiles or full database extracts) because I'm going to do multiple passes over nearly every record, and most databases, while great for random access to records, are terrible when you just want everything.

--Pat

  Andrew Hedges [04.28.06 01:08 PM]

Why not get the best of both worlds? Use your database for the canonical version of the data, but write the data to flat files every so often (hour, day, etc., depending on your requirements) and get the benefits listed above?

"Can't we all get along?"

  Andy [04.28.06 03:08 PM]

Oh yeah, well my database sucks worse than yours! (since these are "war stories"), and we couldn't create indexes on zeros, only ones!

Here's some more ancedotal comparisons to feed the fire.

I have a huge directed graph stored in MySQL as a single table with (parent, child) relationships. As an optimization for other purposes, this gets flattened into a complete mapping of all ancestors to all descendents. The graph is maintained in an SQL table for maintenance reasons (we already have the tools that can manipulate the graph as stored in MySQL). The graph, and thus the table, is relatively small, 4000 nodes and 2000 edges.

Traversing the entire graph using SQL queries in sequence (query each node when it is reached) took 25 minutes. Selecting everything from the table and generating the same data structure (parent->child relationships, not nested structures) in perl cut the run time down to 45 seconds. That's 45 seconds to traverse the entire graph. Of course, once it's all in memory, that kind of speed is to be expected.

I also had a 300 million row table composed only of a handful of fixed-width columns--not large, but nothing to sneeze at--that needed to be joined on itself. The cost in time to do this in MySQL was prohibitive (admittedly, the database could have used some optimization), even with the proper indexes being created and used. It was significantly faster to export the data to flat files, load it into an SQLite database and perform the relational operations in SQLite rather than MySQL, despite the additional time and effort of having having to do the export and import.

SQLite databases, while speedy, can often be treated like flat files, because they are self-contained and can be copied around to other machines in their entirety. You could think of this as getting the best of both worlds, the flexibilty and accessiblity of flat-files, with the power (should you need it) of relational structure.

  Bob [04.28.06 07:50 PM]

Databases are great, and I'm glad that MySQL, Postgres, and SQLite have brought relational technologies within everybody's reach, but there are definitely times when flat files are simpler and better.

If you could perform "commits" on updates to a set of flat files in a transactional way, that would get make flat files even more attractive as an alternative to "ACID"-compliant databases.

  Chris M. [04.28.06 10:24 PM]

one advantage of a database is it makes it easy to change your schema, as phil pointed out. If your project is at all visionary or cutting-edge, that freedom is indispensible, since you never know what direction you'll be going in next. What kind of relationships will be the most important ones, etc.

If you're lucky though, and then smart, you'll be able to cache essentially all of your application in "flat" html files in front of your database. You regenerate only the ones you need to when the database is modified, which should be infrequently. This is the approach Rails takes. Then also your pages can be comprised of up to dozens of small bits and pieces, so that when a master page is invalidated, the database is only hit to regenerate just the part that was changed and the regeneration only takes a half second or something.

In the Rails world basically there are three levels of slowness in a webapp. Apache serving a flat file is at the fast end. A scripting engine assembling flat-file fragments then serving them as an assembled page is in the middle. A scripting engine having to consult the database about something is at the bottom. You want to avoid the latter in 99% of cases. This will allow you to scale your app cheaply, so you don't get clueless investors telling you how to live your life.

Hard part is, you need to shape your features to support this kind of architecture. Private pages that some people can see and others can't aren't a great match for it, though they can still benefit from memcached style object and fragment caching.

Fortunately many apps in this day and age are privileging public, open exchange of data and radically simplified permissioning ...

  Michael Daines [04.30.06 05:24 PM]

This was good to read, and I think I certainly agree that flat files are probably a good implementation for things like blog posts or web-pages, where the structure in a database would be some kind of key identifying each thing (something like an URL) and some block of data you can't assume all that much about (like a blog post).



Plus, in cases like the article mentions, it's probably not such a big deal to worry about which to choose, because you can keep the interface you say "get some blog post" to separate from how it's implemented. In some theoretical system, you might be doing a variety of things when you say this anyway. You might be fetching web-pages from the internet, you might be going to memcached, you might be looking in the file system...

  ritu [09.25.06 03:46 AM]

are there any advantages of using files instead of any database?plz explain

  Gavin [12.07.06 01:49 PM]

The time to learn a relational db and then special cases to back up, etc., are big time inputs needed for rdbms. Do not forget that in the 'against' category..

  harini [02.06.07 08:02 AM]


I want to know, what is the advantage of file processing system over database management system?
in the entire search in the net i found only advantage of DBMS,if so what is the need for file?

  John K [03.22.07 05:26 AM]

I realize I'm WAY late in adding my comments to this discussion but there's a terribly emotional debate about the use of flat files over a database going on over at my blog.

I posted a short tutorial explaining why and how I used flat files to built a site of mine and man, the flames started ROLLING in.

http://themicrobusinessexperiment.blogspot.com/2007/03/how-to-build-fully-functioning-website.html

I also posted a rebuttal here:

http://themicrobusinessexperiment.blogspot.com/2007/03/response-to-those-who-hate-my-how-to.html

I think it's so interesting how this topic completely polarizes developers. Either you "get" why someone would use a flat file or you just don't get it.

  adminos [05.25.09 02:36 AM]

It's got a lot of truth to it though, but it all depends on what you store and how.

If you've got a site that's been setup in a Dreamweaver-style template system, it rather pales in comparison to a nice "real" template based MySQL application.

That doesn't have to be the only way though, you can just as easily store flat files of XML data and utilize something like XSLT on the server side to transform it into whatever format you desire.

Post A Comment:

 (please be patient, comments may take awhile to post)






Type the characters you see in the picture above.

RECOMMENDED FOR YOU

RECENT COMMENTS