Remember how Facebook used to lumber and strain? And have you noticed how it doesn’t feel slow anymore? That’s because the engineering team pulled off an impressive feat: an in-depth optimization and rewrite project made the site twice as fast.
Robert Johnson, Facebook’s director of engineering and a speaker at the upcoming Velocity and OSCON conferences, discusses that project and its accompanying lessons learned below. Johnson’s insights have broad application — you don’t need hundreds of millions of users to reap the rewards.
Facebook recently overhauled its platform to improve performance. How long did that process take to complete?
Robert Johnson: Making the site faster isn’t something we’re ever really done with, but we did make a big push in the second half of last year. It took about a month of planning and six months of work to make the site twice as fast.
What big technical changes were made during the rewrite?
In addition to the big site-wide projects, we also performed a lot of general cleanup to make everything smaller and lighter, and we incorporated best practices such as image spriting.
Were developers encouraged to work in different ways?
This was one of the trickiest parts of the project. Moving fast is one of our most important values, and we didn’t want to do anything to slow down development. So most of our focus was on building tools to make things perform well when developers do the things that are easiest for them. For example, with Primer, making it easy to integrate and hard to misuse was as important to its design as making it fast.
We also built detailed monitoring of everything that could affect performance, and set up systems to check code before release.
It’s really important that developers be automatically alerted when there’s a problem, instead of having to go out of their way to check every change. That way, people can continue innovating quickly, and only stop to deal with performance in the relatively unusual case that they’ve caused a problem.
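The alert-on-regression workflow described above can be sketched roughly as follows. This is an illustrative sketch, not Facebook’s actual tooling: the metric, window size, and threshold are all assumptions. The idea is that each release’s performance numbers are compared against a rolling baseline, and a developer only hears about it when a change pushes a metric past the threshold.

```python
# Sketch of automated performance-regression alerting (all names and
# thresholds are illustrative). A new metric value is compared against a
# rolling baseline; developers are alerted only when it's clearly worse.

from statistics import mean

BASELINE_WINDOW = 20      # number of recent releases forming the baseline
REGRESSION_FACTOR = 1.2   # alert if the metric is 20% worse than baseline

def check_release(history, new_value):
    """Return True if new_value is a regression against recent history."""
    if len(history) < BASELINE_WINDOW:
        return False  # not enough data to judge yet
    baseline = mean(history[-BASELINE_WINDOW:])
    return new_value > baseline * REGRESSION_FACTOR

# Example: page-render times (ms) from the last 20 releases, around ~203 ms.
history = [200, 210, 205, 198, 202] * 4
assert not check_release(history, 220)    # within normal variation: silent
assert check_release(history, 260)        # ~28% worse: alert the author
```

The key design point matches the interview: the check runs automatically on every release, so developers pay attention to performance only when a concrete number says they should.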
How do you address exponential growth? How do you get ahead of it?
You never get ahead of everything, but you have to keep ahead of most things most of the time. So whenever you go in to make a particular system scale better, you can’t settle for twice as good; you really need to shoot for 10 or 100 times as good. Making something twice as good only buys you a few months, and you’re back at it again as soon as you’re done.
In general, this means scaling things by allowing greater federation and parallelism and not just making things more efficient. Efficiency is of course important, too, but it’s really a separate issue.
Two other important things: have good data about how things are trending so you catch problems before you’re in trouble, and test everything you can before you have to rely on it.
Facebook’s size and traffic aren’t representative of most sites, but are there speed and scaling lessons you’ve learned that have universal application?
The most important one isn’t novel, but it’s worth repeating: scale everything horizontally.
For example, if you had a database for users that couldn’t handle the load, you might decide to break it into two functions — say, accounts and profiles — and put them on different databases. This would get you through the day, but it’s a lot of work, and it only buys you twice the capacity. Instead, you should write the code to handle the case where two users aren’t on the same database. This is probably even more work than splitting the application code in half, but it will continue to pay off for a very long time.
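The "handle users on different databases" approach can be sketched in a few lines. This is a minimal illustration, not production code: the shard count, helper names, and the use of plain dicts in place of real database connections are all assumptions.

```python
# Minimal sketch of horizontal sharding (all names are hypothetical).
# Any user can live on any database, so the code never assumes two
# users share one.

NUM_SHARDS = 16

def shard_for(user_id: int) -> int:
    """Map a user id to a database shard."""
    return user_id % NUM_SHARDS

# connections[i] would be a handle to database i; dicts stand in here.
connections = [dict() for _ in range(NUM_SHARDS)]

def save_profile(user_id, profile):
    connections[shard_for(user_id)][user_id] = profile

def load_profiles(user_ids):
    """Fetch several users, each possibly living on a different shard."""
    return {uid: connections[shard_for(uid)].get(uid) for uid in user_ids}

save_profile(7, {"name": "alice"})    # lands on shard 7
save_profile(20, {"name": "bob"})     # lands on shard 4
profiles = load_profiles([7, 20])     # transparently spans two shards
```

Once the application code tolerates users on different machines, adding capacity is mostly a matter of adding shards; a simple modulo scheme like this one requires rebalancing when the shard count changes, which is why real systems often layer a directory or consistent hashing on top.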
The most important thing here isn’t to have fancy systems for failover or load balancing. In fact, those things tend to take a lot of time and get you in trouble if you don’t get them right. You really just need to be able to split any function to run on multiple machines that operate as independently as possible.
The second lesson is to measure everything you can. Performance bottlenecks and scaling problems are often in unexpected places. The things you think will be hard are often not the biggest problems, because they’re the things you’ve thought about a lot. It’s actually a lot more like debugging than people realize. You can’t be sure your product doesn’t have bugs just by looking at the code, and similarly you can’t be sure your product will scale because you designed it well. You have to actually set it up and pound it with traffic — real or test — and measure what happens.
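The "measure everything, then look at what the numbers say" lesson can be sketched with a simple timing wrapper. This is an assumed, minimal instrumentation pattern, not Facebook’s monitoring stack; the decorator, the fake workloads, and the sleep durations are all illustrative.

```python
# Sketch of "measure everything": wrap operations with timers, drive test
# traffic through them, and rank the totals instead of guessing at the
# bottleneck. All function names and durations are illustrative.

import time
from collections import defaultdict

timings = defaultdict(list)

def timed(name):
    """Decorator that records wall-clock time for each call under `name`."""
    def wrap(fn):
        def inner(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                timings[name].append(time.perf_counter() - start)
        return inner
    return wrap

@timed("render")
def render_page():
    time.sleep(0.002)   # stand-in for page rendering work

@timed("query")
def run_query():
    time.sleep(0.001)   # stand-in for a database query

# Pound it with (test) traffic, then see where the time actually went.
for _ in range(50):
    render_page()
    run_query()

totals = {name: sum(ts) for name, ts in timings.items()}
hotspots = sorted(totals, key=totals.get, reverse=True)
```

The point of the pattern is exactly the debugging analogy in the interview: you don’t know where the time goes until you run real or synthetic traffic through the system and read the measurements.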
What is Scribe? How is it used within Facebook?
Scribe is a system we wrote to aggregate log data from thousands of servers. It turned out to be generally useful in a lot of places where you need to move large amounts of data asynchronously and you don’t need database-level reliability.
Scribe scales to extremely large volumes; I think we handle more than 100 billion messages a day now. It has a simple, easy-to-use interface, and it handles temporary network or machine failures gracefully.
We use Scribe for everything from logging performance data, to updating search indexes, to gathering metrics for platform apps and pages. There are more than 100 different logs in use at Facebook today.
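The usage pattern Scribe supports can be sketched as a client that tags each message with a category and buffers locally when the aggregator is temporarily unreachable. This is a hedged sketch of the pattern only: the class, transport, and buffer policy here are assumptions, not Scribe’s actual Thrift interface.

```python
# Hedged sketch of a Scribe-like client: (category, message) pairs are
# shipped to an aggregator, and buffered locally across temporary
# failures. The transport is a stand-in, not Scribe's real interface.

from collections import deque

class ScribeLikeClient:
    def __init__(self, send_fn, max_buffer=10000):
        self.send_fn = send_fn                   # ships a batch; raises on failure
        self.buffer = deque(maxlen=max_buffer)   # bounded: oldest dropped if full

    def log(self, category, message):
        self.buffer.append((category, message))
        self.flush()

    def flush(self):
        while self.buffer:
            entry = self.buffer[0]
            try:
                self.send_fn([entry])
            except ConnectionError:
                return  # aggregator down: keep entry buffered, retry later
            self.buffer.popleft()

received = []
client = ScribeLikeClient(send_fn=received.extend)
client.log("perf", "render_ms=203")
assert received == [("perf", "render_ms=203")]
```

The bounded buffer is the interesting tradeoff: it keeps a sender from exhausting memory during a long outage, at the cost of rare message loss, which is exactly the "less than database-level reliability" profile described above.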
I was struck by a phrase in one of your recent blog posts: You said Scribe has a “reasonable level of reliability for a lot of use cases.” How did you sell that internally?
For some use cases I didn’t. We can’t use the system for user data because it’s not sufficiently reliable, and keeping user data safe is something we take extremely seriously.
But there are a lot of things that aren’t user data, and in practice, data loss in Scribe is extremely rare. For many use cases it’s well worth it to be able to collect a massive amount of data.
For example, the statistics we provide to page owners depend on a large amount of data logged from the site. Some of this is from large pages where we could just take a sample of the data, but most of it is from small pages that need detailed reporting and can’t be sampled. A rare gap in this data is much better than having to limit the number of things we’re able to report to page owners, or only giving approximate numbers that aren’t useful for smaller pages.
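The sampling tradeoff described above can be sketched as a per-event decision. The threshold, sample rate, and function names here are all illustrative assumptions, not Facebook’s actual page-insights pipeline.

```python
# Illustrative sketch of the sampling tradeoff: high-traffic pages can be
# sampled and scaled back up, while low-traffic pages need every event
# counted exactly. Threshold and rate are assumed values.

import random

SAMPLE_RATE = 0.01              # keep 1% of events for large pages
LARGE_PAGE_THRESHOLD = 100_000  # daily views above which sampling is safe

def record_view(page, stats, daily_views):
    if daily_views[page] >= LARGE_PAGE_THRESHOLD:
        # Large page: sample, then scale each kept event back up.
        if random.random() < SAMPLE_RATE:
            stats[page] = stats.get(page, 0) + 1 / SAMPLE_RATE
    else:
        # Small page: a 1% sample of 50 views is useless; count exactly.
        stats[page] = stats.get(page, 0) + 1
```

For a large page the sampled estimate is accurate to within a small relative error, but for a page with a handful of views the same sample rate would usually report zero, which is why the small pages get full logging.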
This interview was condensed and edited.
Robert Johnson will discuss Facebook’s optimization techniques at the Velocity Conference (6/22-6/24) and OSCON (7/19-7/23).