Operations

 

Wed

Jul 1
2009

Steve Souders

Velocity and the Bottom Line

by Steve Souders | comments: 0

Velocity 2009 took place last week in San Jose, with Jesse Robbins and me serving as co-chairs. Back in November 2008, while we were planning Velocity, I said I wanted to highlight "best practices in performance and operations that improve the user experience as well as the company's bottom line." Much of my work focuses on the how of improving performance - tips developers use to create even faster web sites. What's been missing is the why. Why is it important for companies to focus on performance?

That question was answered at Velocity last week by speakers from AOL, Google, Microsoft, and Shopzilla.

  • Eric Schurman (Bing) and Jake Brutlag (Google Search) co-presented results from latency experiments conducted independently on each site. Bing found that a 2-second slowdown changed queries/user by -1.8% and revenue/user by -4.3%. Google Search found that a 400 millisecond delay resulted in a -0.59% change in searches/user. What's more, even after the delay was removed, these users still performed 0.21% fewer searches, indicating that a slower user experience affects long-term behavior. (video, slides)
  • Dave Artz from AOL presented several performance suggestions. He concluded with statistics that show page views drop off as page load times increase. Users in the top decile (the fastest page load times) view ~7.5 pages/visit. This drops to ~6 pages/visit in the 3rd decile, and bottoms out at ~5 pages/visit for users with the slowest page load times. (slides)
  • Marissa Mayer shared several performance case studies from Google. One experiment increased the number of search results per page from 10 to 30, with a corresponding increase in page load times from 400 milliseconds to 900 milliseconds. This resulted in a 25% dropoff in first result page searches. Adding the checkout icon (a shopping cart) to search results made the page 2% slower with a corresponding 2% drop in searches/user. (Watch the video to see the clever workaround they found.) Image optimizations in Google Maps made the page 2-3x faster, with a significant increase in user interaction with the site. (video, slides)
  • Phil Dixon, from Shopzilla, had the most takeaway statistics about the impact of performance on the bottom line. A year-long performance redesign resulted in a 5-second speedup (from ~7 seconds to ~2 seconds). This resulted in a 25% increase in page views, a 7-12% increase in revenue, and a 50% reduction in hardware. This last point shows the win-win of performance improvements: increasing revenue while driving down operating costs. (video, slides)
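
The percentages quoted above are all relative changes against a baseline. As a minimal sketch of that arithmetic, the snippet below applies it to the approximate Shopzilla page load times from the last bullet. The function name is illustrative and does not come from any of the talks.

```python
def relative_change(before: float, after: float) -> float:
    """Return the change from `before` to `after` as a fraction of `before`."""
    if before == 0:
        raise ValueError("baseline must be non-zero")
    return (after - before) / before


# Approximate Shopzilla page load times quoted above: ~7 s before, ~2 s after.
shopzilla_load_change = relative_change(7.0, 2.0)
print(f"Page load time changed by {shopzilla_load_change:+.0%}")
```

The same helper works for any of the before/after pairs in the list, as long as both values are measured in the same unit.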

These case studies provide real world numbers that show the benefits of making your site faster. Other Velocity sessions share techniques for implementing performance improvements, including sessions from me, Doug Crockford, and the Facebook and Google frontend teams. But what about the user experience? In his session, Matt Mullenweg (of WordPress fame) makes sure we remember the importance of how the user feels while interacting with our site:

That's why [performance] is important and why we should be obsessed and not be discouraged when it doesn't change the funnel. My theory here is when an interface is faster, you feel good. And ultimately what that comes down to is you feel in control. The web app isn't controlling me, I'm controlling it. Ultimately that feeling of control translates to happiness in everyone. In order to increase the happiness in the world, we all have to keep working on this.

Thanks to the Velocity speakers & their organizations for overcoming the many challenges required to present this data for the first time. We're now equipped with the financial justification, the technical know-how, and the visceral motivation to go out and make the Web a faster place. We'll have more performance success stories next year. Your company could be one of them! Capture your performance improvements and bottom line impact. We'd love to hear from you at Velocity 2010.

 

Mon

Jun 29
2009

Nat Torkington

Four short links: 29 June 2009

Sysadmin Wiki, Physics, National Archives, and Reinventing the British Government

by Nat Torkington (@gnat) | comments: 1

  1. Server Fault -- Wikipedia-like sysadmin guide, built by the Stack Overflow team, who are branching out to reach a more general IT Professional audience. (via Brady in email)
  2. Sixty Symbols -- 5m videos about the symbols of physics and astronomy. Great stuff! (via Glutnix on Twitter)
  3. US National Archives launches YouTube Channel -- a mixture of archives-nerd stuff (directors of Presidential Libraries talking about their favourite items) and wider-interest collections (such as Touring 1930s America).
  4. Open House in Westminster -- the ever-insightful Tom Steinberg from MySociety has an article in the Independent about British plans to reinvent government. Now the talk of Westminster is all about democratic reform. By my count there are over 50 different ideas for changing the way our democracy works being touted by different pundits at the moment. [...] What all these ideas, though, have in common is that they propose structural reforms that could have been achieved any time in the last 200 years.[...] My view is that these proposals are all interesting, and some may be quite critical for a better democracy. But I am also concerned that they do not see Parliament and the process of making laws as a native to the internet would. They don’t ask: “What reforms are possible that just weren’t conceivable ten years ago?”

 

Wed

Jun 24
2009

Jesse Robbins

Jonathan Heiliger on Web Performance, Operations, and Culture

by Jesse Robbins (@jesserobbins) | comments: 0

We were honored to have Jonathan Heiliger, Facebook’s VP of Technology Operations, as our opening keynote speaker at Velocity. Jonathan is one of the most accomplished leaders in our field, and is a master of the craft.

Here is his keynote in its entirety:

Note: Other videos from Velocity are being posted to VelocityConference.blip.tv

 

Fri

Jun 19
2009

Scott Ruthfield

Announcing: Spike Night at Velocity

by Scott Ruthfield (@scottru) | comments: 5

Guest blogger Scott Ruthfield is a Program Committee member of the O'Reilly Velocity: Web Performance & Operations Conference. 


Web Operations is not for the casual observer: it's for a particular kind of adrenaline junkie that's motivated by graphs and servers spinning out of control.  Jumping in, on-your-feet analysis, and experience-based-experimentation are all part of solving new problems caused by unexpected user and machine behavior, and keeping a clear head when service owners and executives are panicking is part of the job. 

A core part of operations leadership is spike management - what you do when you see a significantly larger amount of load than you've had before. Sometimes this is predictable months out (Amazon knows, for example, that the first or second Monday of December will be their biggest day each year), sometimes days out (Twitter knew Oprah was coming), and sometimes not at all (what we still call the Slashdot Effect). Every web ops professional deals with some kind of spike - even intranets manage paydays and employee review days - and if you're into it, well, spikes can be fun. Of course, maybe you use EC2 Auto-Scaling, and so (in theory) don't have to worry about it, although of course bottlenecks come in many forms.

So at Velocity this year, we're trying out something new: Spike Night.

Spike Night is a chance to see and learn about how real, high-traffic websites deal with massive increases in load, either expected or unexpected. We'll see real-world management of traffic increases - graphs, tools, the whole shebang.

Now, it turns out that when I called up lots of people on the phone and said "can we throw massive load at your website so you can stand on stage and brag about it," many web ops folks were excited, but then they started worrying about little things like "what if something goes wrong and everyone blogs about it" or "do I have to ask somebody in a PR department" and then calls went unreturned.

Fortunately, two parties have stepped up, and I can't wait to see what they have to show:
  • Chris Bissell, Chief Software Architect at MySpace, and members of the MySpace team will demonstrate a massive, real increase in traffic, and will manage it on-stage. MySpace already deals with tens of thousands of hits each second - we can't throw enough traffic at them to cause any harm - so they'll cause their own harm and then show how they work through it.
  • Ryan Nelson, Operations Director for MLB Advanced Media and MLB.com, will walk us through a combination of war stories and live traffic management to show what happens when millions of baseball fans all want to see what's happened after the commercial break at the exact same time. Between their very popular desktop apps and their newly-announced iPhone game streaming, the MLB is a true leader in technology innovation with a rabid fan base that goes well beyond the Web 2.0 echo chamber.
Spike Night is meant to be a fun event, taking place Tuesday June 23rd @ 7:30PM at Velocity, and open to the larger web community - a Velocity conference pass is not required to attend. I'm looking forward to hosting interesting demos and a fun Q&A, and hope to see all of you there!

 

Mon

Jun 8
2009

Jesse Robbins

Ignite! comes to San Jose June 22nd - Submit your talks now!

by Jesse Robbins (@jesserobbins) | comments: 0

Ignite! is coming to San Jose on Monday June 22, 2009 at 8:00 pm, attached to the Velocity Conference. Admission is free, open to all, and there will be a cash bar.

The deadline for talks is May 11th, so submit your talks now!

As with all Ignites, each speaker will only get 20 slides that each auto-advance every 15 seconds for a total of five minutes. We'll be looking for fun geek topics like hacks, how-to's, and insights. (Talks don't have to be Velocity-related!) If you're not sure what an Ignite talk looks like, check out the Ignite Show.

You can RSVP for the event on Upcoming or Facebook.

 

Mon

May 18
2009

James Turner

Velocity Preview - The Greatest Good for the Greatest Number at Microsoft

by James Turner | comments: 4

You may also download this file. Running time: 00:20:26

Subscribe to this podcast series via iTunes. Or, visit the O'Reilly Media area at iTunes to find other podcasts from O'Reilly.

The psychology of engineering user experiences on the web can be difficult. How much rich content can you place up on a page before the load time drives away your visitors? Get the answer wrong, and you can end up with a ghost town; get it right and you're a star. Eric Schurman knows this well, since he is responsible for just those kinds of trade-off decisions on some of Microsoft's highest traffic pages. He'll be speaking at O'Reilly's Velocity Conference in June, and he recently talked with us about how Microsoft tests different user experiences on small groups of visitors.

James Turner: Why don't you start by describing what your gig at Microsoft is now and what your career path has been there?

Eric Schurman: I'm a principal dev lead for Live Search, what used to be MSN Search. And I started at Microsoft back in the late 90s working in Microsoft's Press organization, where we actually were developing training software that would emulate new Microsoft products, but didn't require those products to be on a user's machine. So, for example, if you had an organization that was running Windows 95, we would have a training system for Windows 98 that would emulate a bunch of the functionality of Windows 98 so that you could deploy it to your people. They could train their people on how to use Windows 98 before they actually deployed it.

I then moved on to the Microsoft Press website, where I became the dev lead for it. I made a few other moves and ended up going to Microsoft.com, where I ran the download center, the Microsoft.com homepage, the product catalog, and a bunch of other places from a dev perspective.

I then moved to what was then MSN Search, back in about 2005, and was there through the MSN to Live transition. At the time, I wasn't working on performance; I was just working on the Live Search application. And it became very obvious that we had some major performance problems. Performance has always been one of my really strong interests, so I took on addressing a lot of those. And when we addressed them, we had very significant improvements in our business metrics. That really surfaced how important performance was to the organization, and I moved into a role where I was really focusing just on performance. I've been in that role now for about two years.

JT: You've worked on at least three very different parts of the Microsoft website. The homepage has lots of hits, fairly static. The download page is a lot of data for long periods of time. Live Search is high volume, but there's also a lot of backend on that. In what ways do you need to architect them differently? And where can you reuse the same lessons?

ES: That's a great question. On the web, you've got different concerns than what you have for client apps. The main things that tend to impact end-user perceived performance on the web are often things about how you've designed your application from a network perspective. So how many different HTTP GET requests are you making? How are those GET requests structured? So, for example, are they serialized? Did you have a JavaScript file that then gets returned to the browser that requests another JavaScript file and another JavaScript file and then some content and then it finally gets rendered? So the number of assets that you request, that's going to be something that's important no matter what product you're doing.

There are other things, like how much script do you have on the page, how much CSS you have on the page, how much actual content are you rendering to the page, etcetera. There are tricks that you can use like combining many different graphics into a single tiled image and sending that down to the browser. It's much faster to send one image to the browser than, say, 20 images. Even if you end up sending the same overall graphics, but combined into one, it's still much faster to send it as one request.
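
A rough way to see why the number of requests matters: if each request pays a fixed round-trip cost and requests are fetched one after another, total time is roughly the number of requests times the round-trip latency, plus total bytes divided by bandwidth. The sketch below uses that simplified model. The latency, bandwidth, and image sizes are hypothetical, and real browsers fetch several assets in parallel, so treat the numbers as illustrative only.

```python
def serialized_fetch_seconds(num_requests: int, total_bytes: int,
                             rtt_seconds: float, bytes_per_second: float) -> float:
    """Estimate fetch time when requests are issued one after another.

    Simplified model (an assumption, not a measurement): every request pays
    one round trip of latency, and the whole payload shares a fixed bandwidth.
    """
    return num_requests * rtt_seconds + total_bytes / bytes_per_second


# Hypothetical numbers: 20 icons of 2 KB each, 100 ms round trip, 1 MB/s link.
TOTAL_BYTES = 20 * 2_000
separate = serialized_fetch_seconds(20, TOTAL_BYTES, 0.1, 1_000_000)
combined = serialized_fetch_seconds(1, TOTAL_BYTES, 0.1, 1_000_000)
print(f"20 separate images: {separate:.2f} s, one combined image: {combined:.2f} s")
```

Under this model the payload term is identical in both cases; the entire difference comes from the per-request round trips, which is the point made above about sending the same graphics as one request.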

There are also different data volume concerns. They're also different from a business perspective. A lot of what we were sending out from the download center was extremely time critical. We would have an update go out, and we needed to make sure that update was going to be available anywhere in the world within a certain time frame, which required us to handle very high bandwidth, and a very high volume of requests coming into the site that were transferring lots of bits. So that required something totally different than something like the Microsoft.com homepage.

It's also interesting looking at the volume of traffic and how that traffic reflects real users. So, for example, one of the problems that you end up with on both the Microsoft homepage and Live Search is that we have a huge number of bots that are trying to hit the system, lots of people trying to do SEO work are trying to hit search engines to gather information about their site, about competitor sites, about all sorts of things. On the Microsoft.com homepage, it's always under distributed denial of service attacks. It's not a question of how frequently does it happen; it's just what is the rate right now? Also, the Microsoft.com homepage has historically had such a high up-time rate that it's actually hit by a lot of hardware devices simply to check for connectivity to the internet. And so you'd want to treat a request from that kind of "user" very differently from a request that's coming from a real user.

So that's kind of a long, rambling answer to your question. Do you have any areas that you want me to drill in or maybe talk about something else?

(continue reading)

 

Fri

May 8
2009

Jesse Robbins

Velocity 2009 - Big Ideas (early registration deadline)

by Jesse Robbins (@jesserobbins) | comments: 6

[Image: tag cloud created from Velocity session & speaker information using wordle.net]

My favorite interview question to ask candidates is: "What happens when you type www.(amazon|google|yahoo).com in your browser and press return?"

While the actual process of serving and rendering a page takes seconds to complete, describing it in real detail can take an hour. A good answer spans every part of the Internet from the client browser & operating system, DNS, through the network, to load balancers, servers, services, storage, down to the operating system & hardware, and all the way back again to the browser. It requires an understanding of TCP/IP, HTTP, & SSL deep enough to describe how connections are managed, how load-balancers work, and how certificates are exchanged and validated... and that's just the first request!

Web Performance & Operations is an emerging discipline which requires incredible breadth, focusing less on specific technologies and more on how the entire system works together. While people often specialize in particular components, great engineers always think of that component in relation to the whole. The best engineers are able to fly to the 50,000 foot view and see the entire system in motion and then zoom in to microscopic levels and examine the tiny movements of an individual part.

John Allspaw recently described this interconnectedness on his blog:

With websites, the introduction of change (for example, a bad database query) can affect (in a bad way) the entire system, not just the component(s) that saw the change. Adding handfuls of milliseconds to a query that’s made often, and you’re now holding page requests up longer. The same thing applies to optimizations as well. Break that [bad] query into two small fast ones, and watch how usage can change all over the system pretty quickly. Databases respond a bit faster, pages get built quicker, which means users click on more links, etc. This second-order effect of optimization is probably pretty familiar to those of us running sites of decent scale.

Working with these systems requires an understanding not only of the way technology interacts, but the way that people do as well. The structure, operation, and development of a website mirrors the organization that creates it, which is why so many people in WebOps focus on understanding and improving management culture & process.

Organizing a conference like Velocity is a wonderful challenge because it requires the same sort of thinking. We focus on the big concepts that everyone needs to know and then go deep into the technologies that change our understanding of the system. We find ways to share the unique experience that can only be gained by operating at scale. We make it safe to share as much of the "Secret Sauce" as we can.

Please join us at Velocity this year; we have an amazing lineup of speakers & participants. Early registration ends on Monday, May 11th at 11:59 PM Pacific. (Radar readers can use "vel09cmb" for an additional 15% discount.)

Velocity, the Web Performance and Operations Conference 2009

 

Thu

May 7
2009

James Turner

Velocity Preview - Keeping Twitter Tweeting

by James Turner | comments: 3

You may also download this file. Running time: 00:10:46

Subscribe to this podcast series via iTunes. Or, visit the O'Reilly Media area at iTunes to find other podcasts from O'Reilly.

If there's a site that exemplifies explosive growth, it has to be Twitter. It seems like everywhere you look, someone is Tweeting, or talking about Tweeting, or Tweeting about Tweeting. Keeping the site responsive under that type of increase is no easy job, but it's one that John Adams has to deal with every day, working in Twitter Operations. He'll be talking about that work at O'Reilly's Velocity Conference, in a session entitled Fixing Twitter: Improving the Performance and Scalability of the World's Most Popular Micro-blogging Site, and he spent some time with us to talk about what is involved in keeping the site alive.

James Turner: Can you start by describing the platforms and technologies that make Twitter run today?

John Adams: Twitter currently runs on Ruby on Rails. And we also use a combination of Java and Scala, and a number of homegrown scripts that run the site. We also use a lot of open-source tools like Apache, MySQL, memcached.

JT: What type of hardware are you running on?

JA: It's all Linux, so a lot of x86 hardware. I can't tell you the brands or how many.

JT: Do you make any kind of attempt to stay homogeneous in that?

JA: Yes, we do. All of our hardware is very consistent. It makes deployment of new software very easy. And we also use a number of configuration management tools like Puppet to deliver software to those machines.

JT: As anyone can see, Twitter has had a pretty explosive growth, especially recently. Were you prepared for this kind of ramp up?

JA: I don't think so. I mean we're growing week over week in enormous numbers. And we spend a lot of time calculating the growth and scalability of the site to make sure that we can handle the upcoming load.

JT: I mean obviously there are events like Oprah decides she's going to Tweet that are going to be spikes. Do you try to get warning of that stuff?

JA: Yeah. And frequently we know of major events happening. Major events are very predictable like Macworld, even any massive amount of media interaction, we have some fair warning beforehand.

(continue reading)

 

Fri

Apr 10
2009

Jesse Robbins

AT&T Fiber cuts remind us: Location is a Basket too!

by Jesse Robbins (@jesserobbins) | comments: 3

The fiber cuts affecting much of the San Francisco Bay Area this week are similar to the outages in the Middle East last year (radar post), although far more limited in scope and impact.   What I said last year still holds true and is repeated below: 

From an operations perspective these kinds of outages are nothing new, and underscore why having "many eggs in few baskets" is such a problem. I believe we will see similar incidents when we have the first multi-datacenter failures where multiple providers lose significant parts of their infrastructure in a single geographic area.

Remember: Don't put all your eggs in one basket... and Location is a basket too!

To really understand the issue, I recommend Neal Stephenson's incredible (and lengthy) Wired article from 1996 entitled "Mother Earth Mother Board":

[...] It sometimes seems as though every force of nature, every flaw in the human character, and every biological organism on the planet is engaged in a competition to see which can sever the most cables. The Museum of Submarine Telegraphy in Porthcurno, England, has a display of wrecked cables bracketed to a slab of wood. Each is labeled with its cause of failure, some of which sound dramatic, some cryptic, some both: trawler maul, spewed core, intermittent disconnection, strained core, teredo worms, crab's nest, perished core, fish bite, even "spliced by Italians." The teredo worm is like a science fiction creature, a bivalve with a rasp-edged shell that it uses like a buzz saw to cut through wood - or through submarine cables. Cable companies learned the hard way, early on, that it likes to eat gutta-percha, and subsequent cables received a helical wrapping of copper tape to stop it.

[...] There is also the obvious threat of sabotage by a hostile government, but, surprisingly, this almost never happens. When cypherpunk Doug Barnes was researching his Caribbean project, he spent some time looking into this, because it was exactly the kind of threat he was worried about in the case of a data haven. Somewhat to his own surprise and relief, he concluded that it simply wasn't going to happen. "Cutting a submarine cable," Barnes says, "is like starting a nuclear war. It's easy to do, the results are devastating, and as soon as one country does it, all of the others will retaliate."

As the capacity of optical fibers climbs, so does the economic damage caused when the cable is severed. FLAG makes its money by selling capacity to long-distance carriers, who turn around and resell it to end users at rates that are increasingly determined by what the market will bear. If FLAG gets chopped, no calls get through. The carriers' phone calls get routed to FLAG's competitors (other cables or satellites), and FLAG loses the revenue represented by those calls until the cable is repaired. The amount of revenue it loses is a function of how many calls the cable is physically capable of carrying, how close to capacity the cable is running, and what prices the market will bear for calls on the broken cable segment. In other words, a break between Dubai and Bombay might cost FLAG more in revenue loss than a break between Korea and Japan if calls between Dubai and Bombay cost more.

The rule of thumb for calculating revenue loss works like this: for every penny per minute that the long distance market will bear on a particular route, the loss of revenue, should FLAG be severed on that route, is about $3,000 a minute. So if calls on that route are a dime a minute, the damage is $30,000 a minute, and if calls are a dollar a minute, the damage is almost a third of a million dollars for every minute the cable is down. Upcoming advances in fiber bandwidth may push this figure, for some cables, past the million-dollar-a-minute mark. [Link]
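
The rule of thumb in the excerpt above reduces to a single multiplication. A minimal sketch follows: the $3,000 constant comes from the quoted article, while the function names and the outage-duration helper are illustrative additions.

```python
# Dollars of lost revenue per minute of outage, for each cent per minute
# that calls cost on the route (the figure given in the quoted article).
LOSS_PER_MINUTE_PER_CENT = 3_000


def revenue_loss_per_minute(price_cents_per_minute: float) -> float:
    """Lost revenue in dollars for each minute the cable is down."""
    return LOSS_PER_MINUTE_PER_CENT * price_cents_per_minute


def outage_cost(price_cents_per_minute: float, outage_minutes: float) -> float:
    """Total lost revenue in dollars for an outage of the given length."""
    return revenue_loss_per_minute(price_cents_per_minute) * outage_minutes


print(revenue_loss_per_minute(10))   # calls at a dime a minute
print(revenue_loss_per_minute(100))  # calls at a dollar a minute
```

At a dime a minute this gives $30,000 per minute, and at a dollar a minute $300,000 per minute, matching the "almost a third of a million dollars" in the excerpt.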

It's also worth mentioning the outages to multiple service providers hosted in a single colocation facility when the FBI seized all the equipment in the facility, the big outage at 365 Main from two years ago, and many others (see: Radar posts & comprehensive coverage at Data Center Knowledge).

(If Web Operations & Infrastructure is your interest or passion, you should attend Velocity 2009.  You can use the code "vel09cmb" for a 15% discount)

(Image source: http://www.flickr.com/photos/mundane_joy/2301368102/)

 

Thu

Feb 26
2009

Simon Wardley

Karmic Koalas Love Eucalyptus

by Simon Wardley | comments: 7

Guest blogger Simon Wardley, a geneticist with a love of mathematics and a fascination for economics, is the Software Services Manager for Canonical, helping define future cloud computing strategies for Ubuntu. Simon is a passionate advocate and researcher in the fields of open source, commoditization, innovation, and cybernetics.

Mark Shuttleworth recently announced that the release of Ubuntu 9.10 will be code-named Karmic Koala. Whilst many of the developments around Ubuntu 9.10 are focused on the desktop, a significant effort is being made on the server release to bring Ubuntu into the cloud computing space. The cloud effort begins with 9.04 and the launch of a technology preview of Eucalyptus, an open sourced system for creating Amazon EC2-like clouds, on Ubuntu.

I thought I'd discuss some of the reasoning behind Ubuntu's Cloud Computing strategy. Rather than just give a definition of cloud computing, I'll start with a closer look at its underlying causes.

The computing stack is comprised of many layers, from the applications we write, to the platforms we develop in and the infrastructure we build upon. Some activities at various layers of this stack have become so ubiquitous and well defined that they are now suitable for service provision through volume operations. This has led to the growth of the 'as a Service' industries, with providers like Amazon EC2 and Force.com.

Information Technology's shift from a product to a service-based economy brings with it both advantage and disruption. On the one hand, the shift offers numerous benefits including economies of scale (through volume operations), focus on core activities (outsourcing), acceleration in innovation (componentisation), and pay per use (utility charging). On the other hand, many concerns remain, some relating to the transitional nature of this shift (management, security and trust), while others pertain to the general outsourcing of any common activity (second sourcing options, competitive pricing pressures and lock-in). These concerns create significant adoption barriers for the cloud.

At Canonical, the company that sponsors and supports Ubuntu, we intend to provide our users with the ability to build their own clouds whilst promoting standards for the cloud computing space. We want to encourage the formation of competitive marketplaces for cloud services with users having choice, freedom, and portability between providers. In a nutshell, and with all due apologies to Isaac Asimov, our aim is to enable our users with 'Three Rules Happy' cloud computing. That is to say:

  • Rule 1: I want to run the service on my own infrastructure.

  • Rule 2: I want to easily migrate the service from my infrastructure to a cloud provider and vice versa with a few clicks of a button.

  • Rule 3: I want to easily migrate the service from one cloud provider to another with a few clicks of a button.

(continue reading)

 

Thu

Feb 12
2009

Artur Bergman

Cloud Computing defined by Berkeley RAD Labs

by Artur Bergman | comments: 4

I am pleased to finally have found a paper that manages to bring together the different aspects of cloud computing in a coherent fashion, and suggests the requirements for it to develop further.

Written by the Berkeley RAD Lab (UC Berkeley Reliable Adaptive Distributed Systems Laboratory), the paper succinctly brings together Software as a Service with Utility Computing to come up with a workable definition of Cloud Computing and is a recommended read.

The services themselves have long been referred to as Software as a Service (SaaS). The datacenter hardware and software is what we will call a Cloud. When a Cloud is made available in a pay-as-you-go manner to the general public, we call it a Public Cloud; the service being sold is Utility Computing. We use the term Private Cloud to refer to internal datacenters of a business or other organization, not made available to the general public. Thus, Cloud Computing is the sum of SaaS and Utility Computing, but does not include Private Clouds.

Exploring the difference between the raw service of Amazon EC2 and the high-level, web-centered Google App Engine, the highlights are:

  • Insight into the pay-as-you-go aspect with no commitments
  • Analysis of cost with regard to peak and elasticity in the face of unknown demand
  • Cost of data transfers versus processing time
  • Seamless migration of user to cloud processing
  • Limits and problems with I/O on shared hardware

They raise the following obstacles and opportunities, echoing Tim's posts on Open Source and Cloud Computing and Web 2.0 and Cloud Computing.
  • Availability of Service
  • Data Lock-In
  • Data Confidentiality and Auditability
  • Data Transfer Bottlenecks
  • Performance Unpredictability
  • Scalable Storage
  • Bugs in Large-Scale Distributed Systems
  • Scaling Quickly
  • Reputation Fate Sharing
  • Software Licensing

I particularly find interesting the analysis of transportation cost versus computing cost; when is it more efficient to use EC2 than your own individual processing? I predict the speed of light and the availability of raw transfer capacity are going to become an even larger obstacle. (Both inside computers, between them on local LANs and on WANs.)
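
To make the trade-off concrete, here is a minimal sketch of a time-only comparison: moving a job to the cloud is faster when the time to ship the data plus the cloud compute time is less than the local compute time. This is a simplification of my own, not the paper's cost model; it ignores pricing entirely, and the data size, link speed, and compute times are hypothetical.

```python
def transfer_seconds(data_bytes: float, bandwidth_bits_per_second: float) -> float:
    """Time to move `data_bytes` over a link of the given speed."""
    return data_bytes * 8 / bandwidth_bits_per_second


def cloud_is_faster(data_bytes: float, bandwidth_bits_per_second: float,
                    local_compute_seconds: float,
                    cloud_compute_seconds: float) -> bool:
    """True when shipping the data and computing remotely beats computing locally."""
    remote_total = (transfer_seconds(data_bytes, bandwidth_bits_per_second)
                    + cloud_compute_seconds)
    return remote_total < local_compute_seconds


# Hypothetical job: 10 TB of input over a 20 Mbit/s WAN link.
seconds = transfer_seconds(10e12, 20e6)
print(f"Transfer alone takes about {seconds / 86_400:.0f} days")
```

With numbers like these the transfer term dominates, so even a large speedup in compute time cannot make the remote option win unless the data is already in the cloud or the link is much faster.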

The paper reinforces my belief in the cloud, and my view that we need open source cloud environments and a larger ecosystem of providers.

Read more on the Above the Clouds blog.

 

Thu

Feb 5
2009

Jesse Robbins

Understanding Web Operations Culture - the Graph & Data Obsession

by Jesse Robbins (@jesserobbins) | comments: 7

We’re quite addicted to data pr0n here at Flickr. We’ve got graphs for pretty much everything, and add graphs all of the time.

-John Allspaw, Operations Engineering Manager at Flickr & author of The Art of Capacity Planning

One of the most interesting parts of running a large website is watching the effects of unrelated events affecting user traffic in aggregate. Web traffic is something that companies typically keep very secret, and often the only time engineers can talk about it is late at night, at a bar, and very much off the record.

There are many good reasons for keeping this kind of information confidential, particularly for publicly traded companies with complicated disclosure requirements. There are also downsides, the biggest being that it is difficult for peers to learn from each other and compare notes.

John Allspaw recently created a WebOps Visualizations group on Flickr for sharing these kinds of graphs with the confidential information removed. Here’s an example of a traffic drop seen both by Flickr & by Last.FM that coincided with President Obama’s inauguration.

[Graph: drop in web traffic to Flickr during the Obama inauguration, shared by John Allspaw]

Similar traffic drop on Last.FM seen on the right

[Graph: traffic drop to Last.FM during the Obama inauguration, on the right]

Google saw a similar drop as well

[Graph: traffic drop to Google during the Obama inauguration]

Was it because everybody went to Twitter?

[Graph: traffic spike on Twitter during the Obama inauguration]

Besides being an interesting story, sharing these kinds of graphs helps people build better monitoring tools and processes. As just one example: How should the WebOps team respond to this dip in traffic? Is it an outage? The inauguration was a very well known event and so it’s easy to explain the drop in traffic… what happens when a similar drop in traffic occurs? Should the WebOps team be looking at CNN (or trends in Twitter) along with everything else?

How do you tell when that unexpected 10% drop in traffic is really just people with something more important to do than browse your site?
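
One way to frame that question in a monitoring tool is to compare current traffic against a baseline and check whether a known external event could explain the gap. The sketch below is a hypothetical illustration, not a description of how Flickr, Last.FM, or anyone else does it: the baseline (say, the same time last week), the 10% threshold, and the `known_event` flag (fed from a calendar or news watch) are all assumptions.

```python
def traffic_drop_fraction(baseline_rps: float, current_rps: float) -> float:
    """Fraction by which current traffic is below baseline (negative if above)."""
    if baseline_rps <= 0:
        raise ValueError("baseline must be positive")
    return (baseline_rps - current_rps) / baseline_rps


def classify_drop(baseline_rps: float, current_rps: float,
                  known_event: bool, threshold: float = 0.10) -> str:
    """Label a traffic reading as normal, an expected dip, or worth investigating."""
    drop = traffic_drop_fraction(baseline_rps, current_rps)
    if drop < threshold:
        return "normal"
    # A large drop during a known external event is probably not an outage.
    return "expected dip" if known_event else "investigate"


# Hypothetical readings in requests per second.
print(classify_drop(1000, 850, known_event=True))
print(classify_drop(1000, 850, known_event=False))
```

The hard part, as the post suggests, is populating `known_event`: the arithmetic is trivial, but knowing that the world is watching something else requires data from outside your own graphs.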

(Note: Updated since original posting to add Google & Twitter graphs and annotations, and to switch the Last.FM graphic with an annotated one after I got permission.)

 
