Velocity Preview – The Greatest Good for the Greatest Number at Microsoft

The psychology of engineering user experiences on the web can be difficult. How much rich content can you put on a page before the load time drives away your visitors? Get the answer wrong, and you can end up with a ghost town; get it right and you're a star. Eric Schurman knows this well, since he is responsible for just those kinds of trade-off decisions on some of Microsoft's highest-traffic pages. He'll be speaking at O'Reilly's Velocity Conference in June, and he recently talked with us about how Microsoft tests different user experiences on small groups of visitors.

James Turner: Why don’t you start by describing what your gig at Microsoft is now and what your career path has been there?

Eric Schurman: I’m a principal dev lead for Live Search, what used to be MSN Search. And I started at Microsoft back in the late 90s working in Microsoft’s Press organization, where we actually were developing training software that would emulate new Microsoft products, but didn’t require those products to be on a user’s machine. So, for example, if you had an organization that was running Windows 95, we would have a training system for Windows 98 that would emulate a bunch of the functionality of Windows 98 so that you could deploy it to your people. They could train their people on how to use Windows 98 before they actually deployed it.

I then moved on to the Microsoft Press website, where I became the dev lead for it. I made a few other moves and ended up going to Microsoft.com, where I ran the download center, the Microsoft.com homepage, the product catalog, and a bunch of other places from a dev perspective.

I then moved to what was then MSN Search, back in about 2005, and was there through the MSN to Live transition. At the time, I wasn't working on performance; I was just working on the Live Search application. And it became very obvious that we had some major performance problems. Performance has always been one of my really strong interests, so I took on addressing a lot of those. And when we addressed them, we had very significant improvements in our business metrics. That really surfaced how important performance was to the organization, and I moved into a role where I was really focusing just on performance. I've been in that role now for about two years.

JT: You’ve worked on at least three very different parts of the Microsoft website. The homepage has lots of hits, fairly static. The download page is a lot of data for long periods of time. Live Search is high volume, but there’s also a lot of backend on that. In what ways do you need to architect them differently? And where can you reuse the same lessons?

ES: That's a great question. On the web, you've got different concerns than you have for client apps. The main things that tend to impact end-user perceived performance on the web are often things about how you've designed your application from a network perspective. So how many different HTTP GET requests are you making? How are those GET requests structured? So, for example, are they serialized? Did you have a JavaScript file that then gets returned to the browser that requests another JavaScript file and another JavaScript file and then some content and then it finally gets rendered? So the number of assets that you request, that's going to be something that's important no matter what product you're doing.

There are other things, like how much script do you have on the page, how much CSS you have on the page, how much actual content are you rendering to the page, etcetera. There are tricks that you can use like combining many different graphics into a single tiled image and sending that down to the browser. It's much faster to send one image to the browser than, say, 20 images. Even if you end up sending the same overall graphics, but combined into one, it's still much faster to send it as one request.
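
The "single tiled image" trick Schurman mentions is usually done at build time: a small tool stitches the individual icons into one sprite sheet, and CSS background-position then picks out each icon on the page. A hedged C# sketch of such a tool follows; the class name and the simple horizontal-strip layout are just one illustrative way to do it.

```csharp
using System;
using System.Drawing;
using System.Drawing.Imaging;

// Hypothetical build-time helper: stitch many small images into one
// horizontal sprite strip so the page needs a single image request.
public static class SpriteBuilder
{
    public static void BuildHorizontalSprite(string[] inputPaths, string outputPath)
    {
        Image[] images = new Image[inputPaths.Length];
        try
        {
            int totalWidth = 0, maxHeight = 0;
            for (int i = 0; i < inputPaths.Length; i++)
            {
                images[i] = Image.FromFile(inputPaths[i]);
                totalWidth += images[i].Width;
                maxHeight = Math.Max(maxHeight, images[i].Height);
            }

            using (var sprite = new Bitmap(totalWidth, maxHeight))
            using (var graphics = Graphics.FromImage(sprite))
            {
                int x = 0;
                foreach (Image image in images)
                {
                    // CSS later selects each icon by its x offset (background-position).
                    graphics.DrawImage(image, x, 0, image.Width, image.Height);
                    x += image.Width;
                }
                sprite.Save(outputPath, ImageFormat.Png);
            }
        }
        finally
        {
            foreach (Image image in images)
                image?.Dispose();
        }
    }
}
```

The payoff is exactly the one he describes: the same pixels arrive at the browser, but as one HTTP request instead of twenty.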

There are also different data volume concerns. They’re also different from a business perspective. A lot of what we were sending out from the download center was extremely time critical. We would have an update go out, and we needed to make sure that update was going to be available anywhere in the world within a certain time frame, which required us to handle very high bandwidth, and a very high volume of requests coming into the site that were transferring lots of bits. So that required something totally different than something like the Microsoft.com homepage.

It's also interesting looking at the volume of traffic and how that traffic reflects real users. So, for example, one of the problems that you end up with on both the Microsoft homepage and Live Search is that we have a huge number of bots that are trying to hit the system; lots of people doing SEO work are trying to hit search engines to gather information about their site, about competitor sites, about all sorts of things. The Microsoft.com homepage is always under distributed denial of service attacks. It's not a question of how frequently it happens; it's just what is the rate right now? Also, the Microsoft.com homepage has historically had such a high up-time rate that it's actually hit by a lot of hardware devices simply to check for connectivity to the internet. And so you'd want to treat a request from that kind of "user" very differently from a request that's coming from a real user.
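
Schurman doesn't describe how Live Search actually separates these request types, but a minimal sketch of the idea, sorting connectivity probes and obvious bots out from traffic that should count as real users, might look like the following C#. The class, field names, patterns, and thresholds are all invented for illustration.

```csharp
using System;

// Hypothetical request summary; the fields are illustrative, not Microsoft's.
public class RequestInfo
{
    public string UserAgent { get; set; }
    public string ClientIp { get; set; }
    public int RequestsInLastMinute { get; set; }
}

public enum TrafficClass { RealUser, ConnectivityProbe, LikelyBot }

public static class TrafficClassifier
{
    // Toy pattern lists; a real system would derive these from observed traffic.
    private static readonly string[] ProbeAgents = { "HealthCheck", "Monitor", "Probe" };
    private static readonly string[] BotAgents = { "bot", "crawler", "spider" };

    public static TrafficClass Classify(RequestInfo request)
    {
        string agent = request.UserAgent ?? string.Empty;

        // Hardware devices hitting the page just to confirm internet connectivity.
        foreach (string probe in ProbeAgents)
            if (agent.IndexOf(probe, StringComparison.OrdinalIgnoreCase) >= 0)
                return TrafficClass.ConnectivityProbe;

        // Self-identified crawlers, or clients requesting far faster than any human.
        foreach (string bot in BotAgents)
            if (agent.IndexOf(bot, StringComparison.OrdinalIgnoreCase) >= 0)
                return TrafficClass.LikelyBot;
        if (request.RequestsInLastMinute > 300)
            return TrafficClass.LikelyBot;

        return TrafficClass.RealUser;
    }
}
```

Whatever the real mechanism, the point is the same: traffic that isn't a person shouldn't be allowed to skew capacity planning or the business metrics discussed later in the interview.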

So that’s kind of a long, rambling answer to your question. Do you have any areas that you want me to drill in or maybe talk about something else?

JT: You mentioned denial of service, and certainly Microsoft is a high profile target. How much of the effort involved in running the sites is directed at thwarting that type of thing?

ES: I wouldn't estimate it at a percentage of effort. But it's definitely an investment that we make in our infrastructure. There are lots of different ways that you can prevent that, ranging from preventing the requests from getting to your datacenter at all, to preventing it at the load-balancer level, to preventing it at your application level. And we make efforts at all of those different levels. It becomes just part of the cost of doing business. We also make some systems that help take that load before it would ever hit the application. So, yeah, it's definitely something that we work on regularly. But it's not a massive amount of our investment because we paid a lot of that cost years ago.
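
He doesn't say what those application-level defenses look like, but one common building block is a per-client request budget that sheds abusive traffic before it reaches expensive code paths. Here is a minimal, assumption-laden C# sketch; the class name, window size, and limits are invented, and a production system would layer this behind the network- and load-balancer-level filtering he describes.

```csharp
using System;
using System.Collections.Concurrent;

// A toy fixed-window rate limiter keyed by client (for example, IP address).
public class SimpleRateLimiter
{
    private readonly int _maxRequestsPerWindow;
    private readonly TimeSpan _window;
    private readonly ConcurrentDictionary<string, (DateTime WindowStart, int Count)> _counters =
        new ConcurrentDictionary<string, (DateTime WindowStart, int Count)>();

    public SimpleRateLimiter(int maxRequestsPerWindow, TimeSpan window)
    {
        _maxRequestsPerWindow = maxRequestsPerWindow;
        _window = window;
    }

    // Returns false once a client has exceeded its budget for the current
    // window, so the request can be rejected cheaply.
    public bool AllowRequest(string clientKey)
    {
        DateTime now = DateTime.UtcNow;
        var entry = _counters.AddOrUpdate(
            clientKey,
            _ => (now, 1),
            (_, existing) => now - existing.WindowStart > _window
                ? (now, 1)
                : (existing.WindowStart, existing.Count + 1));
        return entry.Count <= _maxRequestsPerWindow;
    }
}

// Usage sketch: create new SimpleRateLimiter(300, TimeSpan.FromMinutes(1))
// and call AllowRequest(clientIp) before doing any real work for the request.
```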

JT: I have to ask, given that just by the nature of the market share, a lot of the botnets are obviously Windows boxes, do you ever feel a little bit like you’re being shot with your own gun?

ES: [Laughter] Well, we deal with the world the way it is. So a lot of the boxes out in the world are Windows boxes. So, yeah, they're definitely going to be the ones running a lot of the botnets. A lot of the attacks that we get aren't necessarily coming from botnets, though; you can generate a lot of load simply from a very small number of machines. It's also not necessarily about creating a lot of load. Oftentimes, it can be just generating the right load. And so people are trying to attack us with maybe only a few clients that are just trying to search for some sort of vulnerability.

JT: Right. Or trying to use vulnerabilities in the protocols themselves.

ES: Exactly. Exactly.

JT: How heavily does Microsoft eat their own latest and greatest dog food? Is everything C# now? Or is there some ancient code lurking around in places?

ES: There's always ancient code in any large organization. But it depends on what group you're in, and it depends on the mission of that group. So when I was at Microsoft.com, which I can talk about more publicly than what goes on in Search, part of the core mission of Microsoft.com was to eat the dog food of the core platform teams. So, for example, our major properties were live on the very first betas of ASP.NET more than a year before it was actually publicly released. So we were out there running all of our core sites on this code well before it was released.

It was really a great experience. That may sound crazy, but it was so much better than what else was available at the time. It gave us the opportunity to give really great feedback to the team early on because it was a very large-scale site. And so it gave them a really good test bed and a real customer, and it was good for us because we got to give feedback to the team on what's really important to us as a big customer.

JT: Microsoft has done a lot of work providing dynamic language support through projects like IronRuby and IronPython. Do those technologies find any traction internally?

ES: Oh, yeah, actually. I think it depends a lot on what group you're in. In the research teams, they tend to be used more than in the product teams that I've been on. But there's definitely — we're all geeks, you know? You get geeks and you get neat technologies and the geeks will try and use them. So with IronPython, I know quite a few people who've experimented with it for everything from build tools to little data analysis tools.

And so, yeah, they definitely get used. Most of the time, people are gravitating towards C# for almost everything. But a lot of these tools do get used, absolutely.

JT: At Velocity, you’re going to be talking about deliberately degrading user experience to gather data. That seems to go against the basic DNA of every site administrator, even if it’s just for a small group. First of all, how do you go about pessimizing user experience? And what kind of data does it allow you to gather?

ES: Like any of the big sites–Google, Yahoo, Amazon–we have an experimentation platform. And what this lets us do is take some fraction of our audience, some bucket of our audience, and give them a different experience than other buckets of our audience. And so what we tend to do is have users in an experiment and users in a control. And we can give the experimental group pretty much any kind of experience that we want. And then we have an analysis system that runs after the fact and compares end-user metrics for those two groups. And so we can track everything from how much money we made from each of the groups, to what was the click-through rate, to what was the perceived relevance. There's all sorts of different things that we can track.
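
He doesn't go into how the bucketing works, but the core of any such platform is assigning each user deterministically to a control or treatment group, so they see a consistent experience across visits. A rough C# sketch, with an invented class name, hash scheme, and bucket split, might look like this:

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class ExperimentBucketer
{
    // Map a user ID plus experiment name to one of `totalBuckets` buckets.
    // Hashing both together keeps assignments independent across experiments.
    public static int GetBucket(string userId, string experimentName, int totalBuckets)
    {
        using (var md5 = MD5.Create())
        {
            byte[] hash = md5.ComputeHash(Encoding.UTF8.GetBytes(userId + ":" + experimentName));
            int value = BitConverter.ToInt32(hash, 0) & int.MaxValue; // force non-negative
            return value % totalBuckets;
        }
    }

    // Illustrative policy only: buckets 0-4 get the experimental treatment,
    // buckets 5-9 serve as the control, everyone else is untouched.
    public static bool IsInExperiment(string userId, string experimentName)
        => GetBucket(userId, experimentName, 100) < 5;

    public static bool IsInControl(string userId, string experimentName)
    {
        int bucket = GetBucket(userId, experimentName, 100);
        return bucket >= 5 && bucket < 10;
    }
}
```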

We make different business decisions based off of this data, and so one of the things that we’ve always wondered is how much of an impact does performance make on our key business metrics. What we’ve tried to do is come up with many different ways of measuring a performance impact. So in some cases, we’ve been able to take some upcoming improvement that we plan on releasing and, before we release it to everybody, we’ll release it to a small number of people first and see what the metrics show.

But most of the time, when we have something that we know is going to improve the experience for the users, we just want to get it out there to everybody as quickly as we can. In these cases, we ran several different classes of experiments. In some cases, we tried adding more content to the page, just in the form of HTML comments. So we made the page fatter, and we made it fatter in a bunch of different locations. We also tried adding different kinds of actual latencies to the page, where we're actually putting a pause on the server for a certain period of time. We were never routing more than a very tiny fraction of our users to this experience, and what we generally found was, as you would expect, that it impacted users by different amounts depending on how much latency or how much page weight we were adding.
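
The mechanics of an experiment like that are straightforward. A hypothetical handler, reusing the ExperimentBucketer sketch above, might inject the extra page weight or the server-side pause only for the tiny experiment fraction; everything here (the method names, the 500 ms delay, the 50 KB of padding) is illustrative rather than what Live Search actually shipped.

```csharp
using System;
using System.Text;
using System.Threading;

public static class SlowdownExperiment
{
    // Build a block of inert page weight: HTML comments add bytes without
    // changing what the user sees.
    public static string BuildPaddingComment(int kilobytes)
    {
        var padding = new StringBuilder("<!-- perf-experiment-padding ");
        padding.Append('x', kilobytes * 1024);
        padding.Append(" -->");
        return padding.ToString();
    }

    // Hold the response on the server for a fixed delay; the interview
    // mentions delays on the order of half a second.
    public static void ApplyServerDelay(TimeSpan delay)
    {
        Thread.Sleep(delay);
    }

    // Example wiring: only users bucketed into the experiment ever see this.
    public static string RenderPage(string userId, string basePageHtml)
    {
        if (ExperimentBucketer.IsInExperiment(userId, "page-slowdown"))
        {
            ApplyServerDelay(TimeSpan.FromMilliseconds(500));
            return basePageHtml + BuildPaddingComment(50); // roughly 50 KB of comments
        }
        return basePageHtml;
    }
}
```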

In cases where we were giving what was a significantly degraded experience, the data moved to significance extremely quickly. When we delayed people's pages by more than half a second, it was very obvious very quickly that this had a significant impact on users, and so we were able to turn off that experiment. The reason we did it was that it helps us make a strong argument for how we can prioritize work on performance against work on other aspects of the site. I mean, I'm the performance guy; I don't like hurting end-users in terms of performance. But if it helps me make a good, strong business argument to make other changes that will improve the experience for all of my users, for all time to come, and it means that a small segment of users, for a small period of time, will experience what we think will likely be a negative thing but we're not sure, it was a test worth running.

JT: So it’s the greatest good for the greatest number paradigm.

ES: Essentially. And there are all kinds of things where, as soon as you start using a testing system, you're going to end up with multiple experiences. And that's the whole point: you're testing experiences against each other. You're going to have theories to begin with as to which one is a better experience, yet you're still going to be subjecting a bunch of your users to the other experience. This was kind of the same thing. If we had honestly found that increasing the size of the page by 4X didn't make a difference to users, we would've considered doing it, because that's good data. But that's not what we found, you know? Hey, it does get worse. But we were also able to identify, "Hmm, if we put data on the page in one location, it seems to have less of an impact on the end-user than if we put it in other locations." That's really good data for us to know. And we found that, "Hey, if we put it in some locations, it doesn't seem to have much of an impact on the user at all." And so that's really helpful for us because it helps us know where it's okay to make the page larger and where it's okay to put delays, because we can use that opportunity to give the user more things.

Oh, I was just going to say, one of my favorite quotes about performance is that the fastest page is an empty page, but it doesn't solve any end-user needs. It's all about what kinds of costs you're willing to pay for what kinds of features. And this is to try and help us determine what the base cost of making performance changes is. Then we can use that when we're trying to evaluate what kinds of features we want to ship in the future.

JT: Some of that intuitively makes sense. For example, you would think that content below the fold on the page being delayed in loading won't be as obnoxious to a user as a delay in the upper left-hand content. Is that the kind of stuff you were seeing?

ES: Exactly. Yeah. We looked at that, and we did find exactly that. Although it did depend on what sort of content you were delaying below the fold and how you had designed your page. So, for example, if you have things that you're waiting to do until onLoad occurs, then even if you are doing a lot of stuff that's below the fold, if it's delaying onLoad, then it's delaying some other functionality that might be important to the user. And so the overall architecture of the page may or may not be impacted by things that you're doing below the fold. It kind of depends on what your features are. And so what we've done is measure, in a bunch of our user scenarios, what kinds of things tend to impact the end-user.

JT: I wanted to turn for a second to the work you had done on the download site and how download sites seem to be evolving in general. Ten years ago, five years ago, you went to a download site and it was an HTTP or maybe an FTP link to the raw file. You got it, and you had it or you didn't. It seems like most of the major sites have now gone to dedicated downloaders. Is that an acknowledgment that HTTP just isn't up to a 3.5 gig file?

ES: [Laughter] There's a lot of reality in that. If you lose a packet along the way, most of the browsers don't really handle that well. There's also the fact that a download application can end up opening a lot more connections to your server and can be intelligent about that. You also have things like BitTorrent that let you pull pieces of a file from many different locations. I was going around to some sites downloading some things last night, and some of them used dedicated downloaders; some of them didn't. I would say it really depends on what you're trying to download, but especially in a worldwide environment, there's a strong chance of not having a successful download experience once the file gets over a certain size if you're just using a browser. And in those cases, you don't really have an alternative other than using some sort of tool that's going to gracefully recover or be able to stitch pieces of many downloads together.
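
The standard mechanism that lets a downloader "gracefully recover" is the HTTP Range header: instead of restarting a failed multi-gigabyte transfer from byte zero, the client asks the server for only the bytes it is missing. A minimal C# sketch of that idea follows; the URL and path are placeholders, and real downloaders add retries, integrity checks, and parallel ranges.

```csharp
using System.IO;
using System.Net;

public static class ResumableDownloader
{
    // Resume (or start) a download, appending to whatever portion of the
    // file already exists locally. Error handling is kept minimal.
    public static void Download(string url, string localPath)
    {
        long existingLength = File.Exists(localPath) ? new FileInfo(localPath).Length : 0;

        var request = (HttpWebRequest)WebRequest.Create(url);
        if (existingLength > 0)
        {
            // Ask the server for only the bytes we are missing.
            request.AddRange(existingLength);
        }

        using (var response = (HttpWebResponse)request.GetResponse())
        using (Stream body = response.GetResponseStream())
        using (var file = new FileStream(localPath, FileMode.Append, FileAccess.Write))
        {
            // If the server ignores the Range header it returns 200 and the whole
            // file; a robust client would detect that and truncate the local copy first.
            byte[] buffer = new byte[64 * 1024];
            int read;
            while ((read = body.Read(buffer, 0, buffer.Length)) > 0)
            {
                file.Write(buffer, 0, read);
            }
        }
    }
}
```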

JT: I've been talking to Eric Schurman, who's in charge of site performance for Microsoft Live Search. He'll be speaking on "Measured Impacts of Page Slowdowns on Real Users" at the Velocity: Web Performance and Operations Conference, June 22-24, in San Jose, California. Thank you for taking the time to talk to us.

ES: Sure. Thanks for calling me.
