ENTRIES TAGGED "Velocity"

Building an Alerting System That Really Works

Velocity 2013 Speaker Series

Building a high quality alerting system often feels like a dark art. Often it is hard to set the proper thresholds and it is even harder to define when an alert should be triggered or not. This results in alerts being raised too early or too late and your colleagues losing faith in the system. Once you use a structured approach to build an alerting system you will find it much easier and the alerts more predictable and precise.

Measure Selection

First you have to select proper measures to alert on. This selection is key as all other steps depend on using meaningful measures. While there seems to be an infinite number of different measures, you can categorize them into three main categories:

  • Saturation measures indicate how much of a resource is used. Examples are CPU usage or resource pool consumption.
  • Occurrence measures indicate whether a condition was met or not. A good example is errors. These measures are often presented as a rate like failed transactions per seconds.
  • Continuous measures do not have a single value at any given point in time, but instead a large number of different values. A typical example is response times. Irrespective of how small you make the sample, you will always have a large amount of values and never just one single representative value.

Read more…

Comment: 1

Sharing is a competitive advantage

Why the Velocity conference is coming to New York.

In October, we’re bringing our Velocity conference to New York for the first time. Let’s face it, a company expanding its conference to other locations isn’t anything that unique. And given the thriving startup scene in New York, there’s no real surprise we’d like to have a presence there, either. In that sense, we’ll be doing what we’ve already been doing for years with the Velocity conference in California: sharing expert knowledge about the skills and technologies that are critical for building scalable, resilient, high-availability websites and services.

But there’s an even more compelling reason we’re looking to New York: the finance industry. We’d be foolish and remiss if we acted like it didn’t factor in to our decision, and that we didn’t also share some common concerns, especially on the operational side of things. The Velocity community spends a great deal of time navigating significant operational realities — infrastructure, cost, risk, failures, resiliency; we have a great deal to share with people working in finance, and I’d wager, a great deal to learn in return. If Google or Amazon go down, they lose money. (I’m not saying this is a good thing, mind you.) When a “technical glitch” occurs in financial service systems, we get flash crashes, a complete suspension of the Nasdaq, and whatever else comes next — all with potentially catastrophic outcomes.

Read more…

Comment

Sharing is a competitive advantage

Why the Velocity conference is coming to New York.

In October, we’re bringing our Velocity conference to New York for the first time. Let’s face it, a company expanding its conference to other locations isn’t anything that unique. And given the thriving startup scene in New York, there’s no real surprise we’d like to have a presence there, either. In that sense, we’ll be doing what we’ve already been doing for years with the Velocity conference in California: sharing expert knowledge about the skills and technologies that are critical for building scalable, resilient, high-availability websites and services.

But there’s an even more compelling reason we’re looking to New York: the finance industry. We’d be foolish and remiss if we acted like it didn’t factor in to our decision, and that we didn’t also share some common concerns, especially on the operational side of things. The Velocity community spends a great deal of time navigating significant operational realities — infrastructure, cost, risk, failures, resiliency; we have a great deal to share with people working in finance, and I’d wager, a great deal to learn in return. If Google or Amazon go down, they lose money. (I’m not saying this is a good thing, mind you.) When a “technical glitch” occurs in financial service systems, we get flash crashes, a complete suspension of the Nasdaq, and whatever else comes next — all with potentially catastrophic outcomes. Read more…

Comment: 1

The web performance I want

Cruftifying web pages is not what Velocity is about.

There’s been a lot said and written about web performance since the Velocity conference. And steps both forward and back — is the web getting faster? Are developers using increased performance to add more useless gunk to their pages, taking back performance gains almost as quickly as they’re achieved?

I don’t want to leap into that argument; Arvind Jain did a good job of discussing the issues at Velocity Santa Clara and in a blog post on Google’s analytics site. But, I do want to discuss (all right, flame) about one issue that bugs me.

I see a lot of pages that appear to load quickly. I click on a site, and within a second, I have an apparently readable page.

“Apparently,” however, is a loaded word because a second later, some new component of the page loads, causing the browser to re-layout the page, so everything jumps around. Then comes the pop-over screen, asking if I want to subscribe or take a survey. (Most online renditions of print magazines: THIS MEANS YOU!). Then another resize, as another component appears. If I want to scroll down past the lead picture, which is usually uninteresting, I often find that I can’t because the browser is still laying out bits and pieces of the page. It’s almost as if the developers don’t want me to read the page. That’s certainly the effect they achieve.

Read more…

Comments: 4
Four short links: 6 August 2013

Four short links: 6 August 2013

Modern Security Ethics, Punk'd Chinese Cyberwarriors, Web Tracing, and Lightweight Server OS

  1. White Hat’s Dilemma (Google Docs) — amazeballs preso with lots of tough ethical questions for people in the computer field.
  2. Chinese Hacking Team Caught Taking Over Decoy Water Plant (MIT Tech Review) — Wilhoit went on to show evidence that other hacking groups besides APT1 intentionally seek out and compromise water plant systems. Between March and June this year, 12 honeypots deployed across eight different countries attracted 74 intentional attacks, 10 of which were sophisticated enough to wrest complete control of the dummy control system.
  3. Web Tracing FrameworkRich tools for instrumenting, analyzing, and visualizing web apps.
  4. CoreOSLinux kernel + systemd. That’s about it. CoreOS has just enough bits to run containers, but does not ship a package manager itself. In fact, the root partition is completely read-only, to guarantee consistency and make updates reliable. Docker-compatible.
Comment

Using Iframes to Address Third-Party Script Issues and Boost Performance

SOASTA chief architect Philip Tellis talks about ways developers and third-party script authors can use iframes.

In the following interview, Philip Tellis, chief architect at SOASTA, talks about how iframes can be used to address performance and security issues with third-party scripts, and how the element can help third-party script owners make use of far-future expires headers. Tellis will address these issues in-depth in his upcoming Velocity session, “Improving 3rd Party Script Performance With IFrames.”

How can iframes be used to boost performance?

PhilipTellis

Philip Tellis

Philip Tellis: Iframes haven’t traditionally been good for performance. Sub-pages loaded in iframes still block the loading of the main page. Too many iframes hurt performance in the same way as too many images or scripts do. The problem is slightly worse with iframes because each page loaded in an iframe may load its own resources, each of which competes with the main page for available bandwidth.

All of this assumes that (1) the iframe is loaded within the page before the onload event fires, and (2) its src attribute points to a resource that needs to be downloaded. If we prevent either of these two conditions from happening, an iframe doesn’t have a performance penalty. We can then take advantage of the fact that the iframe is a complete browser window instance, and you can run pretty much anything you want in there without affecting the main page. This is great if you need to download and cache resources like JavaScript, images and CSS without blocking the page’s onload event or force a cache reload.

The three ways to reduce perceived latency in any system are to cache, parallelise, and predict, and iframes allow us to do all three without impacting the main page.

Read more…

Comment: 1

Ins and Outs of Running MySQL on AWS

Laine Campbell on why AWS is a good platform option for running MySQL at scale

In the following interview, PalominoDB owner and CEO Laine Campbell discusses advantages and disadvantages of using Amazon Web Services (AWS) as a platform for running MySQL. The solution provides a functional environment for young startups who can’t afford a database administrator (DBA), Campbell says, but there are drawbacks to be aware of, such as a lack of access to your database’s file system, and troubleshooting “can get quite hairy.” This interview is a sneak preview to Campbell’s upcoming Velocity session, “Using Amazon Web Services for MySQL at Scale.”

Why is AWS a good platform for scaling MySQL?

Laine Campbell

Laine Campbell

Laine Campbell: The elasticity of Amazon’s cloud service is key to scaling on most tiers in an application’s infrastructure, and this is true with MySQL as well. Concurrency is a recurring pattern with MySQL’s scaling capabilities, and as traffic and concurrent queries grow, one has to introduce some fairly traditional scaling patterns. One such pattern is adding replicas to distribute read I/O and reduce contention and concurrency, which is easy to do with rapid deployment of new instances and Elastic Block Storage (EBS) snapshots.

Additionally, sharding can be done with less impact via EBS snapshots being used to recreate the dataset, and then data that is not part of the new shard is removed. Amazon’s relational database service for MySQL—RDS—is also a new, rather compelling scaling pattern for the early stages of a company’s life, when resources are scarce and administrators have not been hired. RDS is a great pattern for people to emulate in terms of rapid deployment of replicas, ease of master failovers, and the ability to easily redeploy hosts when errors occur, rather than spending extensive time trying to repair or clean up data.

Read more…

Comment

Test-driven Infrastructure with Chef

Velocity 2013 Speaker Series

If you’re a System Administrator, you’re likely all too familiar with the 2:35am PagerDuty alert. “When you roll out testing on your infrastructure,” says Seth Vargo, “the number of alerts drastically decreases because you can build tests right into your Chef cookbooks.” We sat down to discuss his upcoming talk at Velocity, which promises to deliver many more restful nights for SysAdmins.

Key highlights from our discussion include:

  • There are not currently any standards regarding testing with Chef.  [Discussed at 1:09]
  • A recommended workflow that starts with unit testing  [Discussed at 2:11]
  • Moving cookbooks through a “pipeline” of testing with Test Kitchen [Discussed at 3:11]
  • In the event that something bad does make it into production, you can roll back actual infrastructure changes. [Discussed at 4:54]
  • Automating testing and cookbook uploads with Jenkins [Discussed at 5:40]

You can watch the full interview here:

 

Comment

Application Resilience in a Service-oriented Architecture

Velocity 2013 Speaker Series

Failure Isolation and Operations with Hystrix

Web-scale applications such as Netflix serve millions of customers using thousands of servers across multiple data centers. Unmitigated system failures can impact the user experience, a product’s image, and a company’s brand and, potentially, revenue. Service-oriented architectures such as these are too complex to completely understand or control and must be treated accordingly. The relationships between nodes are constantly changing as actors within the system independently evolve. Failure in the form of errors and latency will emerge from these relationships and resilient systems can easily “drift” into states of vulnerability. Infrastructure alone cannot be relied upon to achieve resilience. Application instances, as components of a complex system, must isolate failure and constantly audit for change.

At Netflix, we have spent a lot of time and energy engineering resilience into our systems. Among the tools we have built is Hystrix, which specifically focuses on failure isolation and graceful degradation. It evolved from a series of production incidents involving saturated connection and/or thread pools, cascading failures, and misconfigurations of pools, queues, timeouts, and other such “minor mistakes” that led to major user impact.

blocked-requests-640

This open source library follows these principles in protecting our systems when novel failures inevitably occur:

  • Isolate client network interaction using the bulkhead and circuit breaker patterns.
  • Fallback and degrade gracefully when possible.
  • Fail fast when fallbacks aren’t available and rapidly recover.
  • Monitor, alert and push configuration changes with low latency (seconds).

 
Restricting concurrent access to a given backend service has proven to be an effective form of bulkheading, as it limits the resource utilization to a concurrent request limit smaller than the total resources available in an application instance. We do this using two techniques: thread pools and semaphores. Both provide the essential quality of restricting concurrent access while threads provide the added benefit of timeouts so the caller can “walk away” if the underlying work is latent.

failing-dependency-640

Isolating functionality rather than the transport layer is valuable as it not only extends the bulkhead beyond network failures and latency, but also those caused by client code. Examples include request validation logic, conditional routing to different or multiple backends, request serialization, response deserialization, response validation, and decoration. Network responses can be latent, corrupted, or incompatibly changed at any time, which in turn can result in unexpected failures in this application logic.
Read more…

Comment

Mobile-centric Optimization Requires a Mobile-centric Approach

Appurify co-founders Manish Lachwani and Jay Srinivasan talk about the motivation behind their platform and the solutions it provides.

As our always-on society turns more and more to mobile platforms and devices—a recent Global Mobile Data Traffic Forecast predicted 788 million mobile-only Internet users by 2015—mobile app development is becoming more and more important. Developers, however, are finding mobile measurement and optimization toolsets lacking, which is increasingly becoming an issue as mobile users show low tolerance for buggy apps.

Appurify co-founders Manish Lachwani and Jay Srinivasan experienced these challenges first hand and launched a solution. The duo will demo their Appurify performance-optimization platform during the Lightning Demos at the upcoming Velocity conference. In the following interview, Lachwani and Srinivasan talk about the motivation behind Appurify and offer a sneak peek at what we can expect to see at their demo.

What are some of the key challenges developers face in measuring app performance?

Jay Srinivasan

Jay Srinivasan

Jay Srinivasan: Mobile performance measurement and optimization is broken today. This is a three-fold problem: there are no good tools, the mobile space is complex, and mobile users demand exceptional performance in all conditions.

More specifically, most performance measurement and optimization tools that exist for the web and PC world simply don’t exist for mobile. This is both due to the mobile ecosystem being relatively young as well as the added tech complexity that working with mobile devices offers. Compounding this lack of tools is the complexity of the mobile environment. Mobile is much more fragmented from an operating system, device, and firmware perspective, and optimizations can vary depending on the environment. Mobile users are also more demanding, with the expectation that they can use their smartphones or tablets in an always-on, always-connected environment. Your mobile app needs to load quickly and perform seamlessly in all network and device conditions.

Read more…

Comments: 2