- White Hat’s Dilemma (Google Docs) — amazeballs preso with lots of tough ethical questions for people in the computer field.
- Chinese Hacking Team Caught Taking Over Decoy Water Plant (MIT Tech Review) — Wilhoit went on to show evidence that other hacking groups besides APT1 intentionally seek out and compromise water plant systems. Between March and June this year, 12 honeypots deployed across eight different countries attracted 74 intentional attacks, 10 of which were sophisticated enough to wrest complete control of the dummy control system.
- Web Tracing Framework — Rich tools for instrumenting, analyzing, and visualizing web apps.
- CoreOS — Linux kernel + systemd. That’s about it. CoreOS has just enough bits to run containers, but does not ship a package manager itself. In fact, the root partition is completely read-only, to guarantee consistency and make updates reliable. Docker-compatible.
Cruftifying web pages is not what Velocity is about.
There’s been a lot said and written about web performance since the Velocity conference. And steps both forward and back — is the web getting faster? Are developers using increased performance to add more useless gunk to their pages, taking back performance gains almost as quickly as they’re achieved?
I don’t want to leap into that argument; Arvind Jain did a good job of discussing the issues at Velocity Santa Clara and in a blog post on Google’s analytics site. But I do want to discuss (all right, flame about) one issue that bugs me.
I see a lot of pages that appear to load quickly. I click on a site, and within a second, I have an apparently readable page.
“Apparently,” however, is a loaded word, because a second later some new component of the page loads, causing the browser to lay out the page again, so everything jumps around. Then comes the pop-over screen, asking if I want to subscribe or take a survey. (Most online renditions of print magazines: THIS MEANS YOU!) Then another resize, as another component appears. If I want to scroll down past the lead picture, which is usually uninteresting, I often find that I can’t because the browser is still laying out bits and pieces of the page. It’s almost as if the developers don’t want me to read the page. That’s certainly the effect they achieve.
Modern Security Ethics, Punk'd Chinese Cyberwarriors, Web Tracing, and Lightweight Server OS
SOASTA chief architect Philip Tellis talks about ways developers and third-party script authors can use iframes.
In the following interview, Philip Tellis, chief architect at SOASTA, talks about how iframes can be used to address performance and security issues with third-party scripts, and how the element can help third-party script owners make use of far-future expires headers. Tellis will address these issues in-depth in his upcoming Velocity session, “Improving 3rd Party Script Performance With IFrames.”
How can iframes be used to boost performance?
Philip Tellis: Iframes haven’t traditionally been good for performance. Sub-pages loaded in iframes still block the loading of the main page. Too many iframes hurt performance in the same way as too many images or scripts do. The problem is slightly worse with iframes because each page loaded in an iframe may load its own resources, each of which competes with the main page for available bandwidth.
The three ways to reduce perceived latency in any system are to cache, parallelise, and predict, and iframes allow us to do all three without impacting the main page.
Laine Campbell on why AWS is a good platform option for running MySQL at scale
In the following interview, PalominoDB owner and CEO Laine Campbell discusses advantages and disadvantages of using Amazon Web Services (AWS) as a platform for running MySQL. The solution provides a functional environment for young startups that can’t afford a database administrator (DBA), Campbell says, but there are drawbacks to be aware of, such as a lack of access to your database’s file system, and troubleshooting “can get quite hairy.” This interview is a sneak preview of Campbell’s upcoming Velocity session, “Using Amazon Web Services for MySQL at Scale.”
Why is AWS a good platform for scaling MySQL?
Laine Campbell: The elasticity of Amazon’s cloud service is key to scaling on most tiers in an application’s infrastructure, and this is true with MySQL as well. Concurrency is a recurring pattern with MySQL’s scaling capabilities, and as traffic and concurrent queries grow, one has to introduce some fairly traditional scaling patterns. One such pattern is adding replicas to distribute read I/O and reduce contention and concurrency, which is easy to do with rapid deployment of new instances and Elastic Block Storage (EBS) snapshots.
Additionally, sharding can be done with less impact: EBS snapshots are used to recreate the dataset, and data that is not part of the new shard is then removed. Amazon’s relational database service for MySQL (RDS) is also a new, rather compelling scaling pattern for the early stages of a company’s life, when resources are scarce and administrators have not been hired. RDS is a great pattern for people to emulate in terms of rapid deployment of replicas, ease of master failovers, and the ability to easily redeploy hosts when errors occur, rather than spending extensive time trying to repair or clean up data.
Velocity 2013 Speaker Series
If you’re a System Administrator, you’re likely all too familiar with the 2:35am PagerDuty alert. “When you roll out testing on your infrastructure,” says Seth Vargo, “the number of alerts drastically decreases because you can build tests right into your Chef cookbooks.” We sat down to discuss his upcoming talk at Velocity, which promises to deliver many more restful nights for SysAdmins.
Key highlights from our discussion include:
- There are not currently any standards regarding testing with Chef. [Discussed at 1:09]
- A recommended workflow that starts with unit testing [Discussed at 2:11]
- Moving cookbooks through a “pipeline” of testing with Test Kitchen [Discussed at 3:11]
- In the event that something bad does make it into production, you can roll back actual infrastructure changes. [Discussed at 4:54]
- Automating testing and cookbook uploads with Jenkins [Discussed at 5:40]
Velocity 2013 Speaker Series
Failure Isolation and Operations with Hystrix
Web-scale applications such as Netflix serve millions of customers using thousands of servers across multiple data centers. Unmitigated system failures can impact the user experience, a product’s image, a company’s brand, and potentially its revenue. Service-oriented architectures such as these are too complex to completely understand or control and must be treated accordingly. The relationships between nodes are constantly changing as actors within the system independently evolve. Failure in the form of errors and latency will emerge from these relationships, and resilient systems can easily “drift” into states of vulnerability. Infrastructure alone cannot be relied upon to achieve resilience. Application instances, as components of a complex system, must isolate failure and constantly audit for change.
At Netflix, we have spent a lot of time and energy engineering resilience into our systems. Among the tools we have built is Hystrix, which specifically focuses on failure isolation and graceful degradation. It evolved from a series of production incidents involving saturated connection and/or thread pools, cascading failures, and misconfigurations of pools, queues, timeouts, and other such “minor mistakes” that led to major user impact.
This open source library follows these principles in protecting our systems when novel failures inevitably occur:
- Isolate client network interaction using the bulkhead and circuit breaker patterns.
- Fall back and degrade gracefully when possible.
- Fail fast when fallbacks aren’t available and rapidly recover.
- Monitor, alert and push configuration changes with low latency (seconds).
Restricting concurrent access to a given backend service has proven to be an effective form of bulkheading, as it limits the resource utilization to a concurrent request limit smaller than the total resources available in an application instance. We do this using two techniques: thread pools and semaphores. Both provide the essential quality of restricting concurrent access while threads provide the added benefit of timeouts so the caller can “walk away” if the underlying work is latent.
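The two isolation techniques described above can be sketched in a few lines. This is a minimal Python illustration of the idea, not Hystrix itself (which is a Java library); the class names `SemaphoreBulkhead` and `ThreadPoolBulkhead` and their `call` methods are hypothetical names chosen for this example.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


class SemaphoreBulkhead:
    """Caps concurrent calls to one backend and rejects immediately when full.

    Hypothetical helper for illustration; runs work on the caller's thread,
    so it cannot time out a call that is already in progress.
    """

    def __init__(self, max_concurrent):
        self._permits = threading.BoundedSemaphore(max_concurrent)

    def call(self, func, fallback):
        # Non-blocking acquire: fail fast rather than queue behind a slow backend.
        if not self._permits.acquire(blocking=False):
            return fallback()
        try:
            return func()
        except Exception:
            return fallback()
        finally:
            self._permits.release()


class ThreadPoolBulkhead:
    """Runs calls on a small dedicated pool so the caller can time out.

    Hypothetical helper for illustration. The executor's internal queue is
    unbounded, so a production version would also need to bound queued work.
    """

    def __init__(self, max_workers, timeout_seconds):
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._timeout = timeout_seconds

    def call(self, func, fallback):
        future = self._pool.submit(func)
        try:
            return future.result(timeout=self._timeout)
        except FutureTimeout:
            # The caller walks away; the worker thread finishes on its own.
            return fallback()
        except Exception:
            return fallback()
```

In both variants the limit passed to the constructor is assumed to be smaller than the total capacity of the application instance, so one latent backend can exhaust only its own allotment. The trade-off mirrors the one described above: the semaphore version is cheaper, while the thread pool version lets the caller stop waiting.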
Isolating functionality rather than the transport layer is valuable because it extends the bulkhead beyond network failures and latency to cover failures caused by client code as well. Examples include request validation logic, conditional routing to different or multiple backends, request serialization, response deserialization, response validation, and decoration. Network responses can be latent, corrupted, or incompatibly changed at any time, which in turn can result in unexpected failures in this application logic.
Appurify co-founders Manish Lachwani and Jay Srinivasan talk about the motivation behind their platform and the solutions it provides.
As our always-on society turns increasingly to mobile platforms and devices (a recent Global Mobile Data Traffic Forecast predicted 788 million mobile-only Internet users by 2015), mobile app development is becoming more and more important. Developers, however, are finding mobile measurement and optimization toolsets lacking, which is a growing problem as mobile users show low tolerance for buggy apps.
Appurify co-founders Manish Lachwani and Jay Srinivasan experienced these challenges first hand and launched a solution. The duo will demo their Appurify performance-optimization platform during the Lightning Demos at the upcoming Velocity conference. In the following interview, Lachwani and Srinivasan talk about the motivation behind Appurify and offer a sneak peek at what we can expect to see at their demo.
What are some of the key challenges developers face in measuring app performance?
Jay Srinivasan: Mobile performance measurement and optimization is broken today. This is a three-fold problem: there are no good tools, the mobile space is complex, and mobile users demand exceptional performance in all conditions.
More specifically, most performance measurement and optimization tools that exist for the web and PC world simply don’t exist for mobile. This is due both to the mobile ecosystem being relatively young and to the added technical complexity of working with mobile devices. Compounding this lack of tools is the complexity of the mobile environment. Mobile is much more fragmented from an operating system, device, and firmware perspective, and optimizations can vary depending on the environment. Mobile users are also more demanding, with the expectation that they can use their smartphones or tablets in an always-on, always-connected environment. Your mobile app needs to load quickly and perform seamlessly in all network and device conditions.
Velocity 2013 Speaker Series
At some point, we’ve all ended up trading horror stories over drinks with colleagues. Heads nod and shake in sympathy, and the stories get hairier as the night goes on. And while it of course feels good to get some of that dirt off your shoulder, is there a larger, better purpose to sharing war stories? I sat down with James Turnbull of Puppet Labs (@kartar) to chat about his upcoming Velocity talk about Ops mythology, and how we might be able to turn our tales of disaster into triumph.
Key highlights of our discussion include:
- Why do we share disaster stories? What is the attraction? [Discussed at 0:40]
- Stories are about shared experience and bonding with members of our community. [Discussed at 2:10]
- These horror stories are like mythological “big warnings” that help enforce social order, which isn’t always a good thing. [Discussed at 4:18]
- A preview of how his talk will be about moving away from the bad stories so people can keep telling more good stories. (Also: s’mores.) [Discussed at 7:15]
This is one of a series of posts related to the upcoming Velocity conference in Santa Clara, CA (June 18-20). We’ll be highlighting speakers in a variety of ways, from video and email interviews to posts by the speakers themselves.
Tech events you don't want to miss
Each Monday, we round up upcoming event highlights from the programming and technology spaces. Have an event to share? Send us a note.
Twisted Python: the engine of your Internet webcast: Jessica McKellar presents an architectural overview of the Python networking library, Twisted, and instructs on how to build robust clients and servers for popular and custom network protocols. Register for this free webcast.
Date: 10 a.m. PT, June 6
Location: Online webcast
2 Day Hadoop Training June 2013: This course offers a fast-paced technical overview of the Hadoop landscape, targeted toward both technical and non-technical people who want to understand the emerging world of big data. For more information and to register, visit the event page.
Date: June 8–9
Location: Sunnyvale, CA
Velocity 2013 Speaker Series
Why should we bother at all with notions such as risk and safety in web operations? Do web operations face risk? Do web operations manage risk? Do web operations produce risk? Last Christmas Eve, Amazon had an AWS outage affecting a variety of actors, including Netflix, which was a service included in many of the gifts shared on that very day. The event introduced the notion of risk into the discourse of web operations, so this may be a good time for some reflective thoughts on the very nature of risk in this domain.
What is risk? The question is a classic one, and the answer is tightly coupled to how one views the nature of the incident occurring as a result of the risk.
One approach to assessing the risk of Amazon going down is probabilistic: start by laying out the entire space of potential scenarios leading to Amazon going down, calculate the probability of each, and multiply each scenario’s probability by its estimated severity (likely expressed as the costs connected to that scenario, which depend on the time of the event). Each scenario can then be plotted in a risk matrix showing its weighted ranking (to prioritize future risk mitigation measures), or the weighted risks can be summed across all scenarios (to judge whether the risk of Amazon going down is below a certain acceptance criterion).
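To make the arithmetic of this probabilistic approach concrete, here is a small Python sketch. The scenario names, probabilities, costs, and acceptance criterion are all invented for illustration and do not describe any real outage data; the function names are likewise hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    name: str
    probability: float  # estimated chance of the scenario occurring in the period
    severity: float     # estimated cost if the scenario does occur


def weighted_risk(scenario):
    """Risk of one scenario: probability multiplied by severity."""
    return scenario.probability * scenario.severity


def rank_scenarios(scenarios):
    """Order scenarios from highest to lowest weighted risk (the risk-matrix ranking)."""
    return sorted(scenarios, key=weighted_risk, reverse=True)


def total_risk(scenarios):
    """Collective sum of the weighted risks across all scenarios."""
    return sum(weighted_risk(s) for s in scenarios)


def is_acceptable(scenarios, acceptance_criterion):
    """True when the summed risk falls below the chosen acceptance criterion."""
    return total_risk(scenarios) < acceptance_criterion


# Made-up figures, purely for illustration.
example_scenarios = [
    Scenario("operator_error", probability=0.10, severity=30_000),
    Scenario("power_loss", probability=0.02, severity=500_000),
    Scenario("network_partition", probability=0.05, severity=120_000),
]

for s in rank_scenarios(example_scenarios):
    print(s.name, round(weighted_risk(s)))
print("acceptable:", is_acceptable(example_scenarios, acceptance_criterion=25_000))
```

Note that the sketch only works if every relevant scenario can be enumerated and given a probability, which is the assumption the rest of this post goes on to question.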
This first way of answering the question of what the risk is for Amazon to go down is intimately linked with a perception of risk as energy to be kept contained (Haddon, 1980). This view originates from more recent times of increased development of process industries in which clearly graspable energies (fuel rods at nuclear plants, the fossil fuels at refineries, the kinetic energy of an aircraft) are to be kept contained and safely separated from a vulnerable target such as human beings. The next question of importance here becomes how to avoid an uncontrolled release of the contained energy. The strategies for mitigating the risk of an uncontrolled release of energy are basically two: barriers and redundancy (and the two combined: redundancy of barriers). Physically graspable energies can be contained through the use of multiple barriers (called “defenses in depth”) and potentially several barriers of the same kind (redundancy), for instance several emergency-cooling systems for a nuclear plant.
Using this metaphor, the risk of Amazon going down is mitigated by building a system of redundant barriers (several server centers, backup, active fire extinguishing, etc.). This might seem like a tidy solution, but here we run into two problems with this probabilistic approach to risk: the view of the human operating the system and the increased complexity that comes as a result of introducing more and more barriers.
Controlling risk by analyzing the complete space of possible (and graspable) scenarios basically does not distinguish between safety and reliability. From this view, a system is safe when it is reliable, and the reliability of each barrier can be calculated. However, there is one system component that is more difficult to grasp in terms of reliability than any other: the human. Inevitably, proponents of the energy/barrier model of risk end up explaining incidents (typically accidents) in terms of unreliable human beings not guaranteeing the safety (reliability) of the inherently safe (risk controlled by reliable barriers) system. I think this problem, which has an entire literature of its own, is too big to outline in further detail in this blog post, but let me point you towards a few references: Dekker, 2005; Dekker, 2006; Woods, Dekker, Cook, Johannesen & Sarter, 2009. The only issue is that these (and most other citations in this post) are all academic tomes, so for those who would prefer a shorter summary available online, I can refer you to this report. I can also reassure you that I will get back to this issue in my keynote speech at the Velocity conference next month. To put the critique briefly: the contemporary literature questions the view of humans as the unreliable component of inherently safe systems, and instead advocates a view of humans as the only ones guaranteeing safety in inherently complex and risky environments.