"EC2" entries

NoSQL Choices: To Misfit or Cargo Cult?

Retreading old topics can be a powerful source of epiphany, sometimes more so than simple extra-box thinking. I was a computer science student, so of course I knew statistics. But my recent years as a NoSQL (or better stated: distributed systems) junkie have irreparably colored my worldview, filtering every metaphor with a tinge of information management.

Lounging on a half-world plane ride has its benefits, namely, the opportunity to read. Most of my Delta flight from Tel Aviv back home to Portland lacked both wifi and (in my case) a workable laptop power source. So instead, I devoured Nate Silver’s book, The Signal and the Noise. When Nate reintroduced me to the concept of statistical overfit, and relatedly underfit, I could not help but consider these cases in light of the modern problem of distributed data management, namely, operators (you may call these operators DBAs, but please, not to their faces).

When collecting information, be it for a psychological profile of chimp mating rituals, or plotting datapoints in search of the Higgs Boson, the ultimate goal is to find some sort of usable signal, some trend in the data. Not every point is useful, and in fact, any individual point could be downright abnormal. This is why we need several points to spot a trend. The world rarely gives us anything clearer than a jumble of anecdotes. But plotted together, occasionally a pattern emerges. This pattern, if repeatable and useful for prediction, becomes a working theory. This is science, and is generally considered a good method for making decisions.

On the other hand, when lacking experience, we tend to overvalue the experience of others when we assume they have more. This works in straightforward cases, like learning to cook a burger (watch someone make one, copy their process). This isn’t so useful as similarities diverge. Watching someone make a cake won’t tell you much about the process of crafting a burger. Folks like to call this cargo cult behavior.

How Fit are You, Bro?

You need to extract useful information from experience (for which I’ll use the math-y sounding word datapoints). Having a collection of datapoints to choose from is useful, but that’s only one part of the process of decision-making. I’m not speaking of a necessarily formal process here, but in the case of database operators, merely a collection of experience. Reality tends to be fairly biased toward facts (despite the desire of many people for this to not be the case). Given enough experience, especially if that experience is factual, we tend to make better and better decisions more in line with reality. That’s pretty much the essence of prediction. Our mushy human brains are more-or-less good at that, at least, better than other animals. It’s why we have computers and Everybody Loves Raymond, and my cat pees in a box.

Imagine you have a sufficient number of relevant datapoints that you can plot on a chart. Assuming the axes have any relation to each other, and the data is sound, a trend may emerge, such as a line, or some other bounding shape. A signal is relevant data that corresponds to the rules we discover by best fit. Noise is everything else. It’s somewhat circular-sounding logic, and it’s really hard to know what is really a signal. This is why science is hard, and so is choosing a proper database. We’re always checking our assumptions, and one solid counter signal can really be disastrous for a model. We may have been wrong all along, missing only enough data. As Einstein famously said in response to the book 100 Authors Against Einstein: “If I were wrong, then one would have been enough!”

Database operators (and programmers forced to play this role) must make predictions all the time, against a seemingly endless series of questions. How much data can I handle? What kind of latency can I expect? How many servers will I need, and how much work to manage them?

So, like all decision-making processes, we refer to experience. The problem is, as our industry demands increasing scale, very few people actually have much experience managing giant-scale systems. We tend to draw our assumptions from our limited or biased smaller-scale experience, and extrapolate outward. The theories we then tend to concoct are not the optimal fit that we desire, but instead tend to be overfit.

Overfit is when we have a limited amount of data, and overstate its general implications. If we imagine a plot of likely failure scenarios against a limited number of servers, we may be tempted to believe our biggest odds of failure are insufficient RAM, or disk failure. After all, my network has never given me problems, but I sure have lost a hard drive or two. We take these assumptions, which are only somewhat relevant to the realities of scalable systems, and divine some rules for ourselves that entirely miss the point.
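
To make the statistical sense of overfit concrete, here is a minimal Python sketch. The datapoints are invented for illustration (roughly y = 2x + 1 with hand-picked noise), and every function name is hypothetical. It fits the same five points twice: once with a plain least-squares line, and once with a polynomial forced through every single point.

```python
# Hypothetical measurements: roughly y = 2x + 1 plus hand-picked "noise".
train_x = [0.0, 1.0, 2.0, 3.0, 4.0]
train_y = [1.3, 2.6, 5.5, 6.7, 9.2]


def fit_line(xs, ys):
    """Ordinary least-squares line; returns a predictor function."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def fit_through_every_point(xs, ys):
    """Lagrange interpolation: a polynomial that hits every training point."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return predict


def worst_training_error(model):
    """Largest miss on the points the model was built from."""
    return max(abs(model(x) - y) for x, y in zip(train_x, train_y))


line = fit_line(train_x, train_y)
wiggle = fit_through_every_point(train_x, train_y)

# A new observation outside the range we trained on (also hypothetical).
new_x, new_y = 6.0, 13.0

for name, model in (("line", line), ("wiggle", wiggle)):
    print("%-6s training error %.2f, new-point error %.2f"
          % (name, worst_training_error(model), abs(model(new_x) - new_y)))
```

With these made-up numbers, the polynomial matches all five training points exactly yet predicts about 62 for the new point, while the line predicts about 13. Perfect agreement with a small sample is not the same thing as a model that generalizes.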

[Figures: overfitting, fitting]

In a real distributed system, network issues tend to consume most of our interest. Single-server consistency is a solved problem, and most (worthwhile) distributed databases have some sense of built-in redundancy (usually replication, the root of all distributed evil).

Beyond Puppet and Chef: Managing PostgreSQL with Ansible

Velocity 2013 Speaker Series

Think configuration management is simply a decision between Chef and Puppet? PalominoDB CTO (and Lead DB Engineer for Obama’s 2012 campaign) Jay Edwards (@meangrape) discusses his upcoming Velocity talk about Ansible, an alternative configuration management offering that is quick and easy to start using.

Key highlights include:

  • Unlike Puppet or Chef, Ansible has no notion of a centralized server. [Discussed at 1:30]
  • Ansible lets you get started more quickly and easily by doing everything via SSH. [Discussed at 2:12]
  • It’s also good for small-scale projects, such as home or personal things where no persistent state is required. [Discussed at 2:47]
  • Configuration in Ansible is all handled via markup in YAML files, so no domain-specific language (DSL) or Ruby knowledge is required. [Discussed at 3:30]
  • Ansible is easily extensible in any language (not just Ruby). [Discussed at 4:50]
  • While it’s less relevant for someone with existing configuration management installations, Ansible could be useful in certain cases, such as Puppet without mcollective set up. [Discussed at 6:11]

You can watch the entire interview here:

This is one of a series of posts related to the upcoming Velocity conference in Santa Clara, CA (June 18-20). We’ll be highlighting speakers in a variety of ways, from video and email interviews to posts by the speakers themselves.

Defining clouds, web services, and other remote computing

Part 2 of the series, "What are the chances for a free software cloud?"

Technology commentators are a bit trapped by the term "cloud," which
has been kicked and slapped around enough to become truly shapeless.
So in this section I'll offer a history of services that have led up
to our cloud-obsessed era, hoping to help readers distinguish the
impacts and trade-offs created by all the trends that lie in the
"cloud."

Four short links: 5 November 2009

Heat Maps in R, EC2 Blackhat Tricks, Snickersome Unicode, and Decoding Statistics

  1. Heat Maps in R: We used financial data here because it’s easier to access than the airline data, but it’s actually a pretty interesting way of looking at a financial time series. Weekend and holiday effects are a bit more obvious, and it’s a bit like being able to see the daily, weekly, monthly and yearly closes all at once (by scanning your eye over the calendar in different directions). Includes source code. (via migurski on Delicious)
  2. BlackHat and EC2: Theft of resources is the red-headed step-child of attack classes and doesn’t get much attention, but on cloud platforms where resources are shared amongst many users these attacks can have a very real impact. With this in mind, we wanted to show how EC2 was vulnerable to a number of resource theft attacks and the videos below demonstrate three separate attacks against EC2 that permit an attacker to boot up massive numbers of machines, steal computing time/bandwidth from other users and steal paid-for AMIs. (via straup on Delicious)
  3. Funny Characters in Unicode — I never get tired of the wacky stuff in Unicode. I love the thought of a Unicode committee somewhere arguing passionately about the number of buttons on the snowman …. (via Hacker News)
  4. Statistics to English Translation: The terms sensitivity and specificity generally refer to diagnostic or screening procedures, such as an HIV or allergy tests. The sensitivity of a test is its true positive rate; the specificity is its true negative rate, although it can be more intuitive to think of specificity as the complement of the false positive rate. This matters. Bandying around numbers with misleading labels, or misinterpreting numbers that have a precise and defined meaning, does not further understanding. (Said 78.4% of statisticians, with a 20% confidence factor probability of false positives)
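
The two definitions in the last item can be written out as a short Python sketch. The counts below are hypothetical screening results chosen for round numbers, and the function names are illustrative rather than taken from the linked article.

```python
def sensitivity(true_pos, false_neg):
    """True positive rate: share of actual positives the test catches."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """True negative rate: share of actual negatives the test clears."""
    return true_neg / (true_neg + false_pos)


def false_positive_rate(true_neg, false_pos):
    """Share of actual negatives the test wrongly flags."""
    return false_pos / (true_neg + false_pos)


# Hypothetical screening of 1,000 people, 100 of whom have the condition.
tp, fn = 90, 10    # the 100 people with the condition
tn, fp = 855, 45   # the 900 people without it

sens = sensitivity(tp, fn)
spec = specificity(tn, fp)
fpr = false_positive_rate(tn, fp)

print("sensitivity:", sens)
print("specificity:", spec)
print("1 - false positive rate:", 1 - fpr)
```

Specificity and one minus the false positive rate come out the same (up to floating-point rounding), which is the "complement" relationship the quote mentions.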

 

Four short links: 16 October 2009

Audio Geotagging, SF Open Data Stories, Wave Use Cases, Hadooped Genomes

  1. Wiimote Audio Geotagging — match audio with the map movement and annotations made with an IR pen and a Wiimote. Very cool! (and from New Zealand)
  2. San Francisco: Open For Data: Two months after it launched, the project is already reaping rewards from San Francisco’s huge community of programmers. Applications using the data include Routesy, which offers directions based on real-time city transport feeds; and EcoFinder, which points you to the nearest recycling site for a given item.
  3. Google Wave’s Best Use Cases (Lifehacker) — not cases where people are using Wave, but where they want to. Read this as “the Web has not provided all the tools to solve these problems”. Something will solve them, and Wave is trying to. (via Jim Stogdill)
  4. Analyzing Human Genomes with Hadoop — case study from the Cloudera blog. Performs alignment and genotyping on the 100GB of data you get when you sequence a human’s genome in about three hours for less than $100 using a 40-node, 320-core cluster rented from Amazon’s EC2. (via mndoci on Twitter)
Four short links: 8 October 2009

DIY Baby Rocker, Unix Systems Glory, Encrypting Ephemera, and Explaining Creative Joy

  1. Linux Baby Rocker — inventive use of a CD drive and the eject command … (via Hacker News)
  2. I Like Unicorn Because It’s Unix — forceful rant about the need to rediscover Unix systems programming. Reminds me of the Varnish notes where the author explains that it works better because it uses the operating system instead of recreating it poorly.
  3. Encrypting Ephemeral Storage and EBS Volumes on Amazon — step-by-step instructions. (via Matt Biddulph on Delicious)
  4. You Have No Life: if a video smacks even slightly of concentrated effort or advance planning, someone will inevitably scoff that the subject has a) “too much time on his hands” or b) “no life.” Ten times out of ten. […] After six years I lack a succinct, meaningful response to my students’ defensive, clannish embrace of mediocrity, though I’m grateful for this tweet, which comes pretty close: dwineman: You say “looks like somebody has too much time on their hands” but all I hear is “I’m sad because I don’t know what creativity feels like.”
Four short links: 22 September 2009

Cities, How Things Work, Stylish Google, EC2 Numbers

  1. The City is a Battlesuit for Surviving the Future (IO9) — a great essay by Matt Jones, based on his talk at Webstock this year. Urban design is how we created alternate realities before we had iPhones, and the new technology lets us choose which science fiction future we want to inhabit. We are now a predominantly urban species, with over 50% of humanity living in a city. The overwhelming majority of these are not old post-industrial world cities such as London or New York, but large chaotic sprawls of the industrialising world such as the “maximum cities” of Mumbai or Guangzhou. Here the infrastructures are layered, ad-hoc, adaptive and personal – people there really are walking architecture, as Archigram said. Hacking post-industrial cities is becoming a necessity also. […]
  2. How and Why Machines Work (MIT Open Course Ware) — Subject studies how and why machines work, how they are conceived, how they are developed (drawn), and how they are utilized. Students learn from the hands-on experiences of taking things apart mentally and physically, drawing (sketching, 3D CAD) what they envision and observe, taking occasional field trips, and completing an individual term project (concept, creation, and presentation). Emphasis on understanding the physics and history of machines. (via Hacker News)
  3. Google Style Guide — how Google codes. Useful if you’re working on their code, starting a job there, or want to mock them for not specifying K&R braces/four space tabs/<insert One True Way here>. (via Hacker News)
  4. EC2 Usage Guessed From Sequential IDs: The Superseries ID changes so rarely that originally I had assumed it was some kind of checksum. This would have been odd as it limits the total available IDs to 2^24 = 16.8 million. Up to very recently, the Superseries ID for all resource types – instances, images, volumes, snapshots, etc. – was 69 (in the us-east-1 region; for eu-west-1 the Superseries ID is 74). These days, new instances use the Superseries ID 68. This subtle change, unnoticed by the industry, may hint at an astonishing achievement: 8.4 million instances launched since EC2’s debut! (Instance IDs are even, so 8.4M = 16.8M / 2.) (via mattb on delicious)
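
The arithmetic in the last item is easy to check. This is only a sketch of the quoted reasoning, assuming (as the quoted post does) a 24-bit ID space per Superseries in which instances use only the even IDs; the variable names are hypothetical.

```python
# Assumption from the quoted post: 24 bits of ID space per Superseries.
id_bits = 24
total_ids = 2 ** id_bits

# Assumption from the quoted post: instance IDs are always even,
# so only half of the ID space is available to instances.
even_ids = total_ids // 2

print("IDs per Superseries: %d (about %.1f million)" % (total_ids, total_ids / 1e6))
print("Even IDs only:       %d (about %.1f million)" % (even_ids, even_ids / 1e6))
```
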
Four short links: 24 June 2009

Open Source Kids, Crowdsourcing Lessons, Flickr Secrets, Hadoop Spatial Joins

  1. The Digital Open: The Digital Open is an online technology community and competition for youth around the world, age 17 and under. Building a community of young open source hackers.
  2. Four Crowdsourcing Lessons from the Guardian’s Spectacular Expenses Scandal Experiment: Your workers are unpaid, so make it fun. How to lure them? By making it feel like a game. “Any time that you’re trying to get people to give you stuff, to do stuff for you, the most important thing is that people know that what they’re doing is having an effect,” Willison said. “It’s kind of a fundamental tenet of social software. … If you’re not giving people the ‘I rock’ vibe, you’re not getting people to stick around.” (via migurski on delicious)
  3. 10+ Deploys/Day: Dev & Ops Cooperation at Flickr — John Allspaw and Paul Hammond’s talk from Velocity. You tell any mainstream company in the world “10 deploys/day” and you’ll be met with disbelief.
  4. Reproducing Spatial Joins using Hadoop and EC2 — bit by bit the techniques for emulating important operations from trad databases are being discovered and shared in the new database scene. (via straup on delicious)
Four short links: 11 June 2009

Trends, Graffiti, Games, and Streaming Video

  1. Trending Topics — full source code for trendingtopics.org, Wikipedia trend analysis. Rails app running on the Cloudera Hadoop Distribution on EC2. (via mattb on Delicious)
  2. Graffiti from Pompeii — I can’t help but read these as Tweets. Herculaneum (on the exterior wall of a house); 10619: Apollinaris, the doctor of the emperor Titus, defecated well here (see also olde style Twitter) (via OvidPerl on Twitter)
  3. Online Games Dominate Beijing Startonomics — presentations from sessions on Chinese game business at Startonomics conference. Though there are many differences between the US and China games market, the one that stands out most is China’s ability to massively monetize games. Tencent, a leading Chinese web portal, social network and game developer, famously announced revenue of over $1 billion earlier this year, much of it coming from their avatar service. (via TinaTranT on Twitter)
  4. Ustream’s Audience for Apple iPhone Announcement Greater Than Cable News — Ustream is amazing, you can take a consumer handycam and video broadcast live to a greater audience than many TV shows get.
Four short links: 8 June 2009

3D Geometry, The Printable Web, Government Internet Fail, and Real World Cloud Computing

  1. How to Project on 3D Geometry — the fine art (and math) of distorting an image so that it looks undistorted when projected onto a non-flat 3D surface. Confused? See the images below. (via straup on Delicious)
  2. ZinePal — Create your own printable magazine from any online content. (via warrenellis on Delicious)
  3. What The Government Doesn’t Understand About The Internet And What To Do About It — Tom Steinberg from MySociety lays it out. As true for US, NZ, and every other country as it is for the UK (for which it was written). Accept that any state institution that says “we control all the information about X” is going to look increasingly strange and frustrating to a public that’s used to be able to do whatever they want with information about themselves, or about anything they care about (both private and public). This means accepting that federated identity systems are coming and will probably be more successful than even official ID card systems: ditto citizen-held medical records. It means saying “We understand that letting train companies control who can interface with their ticketing systems means that the UK has awful train ticket websites that don’t work as hard as they should to help citizens buy cheaper tickets more easily. And we will change that, now.” What I like about Tom vs the US’s Gov 2.0 is that Tom puts down philosophy that’s hard to argue with, whereas the US is dangerously close to simply focusing on techniques and that’s subvertible.
  4. Real World Cloud Computing — summary from a panel of startups who are using EC2. The lock-in is latency. Transferring data within the Amazon services is free. Transferring data to an Amazon competitor: not free.

Sample distorted and undistorted images