- Reducing the Roots of Some Evil (Etsy) — Based on our first two months of data we have removed a number of unused CA certificates from some pilot systems to test the effects, and will run CAWatch for a full six months to build up a more comprehensive view of what CAs are in active use. Sign of how broken the CA system for SSL is. (via Alex Dong)
- Mind the Brain — PLOS podcast interviews Sci Foo alum and delicious neuroscience brain of awesome, Vaughan Bell. (via Fabiana Kubke)
- How Often are Ineffective Interventions Still Used in Practice? (PLOS ONE) — tl;dr: 8% of the time. Imagine the number if you asked how often ineffective software development practices are still used.
- Announcing Evan’s Awesome A/B Tools — I am calling these tools awesome because they are intuitive, visual, and easy-to-use. Unlike other online statistical calculators you’ve probably seen, they’ll help you understand what’s going on “under the hood” of common statistical tests, and by providing ample visual context, they make it easy for you to explain p-values and confidence intervals to your boss. (And they’re free!)
Distrusting CA Certs, Brain Talk, Ineffective Interventions, and Visual A/B Tools
Retreading old topics can be a powerful source of epiphany, sometimes more so than simple extra-box thinking. I was a computer science student, so of course I knew statistics. But my recent years as a NoSQL (or, better stated, distributed systems) junkie have irreparably colored my worldview, filtering every metaphor through a tinge of information management.
Lounging on a half-world plane ride has its benefits, namely, the opportunity to read. Most of my Delta flight from Tel Aviv back home to Portland lacked both wifi and (in my case) a workable laptop power source. So instead, I devoured Nate Silver’s book, The Signal and the Noise. When Nate reintroduced me to the concept of statistical overfit, and relatedly underfit, I could not help but consider these cases in light of the modern problem of distributed data management, namely, operators (you may call these operators DBAs, but please, not to their faces).
When collecting information, be it for a psychological profile of chimp mating rituals, or plotting datapoints in search of the Higgs Boson, the ultimate goal is to find some sort of usable signal, some trend in the data. Not every point is useful, and in fact, any individual point could be downright abnormal. This is why we need several points to spot a trend. The world rarely gives us anything clearer than a jumble of anecdotes. But plotted together, occasionally a pattern emerges. This pattern, if repeatable and useful for prediction, becomes a working theory. This is science, and is generally considered a good method for making decisions.
On the other hand, when lacking experience, we tend to overvalue the experience of others when we assume they have more. This works in straightforward cases, like learning to cook a burger (watch someone make one, copy their process). It isn’t so useful once the cases diverge. Watching someone make a cake won’t tell you much about the process of crafting a burger. Folks like to call this cargo cult behavior.
How Fit are You, Bro?
You need to extract useful information from experience (for which I’ll use the math-y sounding word datapoints). Having a collection of datapoints to choose from is useful, but that’s only one part of the process of decision-making. I’m not speaking of a necessarily formal process here, but in the case of database operators, merely a collection of experience. Reality tends to be fairly biased toward facts (despite the desire of many people for this to not be the case). Given enough experience, especially if that experience is factual, we tend to make better and better decisions more in line with reality. That’s pretty much the essence of prediction. Our mushy human brains are more-or-less good at that, at least, better than other animals. It’s why we have computers and Everybody Loves Raymond, and my cat pees in a box.
Imagine you have a sufficient number of relevant datapoints that you can plot on a chart. Assuming the axes have any relation to each other, and the data is sound, a trend may emerge, such as a line, or some other bounding shape. A signal is relevant data that corresponds to the rules we discover by best fit. Noise is everything else. It’s somewhat circular-sounding logic, and it’s really hard to know what is really a signal. This is why science is hard, and so is choosing a proper database. We’re always checking our assumptions, and one solid counter signal can really be disastrous for a model. We may have been wrong all along, missing only enough data. As Einstein famously said in response to the book 100 Authors Against Einstein: “If I were wrong, then one would have been enough!”
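To make "signal is what the best fit explains, noise is everything else" a bit more concrete, here is a minimal sketch in Python. The data is invented for illustration (a line with slope 2 and intercept 1 plus random jitter), and the helper names `fit_line` and `residuals` are my own, not from any particular library. It fits a straight line by ordinary least squares and treats the leftover residuals as the noise.

```python
import random


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def residuals(xs, ys, slope, intercept):
    """Whatever the fitted line does not explain: our working definition of noise."""
    return [y - (slope * x + intercept) for x, y in zip(xs, ys)]


# Hypothetical datapoints: an underlying trend (the signal) plus Gaussian jitter.
rng = random.Random(42)
xs = [float(x) for x in range(50)]
ys = [2.0 * x + 1.0 + rng.gauss(0.0, 3.0) for x in xs]

slope, intercept = fit_line(xs, ys)
noise = residuals(xs, ys, slope, intercept)

print(f"recovered signal: y = {slope:.2f} * x + {intercept:.2f}")
print(f"largest single-point deviation: {max(abs(r) for r in noise):.2f}")
```

Note the circularity the text mentions: the residuals only count as noise because we decided up front that a line was the right shape for the signal.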
Database operators (and programmers forced to play this role) must make predictions all the time, against a seemingly endless series of questions. How much data can I handle? What kind of latency can I expect? How many servers will I need, and how much work to manage them?
So, like all decision-making processes, we refer to experience. The problem is, as our industry demands increasing scale, very few people actually have much experience managing giant-scale systems. We tend to draw our assumptions from our limited or biased smaller-scale experience, and extrapolate outward. The theories we then tend to concoct are not the optimal fit that we desire, but instead tend to be overfit.
Overfit is when we have a limited amount of data, and overstate its general implications. If we imagine a plot of likely failure scenarios against a limited number of servers, we may be tempted to believe our biggest odds of failure are insufficient RAM, or disk failure. After all, my network has never given me problems, but I sure have lost a hard drive or two. We take these assumptions, which are only somewhat relevant to the realities of scalable systems, and divine some rules for ourselves that entirely miss the point.
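A small sketch of what overfit looks like numerically, under assumptions I am making up for the example: six invented observations that roughly follow y = 2x + 1, and a "larger scale" question asked at x = 8, outside anything we measured. One model is a plain least-squares line. The other is a degree-5 polynomial (built by Lagrange interpolation) that passes through every observation exactly, which is about as overfit as a model can get.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def lagrange(xs, ys, x):
    """Evaluate the unique polynomial through every (xs[i], ys[i]) at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total


# Hypothetical small-scale experience: true trend 2x + 1 plus a little jitter.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.5, 2.3, 5.9, 6.6, 9.6, 10.2]

# The question we actually care about lies beyond our experience.
x_big = 8.0
truth = 2.0 * x_big + 1.0

slope, intercept = fit_line(xs, ys)
line_pred = slope * x_big + intercept
poly_pred = lagrange(xs, ys, x_big)

print(f"truth at x={x_big}: {truth}")
print(f"line prediction:       {line_pred:.2f}")
print(f"polynomial prediction: {poly_pred:.2f}")
```

The polynomial has zero error on everything we have seen, yet its extrapolation is wildly off, because it has memorized the jitter rather than the trend. That is the same trap as treating a handful of small-cluster anecdotes as rules for a large system.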
In a real distributed system, network issues tend to consume most of our interest. Single-server consistency is a solved problem, and most (worthwhile) distributed databases have some sense of built-in redundancy (usually replication, the root of all distributed evil).
Exploiting Glass, Teaching Probability, Product Design, and Subgraph Matching
- Exploiting a Bug in Google Glass — unbelievably detailed and yet easy-to-follow explanation of how the bug works, how the author found it, and how you can exploit it too. The second guide was slightly more technical, so when he returned a little later I asked him about the Debug Mode option. The reaction was interesting: he kind of looked at me, somewhat confused, and asked “wait, what version of the software does it report in Settings”? When I told him “XE4” he clarified “XE4, not XE3”, which I verified. He had thought this feature had been removed from the production units.
- Probability Through Problems — motivating problems to hook students on probability questions, structured to cover high-school probability material.
- Connbox — love the section “The importance of legible products” where the physical UI interacts seamlessly with the digital device … it’s glorious. Three amazing videos.
- The Index-Based Subgraph Matching Algorithm (ISMA): Fast Subgraph Enumeration in Large Networks Using Optimized Search Trees (PLOS ONE) — The central question in all these fields is to understand behavior at the level of the whole system from the topology of interactions between its individual constituents. In this respect, the existence of network motifs, small subgraph patterns which occur more often in a network than expected by chance, has turned out to be one of the defining properties of real-world complex networks, in particular biological networks. […] An implementation of ISMA in Java is freely available.
The importance of data science tools that let organizations easily combine, deploy, and maintain algorithms
Data science often depends on data pipelines that involve acquiring, transforming, and loading data. (If you’re fortunate, most of the data you need is already in usable form.) Data needs to be assembled and wrangled before it can be visualized and analyzed. Many companies have data engineers (adept at using workflow tools like Azkaban and Oozie) who manage pipelines for data scientists and analysts.
A workflow tool for data analysts: Chronos from airbnb
A raw bash scheduler written in Scala, Chronos is flexible, fault-tolerant, and distributed (it’s built on top of Mesos). What’s most interesting is that it makes the creation and maintenance of complex workflows more accessible: at least within airbnb, it’s heavily used by analysts.
Job orchestration and scheduling tools contain features that data scientists would appreciate. They make it easy for users to express dependencies (start a job upon the completion of another job), and retries (particularly in cloud computing settings, jobs can fail for a variety of reasons). Chronos comes with a web UI designed to let business analysts define, execute, and monitor workflows: a zoomable DAG highlights failed jobs and displays stats that can be used to identify bottlenecks. Chronos lets you include asynchronous jobs – a nice feature for data science pipelines that involve long-running calculations. It also lets you easily define repeating jobs over a finite time interval, something that comes in handy for short-lived experiments (e.g. A/B tests or multi-armed bandits).
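The two core ideas here, dependencies and retries, are easy to sketch in a few lines of Python. To be clear, this is not Chronos's API or implementation: the `run_workflow` function, the job names, and the retry policy are all hypothetical, and everything runs in one process rather than on a cluster. It only illustrates the shape of the problem: run jobs in an order that respects the DAG, and re-run a job that fails transiently.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_workflow(jobs, deps, max_retries=2):
    """Run jobs in dependency order, retrying each failed job a few times.

    jobs: dict mapping job name -> zero-argument callable
    deps: dict mapping job name -> set of job names that must finish first
    Returns the execution order and the number of attempts each job needed.
    """
    graph = {name: deps.get(name, set()) for name in jobs}
    order = list(TopologicalSorter(graph).static_order())
    attempts = {}
    for name in order:
        for attempt in range(1, max_retries + 2):
            attempts[name] = attempt
            try:
                jobs[name]()
                break  # job succeeded, move on to the next one
            except Exception:
                if attempt == max_retries + 1:
                    raise  # out of retries, fail the whole workflow
    return order, attempts


# A toy extract -> transform -> load pipeline with one flaky step.
log = []
calls = {"transform": 0}


def extract():
    log.append("extract")


def transform():
    calls["transform"] += 1
    if calls["transform"] == 1:
        raise RuntimeError("transient failure")  # simulated cloud hiccup
    log.append("transform")


def load():
    log.append("load")


jobs = {"extract": extract, "transform": transform, "load": load}
deps = {"transform": {"extract"}, "load": {"transform"}}

order, attempts = run_workflow(jobs, deps)
print("ran in order:", order)
print("attempts per job:", attempts)
```

A real scheduler adds what this sketch leaves out: persistence of job state, time-based triggers, parallel execution of independent branches, and backoff between retries.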
Moving beyond traditional tools makes data analysis faster and more powerful
Garrett Grolemund is an O’Reilly author and teaches classes on data analysis for RStudio.
We sat down to discuss why data scientists, statisticians, and programmers alike can use the R language to make data analysis easier and more powerful.
Key points from the full video interview (below) include:
- R is a free, open-source language that has its roots in S-PLUS [Discussed at the 0:27 mark]
- What does it mean for R to be a programming language versus just a data analysis tool? [Discussed at the 1:00 mark]
- R comes with many useful data analysis methods already implemented, so you don’t have to start from scratch. [Discussed at the 4:23 mark]
- R is a mix of functional and object-oriented programming that is optimal for handling data structures that data analysts expect (e.g. vectors) [Discussed at the 6:08 mark]
- A discussion of using R in conjunction with other languages like Python, along with packages that help with this [Discussed at the 7:30 mark]
- Getting started using R isn’t really any harder than using a calculator [Discussed at the 9:28 mark]
You can view the entire interview in the following video.
Enlightened Tinkering, In-Browser Tor Proxy, Dark Patterns, and Subjective Data
- Hands on Learning (HuffPo) — Unfortunately, engaged and enlightened tinkering is disappearing from contemporary American childhood. (via BoingBoing)
- Dark Patterns (Slideshare) — User interfaces to trick people. (via Beta Knowledge)
- Bill Gates is Naive: Data Are Not Objective (Math Babe) — examples at the end of biased models/data should be on the wall of everyone analyzing data. (via Karl Fisch)
SSH/L Multiplexer, GitHub Bots, Test Your Assumptions, and Tech Trends
- sslh — ssh/ssl multiplexer.
- Github Says No to Bots (Wired) — what’s interesting is that bots augmenting photos is awesome in Flickr: take a photo of the sky and you’ll find your photo annotated with stars and whatnot. What can GitHub learn from Flickr?
- Four Assumptions of Multiple Regression That Researchers Should Always Test — “but I found the answer I wanted! What do you mean, it might be wrong?!”
- Tenth Grade Tech Trends (Medium) — if you want to know what will have mass success, talk to early adopters in the mass market. We alpha geeks aren’t that any more.
- An Intuitive Guide to Linear Algebra — Here’s the linear algebra introduction I wish I had. I wish I’d had it, too. (via Hacker News)
- Think Bayes — an introduction to Bayesian statistics using computational methods.
- Divshot — a startup turning mockups into web apps, built on top of the Bootstrap front-end framework. I feel momentum and a tipping point approaching, where building things on the web is about to get easier again (the way it did with Ruby on Rails). cf Jetstrap.
ID-based Democracy, Web Documentation, American Telco Gouging, and Stats Cookbook
- Finland Crowdsourcing New Laws (GigaOm) — online referenda. The Finnish government enabled something called a “citizens’ initiative”, through which registered voters can come up with new laws – if they can get 50,000 of their fellow citizens to back them up within six months, then the Eduskunta (the Finnish parliament) is forced to vote on the proposal. Now this crowdsourced law-making system is about to go online through a platform called the Open Ministry. Petitions and online voting are notoriously prone to fraud, so it will be interesting to see how well the online identity system behind this holds up.
- WebPlatform — wiki of information about developing for the open web. Joint production of many of the $BIGCOs of the web and the W3C, so will be interesting to see, as it develops, whether it has the best aspects of each or the worst.
- Why Your Phone, Cable, Internet Bills Cost So Much (Yahoo) — “The companies essentially have a business model that is antithetical to economic growth,” he says. “Profits go up if they can provide slow Internet at super high prices.” Excellent piece!
- Probability and Statistics Cookbook (Matthias Vallentin) — The cookbook contains a succinct representation of various topics in probability theory and statistics. It provides a comprehensive reference reduced to the mathematical essence, rather than aiming for elaborate explanations. CC-BY-NC-SA licensed, LaTeX source on GitHub.