More Tools for Managing and Reproducing Complex Data Projects (Ben Lorica) — As I survey the landscape, the types of tools remain the same, but interfaces continue to improve, and domain specific languages (DSLs) are starting to appear in the context of data projects. One interesting trend is that popular user interface models are being adapted to different sets of data professionals (e.g. workflow tools for business users).
A/A Testing — In an A/A test, you run a test using the exact same options for both “variants” in your test. That’s right, there’s no difference between “A” and “B” in an A/A test. It sounds stupid, until you see the “results.” (via Nelson Minar)
NSA Declares War on General-Purpose Computing (BoingBoing) — NSA director Michael S Rogers says his agency wants “front doors” to all cryptography used in the USA, so that no one can have secrets it can’t spy on — but what he really means is that he wants to be in charge of which software can run on any general purpose computer.
P Values are not Error Probabilities (PDF) — In particular, we illustrate how this mixing of statistical testing methodologies has resulted in widespread confusion over the interpretation of p values (evidential measures) and α levels (measures of error). We demonstrate that this confusion was a problem between the Fisherian and Neyman–Pearson camps, is not uncommon among statisticians, is prevalent in statistics textbooks, and is well nigh universal in the pages of leading (marketing) journals. This mass confusion, in turn, has rendered applications of classical statistical testing all but meaningless among applied researchers.
Modern Methods for Sentiment Analysis — Recently, Google developed a method called Word2Vec that captures the context of words, while at the same time reducing the size of the data. Gentle introduction, with code.
Duplicate SSH Keys Everywhere — It looks like all devices with the fingerprint are Dropbear SSH instances that have been deployed by Telefonica de Espana. It appears that some of their networking equipment comes set up with SSH by default, and the manufacturer decided to reuse the same operating system image across all devices.
Style.ONS — UK govt style guide covers the elements of writing about statistics. It aims to make statistical content more open and understandable, based on editorial research and best practice. (via Hadley Beeman)
Warren Ellis on the Apple Watch — I, personally, want to put a gold chain on my phone, pop it into a waistcoat pocket, and refer to it as my “digital fob watch” whenever I check the time on it. Just to make the point in as snotty and high-handed a way as possible: This is the decadent end of the current innovation cycle, the part where people stop having new ideas and start adding filigree and extra orifices to the stuff we’ve got and call it the future.
Clustering Bitcoin Accounts Using Heuristics (O’Reilly Radar) — In theory, a user can go by many different pseudonyms. If that user is careful and keeps the activity of those different pseudonyms separate, completely distinct from one another, then they can really maintain a level of, maybe not anonymity, but again, cryptographically it’s called pseudo-anonymity. […] It turns out in reality, though, the way most users and services are using bitcoin, was really not following any of the guidelines that you would need to follow in order to achieve this notion of pseudo-anonymity. So, basically, what we were able to do is develop certain heuristics for clustering together different public keys, or different pseudonyms.
A Primer on Hardware Security: Models, Methods, and Metrics (PDF) — Camouflaging: This is a layout-level technique to hamper image-processing-based extraction of gate-level netlist. In one embodiment of camouflaging, the layouts of standard cells are designed to look alike, resulting in incorrect extraction of the netlist. The layout of nand cell and the layout of nor cell look different and hence their functionality can be extracted. However, the layout of a camouflaged nand cell and the layout of camouflaged nor cell can be made to look identical and hence an attacker cannot unambiguously extract their functionality.
Prompter: A Domain-Specific Language for Versu (PDF) — literally a scripting language (you write theatrical-style scripts, characters, dialogues, and events) for an inference engine that lets you talk to characters and have a different story play out each time.
Failing at Microservices — deconstructed a failed stab at microservices. Category three engineers also presented a significant problem to our implementation. In many cases, these engineers implemented services incorrectly; in one example, an engineer had literally wrapped and hosted one microservice within another because he didn’t understand how the services were supposed to communicate if they were in separate processes (or on separate machines). These engineers also had a tough time understanding how services should be tested, deployed, and monitored because they were so used to the traditional “throw the service over the fence”to an admin approach to deployment. This basically lead to huge amounts of churn and loss of productivity.
Dynamics of Correlated Novelties (Nature) — paper on “the adjacent possible”. Here we propose a simple mathematical model that mimics the process of exploring a physical, biological, or conceptual space that enlarges whenever a novelty occurs. The model, a generalization of Polya’s urn, predicts statistical laws for the rate at which novelties happen (Heaps’ law) and for the probability distribution on the space explored (Zipf’s law), as well as signatures of the process by which one novelty sets the stage for another. (via Steven Strogatz)
NYC’s Dollar Vans (New Yorker) — New York’s unofficial shuttles, called “dollar vans” in some neighborhoods, make up a thriving transportation system that operates where the subway and buses don’t. A somewhat invisible economy. (via Seb Chan)
Eigenmorality — caution: linear algebra and morality, two subjects that many programmers struggle with. (via Pete Warden)
101 Uses for Content Mining — between the list in the post and the comments from readers, it’s a good introduction to some of the value to be obtained from full-text structured and unstructured access to scientific research publications.