Statistical Patterns in Movie Ratings (PLOS ONE) — We find that the distribution of votes presents scale-free behavior over several orders of magnitude, with an exponent very close to 3/2, with exponential cutoff. It is remarkable that this pattern emerges independently of movie attributes such as average rating, age and genre, with the exception of a few genres and of high-budget films.
The Inspection Paradox is Everywhere — In 1991, Scott Feld presented the “friendship paradox”: the observation that most people have fewer friends than their friends have. He studied real-life friends, but the same effect appears in online networks: if you choose a random Facebook user, and then choose one of their friends at random, the chance is about 80% that the friend has more friends. The friendship paradox is a form of the inspection paradox. When you choose a random user, every user is equally likely. But when you choose one of their friends, you are more likely to choose someone with a lot of friends. Specifically, someone with x friends is overrepresented by a factor of x.
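The overrepresentation effect is easy to reproduce on a toy network. The sketch below is an illustration only: the star-shaped graph, the function names, and the trial count are assumptions made for the example, not anything taken from the linked article or from Facebook data.

```python
import random


def star_graph(leaves):
    """Toy network (an assumption): user 0 is friends with everyone, nobody else is connected."""
    adjacency = {0: list(range(1, leaves + 1))}
    for leaf in range(1, leaves + 1):
        adjacency[leaf] = [0]
    return adjacency


def mean_degree(adjacency):
    """Average friend count when every user is equally likely to be picked."""
    degrees = [len(friends) for friends in adjacency.values()]
    return sum(degrees) / len(degrees)


def size_biased_mean_degree(adjacency):
    """Average friend count when a user with x friends is counted x times."""
    degrees = [len(friends) for friends in adjacency.values()]
    return sum(d * d for d in degrees) / sum(degrees)


def friend_has_more_fraction(adjacency, trials=20000, seed=0):
    """Pick a random user, then one of their friends at random; how often does the friend have more friends?"""
    rng = random.Random(seed)
    users = [user for user, friends in adjacency.items() if friends]
    more = 0
    for _ in range(trials):
        user = rng.choice(users)
        friend = rng.choice(adjacency[user])
        if len(adjacency[friend]) > len(adjacency[user]):
            more += 1
    return more / trials


network = star_graph(9)
print(mean_degree(network))              # plain average
print(size_biased_mean_degree(network))  # average seen through friendships
print(friend_has_more_fraction(network))
```

On real networks the gap is less extreme than in a star, but the direction is the same: sampling through friendships weights each user by their own friend count.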
s3ql — a file system that stores all its data online using storage services like Google Storage, Amazon S3, or OpenStack. S3QL effectively provides a hard disk of dynamic, infinite capacity that can be accessed from any computer with internet access running Linux, FreeBSD or OS X. (GPLv3)
Facebook Bluetooth Beacons — free for you to use and help people see more information about your business whenever they use Facebook during their visit.
Industry 4.0 — stop gagging at the term. Interesting examples of connectivity and data improving manufacturing. Human-machine interfaces: Logistics company Knapp AG developed a picking technology using augmented reality. Pickers wear a headset that presents vital information on a see-through display, helping them locate items more quickly and precisely. And with both hands free, they can build stronger and more efficient pallets, with fragile items safeguarded. An integrated camera captures serial and lot ID numbers for real-time stock tracking. Error rates are down by 40%, among many other benefits. Digital-to-physical transfer: Local Motors builds cars almost entirely through 3-D printing, with a design crowdsourced from an online community. It can build a new model from scratch in a year, far less than the industry average of six. Vauxhall and GM, among others, still bend a lot of metal, but also use 3-D printing and rapid prototyping to minimize their time to market. (via Quartz)
runC — a lightweight universal runtime container, by the Open Container Project. (OCP = multi-vendor initiative in hands of Linux Foundation)
Power Analysis of a Typical Psychology Experiment (Tom Stafford) — What this means is that if you don’t have a large effect, studies with between groups analysis and an n of less than 60 aren’t worth running. Even if you are studying a real phenomenon you aren’t using a statistical lens with enough sensitivity to be able to tell. You’ll get to the end and won’t know if the phenomenon you are looking for isn’t real or if you just got unlucky with who you tested.
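A rough feel for the numbers can be had from a normal approximation to the two-group comparison. This is a minimal sketch under stated assumptions, not the calculation from the linked post: the effect size is taken to be Cohen's d, `n_per_group` is assumed to be the size of each group, and the function name `approx_power` is made up for the example.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-group comparison.

    Uses a z-test approximation (known, equal variances), so it slightly
    overstates the power of a real t-test at small n.
    """
    standard_normal = NormalDist()
    z_crit = standard_normal.inv_cdf(1 - alpha / 2)
    # Expected value of the test statistic when the true effect is effect_size.
    shift = effect_size * sqrt(n_per_group / 2)
    return standard_normal.cdf(shift - z_crit) + standard_normal.cdf(-shift - z_crit)


for n in (20, 30, 60, 100):
    print(n, round(approx_power(0.5, n), 2))
```

With a medium effect (d = 0.5, an assumption), the approximation stays well below the conventional 80% power target until the groups get fairly large, which is the point the excerpt makes.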
The Future of Data at Scale — Data curation, on the other hand, is “the 800-pound gorilla in the corner,” says Stonebraker. “You can solve your volume problem with money. You can solve your velocity problem with money. Curation is just plain hard.” The traditional solution of extract, transform, and load (ETL) works for 10, 20, or 30 data sources, he says, but it doesn’t work for 500. To curate data at scale, you need automation and a human domain expert.
Why Are We Still Explaining? (Stephen Walli) — Within 24 hours we received our first righteous patch. A simple 15-line change that provided a 10% boost in Just-in-Time compiler performance. And we politely thanked the contributor and explained we weren’t accepting changes yet. Another 24 hours and we received the first solid bug fix. It was golden. It included additional tests for the test suite to prove it was fixed. And we politely thanked the contributor and explained we weren’t accepting changes yet. And that was the last thing that was ever contributed.
How to Design Applied Filters — The most frequently observed issue during usability testing was filtering values changing placement when the user applied them – either to another position in the list of filtering values (typically the top) or to an “Applied filters” summary overview. During testing, the subjects were often confounded as they noticed that the filtering value they just clicked was suddenly “no longer there.”
Twitter Heron — a real-time analytics platform that is fully API-compatible with Storm […] At Twitter, Heron is used as our primary streaming system, running hundreds of development and production topologies. Since Heron is efficient in terms of resource usage, after migrating all Twitter’s topologies to it we’ve seen an overall 3x reduction in hardware, causing a significant improvement in our infrastructure efficiency.
Bayesian Truth Serum — a scoring system for eliciting and evaluating subjective opinions from a group of respondents, in situations where the user of the method has no independent means of evaluating respondents’ honesty or their ability. It leverages respondents’ predictions about how other respondents will answer the same questions. Through these predictions, respondents reveal their meta-knowledge, which is knowledge of what other people know.
P Values are not Error Probabilities (PDF) — In particular, we illustrate how this mixing of statistical testing methodologies has resulted in widespread confusion over the interpretation of p values (evidential measures) and α levels (measures of error). We demonstrate that this confusion was a problem between the Fisherian and Neyman–Pearson camps, is not uncommon among statisticians, is prevalent in statistics textbooks, and is well nigh universal in the pages of leading (marketing) journals. This mass confusion, in turn, has rendered applications of classical statistical testing all but meaningless among applied researchers.
Modern Methods for Sentiment Analysis — Recently, Google developed a method called Word2Vec that captures the context of words, while at the same time reducing the size of the data. Gentle introduction, with code.
gunrock — a CUDA library for graph primitives that refactors, integrates, and generalizes best-of-class GPU implementations of breadth-first search, connected components, and betweenness centrality into a unified code base useful for future development of high-performance GPU graph primitives. (via Ben Lorica)
How to Share Data with a Statistician — some instruction on the best way to share data to avoid the most common pitfalls and sources of delay in the transition from data collection to data analysis.
Bazel — a build tool, i.e. a tool that will run compilers and tests to assemble your software, similar to Make, Ant, Gradle, Buck, Pants, and Maven. Google’s build tool, to be precise.
Duplicate SSH Keys Everywhere — It looks like all devices with the fingerprint are Dropbear SSH instances that have been deployed by Telefonica de Espana. It appears that some of their networking equipment comes set up with SSH by default, and the manufacturer decided to reuse the same operating system image across all devices.
Style.ONS — UK govt style guide covers the elements of writing about statistics. It aims to make statistical content more open and understandable, based on editorial research and best practice. (via Hadley Beeman)
Warren Ellis on the Apple Watch — I, personally, want to put a gold chain on my phone, pop it into a waistcoat pocket, and refer to it as my “digital fob watch” whenever I check the time on it. Just to make the point in as snotty and high-handed a way as possible: This is the decadent end of the current innovation cycle, the part where people stop having new ideas and start adding filigree and extra orifices to the stuff we’ve got and call it the future.
Clustering Bitcoin Accounts Using Heuristics (O’Reilly Radar) — In theory, a user can go by many different pseudonyms. If that user is careful and keeps the activity of those different pseudonyms separate, completely distinct from one another, then they can really maintain a level of, maybe not anonymity, but again, cryptographically it’s called pseudo-anonymity. […] It turns out in reality, though, the way most users and services are using bitcoin, was really not following any of the guidelines that you would need to follow in order to achieve this notion of pseudo-anonymity. So, basically, what we were able to do is develop certain heuristics for clustering together different public keys, or different pseudonyms.
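The excerpt does not say which heuristics were used. One widely described heuristic, assumed here purely for illustration, is that addresses spent together as inputs to the same transaction are probably controlled by the same party. The sketch below applies that assumption with a union-find structure; the transaction format (a list of input-address lists) and the function name are hypothetical.

```python
def cluster_addresses(transactions):
    """Group addresses that ever appear together as inputs to one transaction.

    `transactions` is a list of input-address lists (a made-up format for
    this example). Returns a sorted list of sorted address clusters.
    """
    parent = {}

    def find(address):
        parent.setdefault(address, address)
        while parent[address] != address:
            # Path halving keeps the trees shallow.
            parent[address] = parent[parent[address]]
            address = parent[address]
        return address

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for inputs in transactions:
        if not inputs:
            continue
        first = inputs[0]
        find(first)  # register single-input transactions too
        for other in inputs[1:]:
            union(first, other)

    clusters = {}
    for address in list(parent):
        clusters.setdefault(find(address), set()).add(address)
    return sorted(sorted(cluster) for cluster in clusters.values())


example = [["A", "B"], ["B", "C"], ["D"], ["E", "F"]]
print(cluster_addresses(example))
```

Because A and B are spent together and B and C are spent together, all three pseudonyms collapse into one cluster, which is how keeping pseudonyms only loosely separate erodes pseudo-anonymity.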
A Primer on Hardware Security: Models, Methods, and Metrics (PDF) — Camouflaging: This is a layout-level technique to hamper image-processing-based extraction of the gate-level netlist. In one embodiment of camouflaging, the layouts of standard cells are designed to look alike, resulting in incorrect extraction of the netlist. The layout of a NAND cell and the layout of a NOR cell look different, and hence their functionality can be extracted. However, the layout of a camouflaged NAND cell and the layout of a camouflaged NOR cell can be made to look identical, and hence an attacker cannot unambiguously extract their functionality.
Prompter: A Domain-Specific Language for Versu (PDF) — literally a scripting language (you write theatrical-style scripts, characters, dialogues, and events) for an inference engine that lets you talk to characters and have a different story play out each time.
The Delusions of Big Data (IEEE) — When you have large amounts of data, your appetite for hypotheses tends to get even larger. And if it’s growing faster than the statistical strength of the data, then many of your inferences are likely to be false. They are likely to be white noise.
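The effect is easy to see by testing many hypotheses on data that contains no signal at all. The following is a small simulation sketch, assuming independent tests, unit-variance Gaussian noise, and a simple z-test; the function name and parameters are made up for the example and do not come from the interview.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def count_false_positives(num_hypotheses, n_per_group=50, alpha=0.05, seed=0):
    """Compare two groups of pure noise many times and count 'significant' results."""
    rng = random.Random(seed)
    standard_normal = NormalDist()
    false_positives = 0
    for _ in range(num_hypotheses):
        group_a = [rng.gauss(0, 1) for _ in range(n_per_group)]
        group_b = [rng.gauss(0, 1) for _ in range(n_per_group)]
        # z statistic for a difference in means with known variance 1.
        z = (fmean(group_a) - fmean(group_b)) / sqrt(2 / n_per_group)
        p_value = 2 * (1 - standard_normal.cdf(abs(z)))
        if p_value < alpha:
            false_positives += 1
    return false_positives


def chance_of_any_false_positive(num_hypotheses, alpha=0.05):
    """Probability that at least one of several independent null tests comes out 'significant'."""
    return 1 - (1 - alpha) ** num_hypotheses


print(count_false_positives(2000))
print(chance_of_any_false_positive(20))
```

Every "finding" counted here is noise by construction, and the count grows in proportion to the number of hypotheses tried, while the data behind each test stays the same.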
ROSCon 2014 — slides and videos of talks from the Chicago open source robotics conference.
Making Sure Crypto Stays Insecure (PDF) — Daniel J. Bernstein talk: This talk is actually a thought experiment: how could an attacker manipulate the ecosystem for insecurity?
Material Design Icons — Google’s CC-licensed (attribution, sharealike) collection of sweet, straightforward icons.