"statistics" entries

Four short links: 1 September 2015

Four short links: 1 September 2015

People Detection, Ratings Patterns, Inspection Bias, and Cloud Filesystem

  1. End-to-End People Detection in Crowded Scenes — research paper and code. When parsing the title, bind “end-to-end” to “scenes” not “people”.
  2. Statistical Patterns in Movie Ratings (PLOSone) — We find that the distribution of votes presents scale-free behavior over several orders of magnitude, with an exponent very close to 3/2, with exponential cutoff. It is remarkable that this pattern emerges independently of movie attributes such as average rating, age and genre, with the exception of a few genres and of high-budget films.
  3. The Inspection Bias is EverywhereIn 1991, Scott Feld presented the “friendship paradox”: the observation that most people have fewer friends than their friends have. He studied real-life friends, but the same effect appears in online networks: if you choose a random Facebook user, and then choose one of their friends at random, the chance is about 80% that the friend has more friends. The friendship paradox is a form of the inspection paradox. When you choose a random user, every user is equally likely. But when you choose one of their friends, you are more likely to choose someone with a lot of friends. Specifically, someone with x friends is overrepresented by a factor of x.
  4. s3qla file system that stores all its data online using storage services like Google Storage, Amazon S3, or OpenStack. S3QL effectively provides a hard disk of dynamic, infinite capacity that can be accessed from any computer with internet access running Linux, FreeBSD or OS-X. (GPLv3)
Comment
Four short links: 15 July 2015

Four short links: 15 July 2015

OpeNSAurce, Multimaterial Printing, Functional Javascript, and Outlier Detection

  1. System Integrity Management Platform (Github) — NSA releases security compliance tool for government departments.
  2. 3D-Printed Explosive Jumping Robot Combines Firm and Squishy Parts (IEEE Spectrum) — Different parts of the robot grade over three orders of magnitude from stiff like plastic to squishy like rubber, through the use of nine different layers of 3D printed materials.
  3. Professor Frisby’s Mostly Adequate Guide to Functional Programming — a book on functional programming, using Javascript as the programming language.
  4. Tracking Down Villains — the software and algorithms that Netflix uses to detect outliers in their infrastructure monitoring.
Comment
Four short links: 23 June 2015

Four short links: 23 June 2015

Irregular Periodicity, Facebook Beacons, Industry 4.0, and Universal Container

  1. Fast Lomb-Scargle Periodograms in Pythona classic method for finding periodicity in irregularly-sampled data.
  2. Facebook Bluetooth Beacons — free for you to use and help people see more information about your business whenever they use Facebook during their visit.
  3. Industry 4.0 — stop gagging at the term. Interesting examples of connectivity and data improving manufacturing. Human-machine interfaces: Logistics company Knapp AG developed a picking technology using augmented reality. Pickers wear a headset that presents vital information on a see-through display, helping them locate items more quickly and precisely. And with both hands free, they can build stronger and more efficient pallets, with fragile items safeguarded. An integrated camera captures serial and lot ID numbers for real-time stock tracking. Error rates are down by 40%, among many other benefits. Digital-to-physical transfer: Local Motors builds cars almost entirely through 3-D printing, with a design crowdsourced from an online community. It can build a new model from scratch in a year, far less than the industry average of six. Vauxhall and GM, among others, still bend a lot of metal, but also use 3-D printing and rapid prototyping to minimize their time to market. (via Quartz)
  4. runCa lightweight universal runtime container, by the Open Container Project. (OCP = multi-vendor initiative in hands of Linux Foundation)
Comment
Four short links: 22 June 2015

Four short links: 22 June 2015

Power Analysis, Data at Scale, Open Source Fail, and Closing the Virtuous Loop

  1. Power Analysis of a Typical Psychology Experiment (Tom Stafford) — What this means is that if you don’t have a large effect, studies with between groups analysis and an n of less than 60 aren’t worth running. Even if you are studying a real phenomenon you aren’t using a statistical lens with enough sensitivity to be able to tell. You’ll get to the end and won’t know if the phenomenon you are looking for isn’t real or if you just got unlucky with who you tested.
  2. The Future of Data at ScaleData curation, on the other hand, is “the 800-pound gorilla in the corner,” says Stonebraker. “You can solve your volume problem with money. You can solve your velocity problem with money. Curation is just plain hard.” The traditional solution of extract, transform, and load (ETL) works for 10, 20, or 30 data sources, he says, but it doesn’t work for 500. To curate data at scale, you need automation and a human domain expert.
  3. Why Are We Still Explaining? (Stephen Walli) — Within 24 hours we received our first righteous patch. A simple 15-line change that provided a 10% boost in Just-in-Time compiler performance. And we politely thanked the contributor and explained we weren’t accepting changes yet. Another 24 hours and we received the first solid bug fix. It was golden. It included additional tests for the test suite to prove it was fixed. And we politely thanked the contributor and explained we weren’t accepting changes yet. And that was the last thing that was ever contributed.
  4. Blood Donors in Sweden Get a Text Message When Their Blood Helps Someone (Independent) — great idea to close the feedback loop. If you want to get more virtuous behaviour, make it a relationship and not a transaction. And if a warm feeling is all you have to offer in return, then offer it!
Comment
Four short links: 3 June 2015

Four short links: 3 June 2015

Filter Design, Real-Time Analytics, Neural Turing Machines, and Evaluating Subjective Opinions

  1. How to Design Applied FiltersThe most frequently observed issue during usability testing were filtering values changing placement when the user applied them – either to another position in the list of filtering values (typically the top) or to an “Applied filters” summary overview. During testing, the subjects were often confounded as they noticed that the filtering value they just clicked was suddenly “no longer there.”
  2. Twitter Herona real-time analytics platform that is fully API-compatible with Storm […] At Twitter, Heron is used as our primary streaming system, running hundreds of development and production topologies. Since Heron is efficient in terms of resource usage, after migrating all Twitter’s topologies to it we’ve seen an overall 3x reduction in hardware, causing a significant improvement in our infrastructure efficiency.
  3. ntman implementation of neural Turing machines. (via @fastml_extra)
  4. Bayesian Truth Seruma scoring system for eliciting and evaluating subjective opinions from a group of respondents, in situations where the user of the method has no independent means of evaluating respondents’ honesty or their ability. It leverages respondents’ predictions about how other respondents will answer the same questions. Through these predictions, respondents reveal their meta-knowledge, which is knowledge of what other people know.
Comment
Four short links: 7 April 2015

Four short links: 7 April 2015

JavaScript Numeric Methods, Misunderstood Statistics, Web Speed, and Sentiment Analysis

  1. NumericJS — numerical methods in JavaScript.
  2. P Values are not Error Probabilities (PDF) — In particular, we illustrate how this mixing of statistical testing methodologies has resulted in widespread confusion over the interpretation of p values (evidential measures) and α levels (measures of error). We demonstrate that this confusion was a problem between the Fisherian and Neyman–Pearson camps, is not uncommon among statisticians, is prevalent in statistics textbooks, and is well nigh universal in the pages of leading (marketing) journals. This mass confusion, in turn, has rendered applications of classical statistical testing all but meaningless among applied researchers.
  3. Breaking the 1000ms Time to Glass Mobile Barrier (YouTube) —
    See also slides. Stay under 250 ms to feel “fast.” Stay under 1000 ms to keep users’ attention.
  4. Modern Methods for Sentiment AnalysisRecently, Google developed a method called Word2Vec that captures the context of words, while at the same time reducing the size of the data. Gentle introduction, with code.
Comment: 1
Four short links: 26 March 2015

Four short links: 26 March 2015

GPU Graph Algorithms, Data Sharing, Build Like Google, and Distributed Systems Theory

  1. gunrocka CUDA library for graph primitives that refactors, integrates, and generalizes best-of-class GPU implementations of breadth-first search, connected components, and betweenness centrality into a unified code base useful for future development of high-performance GPU graph primitives. (via Ben Lorica)
  2. How to Share Data with a Statisticiansome instruction on the best way to share data to avoid the most common pitfalls and sources of delay in the transition from data collection to data analysis.
  3. Bazela build tool, i.e. a tool that will run compilers and tests to assemble your software, similar to Make, Ant, Gradle, Buck, Pants, and Maven. Google’s build tool, to be precise.
  4. You Can’t Have Exactly-Once Delivery — not about the worst post office ever. FLP and the Two Generals Problem are not design complexities, they are impossibility results.
Comment
Four short links: 18 February 2015

Four short links: 18 February 2015

Sales Automation, Clone Boxes, Stats Style, and Extra Orifices

  1. Systematising Sales with Software and Processes — sweet use of Slack as UI for sales tools.
  2. Duplicate SSH Keys EverywhereIt looks like all devices with the fingerprint are Dropbear SSH instances that have been deployed by Telefonica de Espana. It appears that some of their networking equipment comes set up with SSH by default, and the manufacturer decided to reuse the same operating system image across all devices.
  3. Style.ONS — UK govt style guide covers the elements of writing about statistics. It aims to make statistical content more open and understandable, based on editorial research and best practice. (via Hadley Beeman)
  4. Warren Ellis on the Apple WatchI, personally, want to put a gold chain on my phone, pop it into a waistcoat pocket, and refer to it as my “digital fob watch” whenever I check the time on it. Just to make the point in as snotty and high-handed a way as possible: This is the decadent end of the current innovation cycle, the part where people stop having new ideas and start adding filigree and extra orifices to the stuff we’ve got and call it the future.
Comment
Four short links: 19 December 2014

Four short links: 19 December 2014

Statistical Causality, Clustering Bitcoin, Hardware Security, and A Language for Scripts

  1. Distinguishing Cause and Effect using Observational Data — research paper evaluating effectiveness of the “additive noise” test, a nifty statistical trick to identify causal relationships from observational data. (via Slashdot)
  2. Clustering Bitcoin Accounts Using Heuristics (O’Reilly Radar) — In theory, a user can go by many different pseudonyms. If that user is careful and keeps the activity of those different pseudonyms separate, completely distinct from one another, then they can really maintain a level of, maybe not anonymity, but again, cryptographically it’s called pseudo-anonymity. […] It turns out in reality, though, the way most users and services are using bitcoin, was really not following any of the guidelines that you would need to follow in order to achieve this notion of pseudo-anonymity. So, basically, what we were able to do is develop certain heuristics for clustering together different public keys, or different pseudonyms.
  3. A Primer on Hardware Security: Models, Methods, and Metrics (PDF) — Camouflaging: This is a layout-level technique to hamper image-processing-based extraction of gate-level netlist. In one embodiment of camouflaging, the layouts of standard cells are designed to look alike, resulting in incorrect extraction of the netlist. The layout of nand cell and the layout of nor cell look different and hence their functionality can be extracted. However, the layout of a camouflaged nand cell and the layout of camouflaged nor cell can be made to look identical and hence an attacker cannot unambiguously extract their functionality.
  4. Prompter: A Domain-Specific Language for Versu (PDF) — literally a scripting language (you write theatrical-style scripts, characters, dialogues, and events) for an inference engine that lets you talk to characters and have a different story play out each time.
Comment
Four short links: 21 October 2014

Four short links: 21 October 2014

Data Delusions, OS Robotics, Insecure Crypto, and Free Icons

  1. The Delusions of Big Data (IEEE) — When you have large amounts of data, your appetite for hypotheses tends to get even larger. And if it’s growing faster than the statistical strength of the data, then many of your inferences are likely to be false. They are likely to be white noise.
  2. ROSCON 2014 — slides and videos of talks from Chicago open source robotics conference.
  3. Making Sure Crypto Stays Insecure (PDF) — Daniel J. Bernstein talk: This talk is actually a thought experiment: how could an attacker manipulate the ecosystem for insecurity?
  4. Material Design Icons — Google’s CC-licensed (attribution, sharealike) collection of sweet, straightforward icons.
Comment