- Reform Government Surveillance — hard not to view this as a demarcation dispute. “Ruthlessly collecting every detail of online behaviour is something we do clandestinely for advertising purposes, it shouldn’t be corrupted because of your obsession over national security!”
- Brian Abelson — Data Scientist at the New York Times, blogging what he finds. He tackles questions like what makes a news app “successful” and how might we measure it. Found via this engaging interview at the quease-makingly named Content Strategist.
- StageXL — Flash-like 2D package for Dart.
- BayesDB — lets users query the probable implications of their data as easily as a SQL database lets them query the data itself. Using the built-in Bayesian Query Language (BQL), users with no statistics training can solve basic data science problems, such as detecting predictive relationships between variables, inferring missing values, simulating probable observations, and identifying statistically similar database entries. Open source.
Surveillance Demarcation, NYT Data Scientist, 2D Dart, and Bayesian Database
R GUI, Drone Regulations, Bitcoin Stats, and Android/iOS Money Shootout
- Deducer — An R Graphical User Interface (GUI) for Everyone.
- Integration of Civil Unmanned Aircraft Systems (UAS) in the National Airspace System (NAS) Roadmap (PDF, FAA) — first pass at regulatory framework for drones. (via Anil Dash)
- Bitcoin Stats — $21MM traded, $15MM of electricity spent mining. Goodness. (via Steve Klabnik)
- iOS vs Android Numbers (Luke Wroblewski) — roundup comparing Android to iOS in recent commerce writeups. More Android handsets, but less revenue per download/impression/etc.
Warrant Canary, Polluted Statistics, Dollars for Deathbots, and Protocol Madness
- Apple Transparency Report (PDF) — contains a warrant canary, the statement Apple has never received an order under Section 215 of the USA Patriot Act. We would expect to challenge an order if served on us which will of course be removed if one of the secret orders is received. Bravo, Apple, for implementing a clever hack to route around excessive secrecy. (via Boing Boing)
- You’re Probably Polluting Your Statistics More Than You Think — it is insanely easy to find phantom correlations in random data without obviously being foolish. Anyone who thinks it’s possible to draw truthful conclusions from data analysis without really learning statistics needs to read this. (via Stijn Debrouwere)
- CyPhy Funded (Quartz) — the second act of iRobot co-founder Helen Greiner, maker of the famed Roomba robot vacuum cleaner. She terrified ETech long ago—the audience were expecting Roomba cuteness and got a keynote about military deathbots. It would appear she’s still in the deathbot niche, not so much with the cute. Remember this when you build your OpenCV-powered recoil-resistant load-bearing-hoverbot and think it’ll only ever be used for the intended purpose of launching fertiliser pellets into third world hemp farms.
- User-Agent String History — a light-hearted illustration of why the formal semantic value of free-text fields is driven to zero in the face of actual use.
- Android Guides — lots of info on coding for Android.
- Statistics Done Wrong — learn from these failure modes. Not medians or means. Modes.
- Streaming, Sketching, and Sufficient Statistics (YouTube) — how to process huge data sets as they stream past your CPU (e.g., those produced by sensors). (via Ben Lorica)
Publishing Bad Research, Reproducing Research, DIY Police Scanner, and Inventing the Future
- Science Not as Self-Correcting As It Thinks (Economist) — REALLY good discussion of the shortcomings in statistical practice by scientists, peer-review failures, and the complexities of experimental procedure and fuzziness of what reproducibility might actually mean.
- Reproducibility Initiative Receives Grant to Validate Landmark Cancer Studies — The key experimental findings from each cancer study will be replicated by experts from the Science Exchange network according to best practices for replication established by the Center for Open Science through the Center’s Open Science Framework, and the impact of the replications will be tracked on Mendeley’s research analytics platform. All of the ultimate publications and data will be freely available online, providing the first publicly available complete dataset of replicated biomedical research and representing a major advancement in the study of reproducibility of research.
- $20 SDR Police Scanner — using software-defined radio to listen to the police band.
- Reimagine the Chemistry Set — $50k prize in contest to design a “chemistry set” type kit that will engage kids as young as 8 and inspire people who are 88. We’re looking for ideas that encourage kids to explore, create, build and question. We’re looking for ideas that honor kids’ curiosity about how things work. Backed by the Moore Foundation and Society for Science and the Public.
Distrusting CA Certs, Brain Talk, Ineffective Interventions, and Visual A/B Tools
- Reducing the Roots of Some Evil (Etsy) — Based on our first two months of data we have removed a number of unused CA certificates from some pilot systems to test the effects, and will run CAWatch for a full six months to build up a more comprehensive view of what CAs are in active use. Sign of how broken the CA system for SSL is. (via Alex Dong)
- Mind the Brain — PLOS podcast interviews Sci Foo alum and delicious neuroscience brain of awesome, Vaughan Bell. (via Fabiana Kubke)
- How Often are Ineffective Interventions Still Used in Practice? (PLOSone) — tl;dr: 8% of the time. Imagine the number if you asked how often ineffective software development practices are still used.
- Announcing Evan’s Awesome A/B Tools — I am calling these tools awesome because they are intuitive, visual, and easy-to-use. Unlike other online statistical calculators you’ve probably seen, they’ll help you understand what’s going on “under the hood” of common statistical tests, and by providing ample visual context, they make it easy for you to explain p-values and confidence intervals to your boss. (And they’re free!)
Retreading old topics can be a powerful source of epiphany, sometimes more so than simple extra-box thinking. I was a computer science student, of course I knew statistics. But my recent years as a NoSQL (or better stated: distributed systems) junkie have irreparably colored my worldview, filtering every metaphor with a tinge of information management.
Lounging on a half-world plane ride has its benefits, namely, the opportunity to read. Most of my Delta flight from Tel Aviv back home to Portland lacked both wifi and (in my case) a workable laptop power source. So instead, I devoured Nate Silver’s book, The Signal and the Noise. When Nate reintroduced me to the concept of statistical overfit, and relatedly underfit, I could not help but consider these cases in light of the modern problem of distributed data management, namely, operators (you may call these operators DBAs, but please, not to their faces).
When collecting information, be it for a psychological profile of chimp mating rituals, or plotting datapoints in search of the Higgs Boson, the ultimate goal is to find some sort of usable signal, some trend in the data. Not every point is useful, and in fact, any individual could be downright abnormal. This is why we need several points to spot a trend. The world rarely gives us anything clearer than a jumble of anecdotes. But plotted together, occasionally a pattern emerges. This pattern, if repeatable and useful for prediction, becomes a working theory. This is science, and is generally considered a good method for making decisions.
On the other hand, when lacking experience, we tend to over value the experience of others when we assume they have more. This works in straightforward cases, like learning to cook a burger (watch someone make one, copy their process). This isn’t so useful as similarities diverge. Watching someone make a cake won’t tell you much about the process of crafting a burger. Folks like to call this cargo cult behavior.
How Fit are You, Bro?
You need to extract useful information from experience (which I’ll use the math-y sounding word datapoints). Having a collection of datapoints to choose from is useful, but that’s only one part of the process of decision-making. I’m not speaking of a necessarily formal process here, but in the case of database operators, merely a collection of experience. Reality tends to be fairly biased toward facts (despite the desire of many people for this to not be the case). Given enough experience, especially if that experience is factual, we tend to make better and better decisions more inline with reality. That’s pretty much the essence of prediction. Our mushy human brains are more-or-less good at that, at least, better than other animals. It’s why we have computers and Everybody Loves Raymond, and my cat pees in a box.
Imagine you have a sufficient amount of relevant datapoints that you can plot on a chart. Assuming the axes have any relation to each other, and the data is sound, a trend may emerge, such as a line, or some other bounding shape. A signal is relevant data that corresponds to the rules we discover by best fit. Noise is everything else. It’s somewhat circular sounding logic, and it’s really hard to know what is really a signal. This is why science is hard, and so is choosing a proper database. We’re always checking our assumptions, and one solid counter signal can really be disastrous for a model. We may have been wrong all along, missing only enough data. As Einstein famously said in response to the book 100 Authors Against Einstein: “If I were wrong, then one would have been enough!”
Database operators (and programmers forced to play this role) must make predictions all the time, against a seemingly endless series of questions. How much data can I handle? What kind of latency can I expect? How many servers will I need, and how much work to manage them?
So, like all decision making processes, we refer to experience. The problem is, as our industry demands increasing scale, very few people actually have much experience managing giant scale systems. We tend to draw our assumptions from our limited, or biased smaller scale experience, and extrapolate outward. The theories we then tend to concoct are not the optimal fit that we desire, but instead tend to be overfit.
Overfit is when we have a limited amount of data, and overstate its general implications. If we imagine a plot of likely failure scenarios against a limited number of servers, we may be tempted to believe our biggest odds of failure are insufficient RAM, or disk failure. After all, my network has never given me problems, but I sure have lost a hard drive or two. We take these assumptions, which are only somewhat relevant to the realities of scalable systems and divine some rules for ourselves that entirely miss the point.
In a real distributed system, network issues tend to consume most of our interest. Single-server consistency is a solved problem, and most (worthwhile) distributed databases have some sense of built in redundancy (usually replication, the root of all distributed evil).
Exploiting Glass, Teaching Probability, Product Design, and Subgraph Matching
- Exploiting a Bug in Google Glass — unbelievably detailed and yet easy-to-follow explanation of how the bug works, how the author found it, and how you can exploit it too. The second guide was slightly more technical, so when he returned a little later I asked him about the Debug Mode option. The reaction was interesting: he kind of looked at me, somewhat confused, and asked “wait, what version of the software does it report in Settings”? When I told him “XE4″ he clarified “XE4, not XE3″, which I verified. He had thought this feature had been removed from the production units.
- Probability Through Problems — motivating problems to hook students on probability questions, structured to cover high-school probability material.
- Connbox — love the section “The importance of legible products” where the physical UI interacts seamless with the digital device … it’s glorious. Three amazing videos.
- The Index-Based Subgraph Matching Algorithm (ISMA): Fast Subgraph Enumeration in Large Networks Using Optimized Search Trees (PLoSONE) — The central question in all these fields is to understand behavior at the level of the whole system from the topology of interactions between its individual constituents. In this respect, the existence of network motifs, small subgraph patterns which occur more often in a network than expected by chance, has turned out to be one of the defining properties of real-world complex networks, in particular biological networks. […] An implementation of ISMA in Java is freely available.
The importance of data science tools that let organizations easily combine, deploy, and maintain algorithms
Data science often depends on data pipelines, that involve acquiring, transforming, and loading data. (If you’re fortunate most of the data you need is already in usable form.) Data needs to be assembled and wrangled, before it can be visualized and analyzed. Many companies have data engineers (adept at using workflow tools like Azkaban and Oozie), who manage1 pipelines for data scientists and analysts.
A workflow tool for data analysts: Chronos from airbnb
A raw bash scheduler written in Scala, Chronos is flexible, fault-tolerant2, and distributed (it’s built on top of Mesos). What’s most interesting is that it makes the creation and maintenance of complex workflows more accessible: at least within airbnb, it’s heavily used by analysts.
Job orchestration and scheduling tools contain features that data scientists would appreciate. They make it easy for users to express dependencies (start a job upon the completion of another job), and retries (particularly in cloud computing settings, jobs can fail for a variety of reasons). Chronos comes with a web UI designed to let business analysts3 define, execute, and monitor workflows: a zoomable DAG highlights failed jobs and displays stats that can be used to identify bottlenecks. Chronos lets you include asynchronous jobs – a nice feature for data science pipelines that involve long-running calculations. It also lets you easily define repeating jobs over a finite time interval, something that comes in handy for short-lived4 experiments (e.g. A/B tests or multi-armed bandits).
Moving beyond traditional tools makes data analysis faster and more powerful
Garrett Grolemund is an O’Reilly author and teaches classes on data analysis for R Studios.
We sat down to discuss why data scientists, statisticians, and programmers alike can use the R language to make data analysis easier and more powerful.
Key points from the full video (below) interview include:
- R is a free, open-source language that has its roots in S-PLUS [Discussed at the 0:27 mark]
- What does it mean for R to be a programming language versus just a data analysis tool? [Discussed at the 1:00 mark]
- R comes with many useful data analysis methods already implemented, so you don’t have to start from scratch. [Discussed at the 4:23 mark]
- R is a mix of functional and object-oriented programming that is optimal for handling data structures that data analysts expect (e.g. vectors) [Discussed at the 6:08 mark]
- A discussion of using R in conjunction with other languages like Python, along with packages that help with this [Discussed at the 7:30 mark]
- Getting started using R isn’t really any harder than using a calculator [Discussed at the 9:28 mark]
You can view the entire interview in the following video.