"statistics" entries

Simpler workflow tools enable the rapid deployment of models

The importance of data science tools that let organizations easily combine, deploy, and maintain algorithms

Data science often depends on data pipelines that involve acquiring, transforming, and loading data. (If you’re fortunate, most of the data you need is already in usable form.) Data needs to be assembled and wrangled before it can be visualized and analyzed. Many companies have data engineers (adept at using workflow tools like Azkaban and Oozie) who manage pipelines for data scientists and analysts.

A workflow tool for data analysts: Chronos from Airbnb
A raw bash scheduler written in Scala, Chronos is flexible, fault-tolerant, and distributed (it’s built on top of Mesos). What’s most interesting is that it makes the creation and maintenance of complex workflows more accessible: at least within Airbnb, it’s heavily used by analysts.

Job orchestration and scheduling tools contain features that data scientists would appreciate. They make it easy for users to express dependencies (start a job upon the completion of another job) and retries (particularly in cloud computing settings, jobs can fail for a variety of reasons). Chronos comes with a web UI designed to let business analysts define, execute, and monitor workflows: a zoomable DAG highlights failed jobs and displays stats that can be used to identify bottlenecks. Chronos lets you include asynchronous jobs – a nice feature for data science pipelines that involve long-running calculations. It also lets you easily define repeating jobs over a finite time interval, something that comes in handy for short-lived experiments (e.g. A/B tests or multi-armed bandits).
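To make the two ideas of dependencies and retries concrete, here is a minimal sketch in plain Python. It is not Chronos code and does not use the Chronos API: the `run_workflow` function, the job names, and the simulated failure are all hypothetical, and the only library call is the standard `graphlib.TopologicalSorter` (Python 3.9 or later).

```python
from graphlib import TopologicalSorter


def run_workflow(jobs, deps, max_retries=2):
    """Run jobs in dependency order, retrying each job that fails.

    jobs: dict mapping a job name to a zero-argument callable.
    deps: dict mapping every job name to the set of jobs it waits for.
    Returns a dict mapping each job name to the attempts it needed.
    """
    attempts = {}
    # static_order() yields each job only after all of its prerequisites.
    for name in TopologicalSorter(deps).static_order():
        for attempt in range(1, max_retries + 2):
            try:
                jobs[name]()
                attempts[name] = attempt
                break
            except Exception:
                if attempt == max_retries + 1:
                    raise  # out of retries: stop the whole workflow
    return attempts


# Hypothetical three-step pipeline; "transform" fails once, then succeeds.
log = []
transform_calls = {"count": 0}


def extract():
    log.append("extract")


def transform():
    transform_calls["count"] += 1
    if transform_calls["count"] == 1:
        raise RuntimeError("simulated transient failure")
    log.append("transform")


def load():
    log.append("load")


jobs = {"extract": extract, "transform": transform, "load": load}
deps = {"extract": set(), "transform": {"extract"}, "load": {"transform"}}
attempts = run_workflow(jobs, deps)
print(attempts)
```

A real scheduler adds what this sketch leaves out: running independent jobs in parallel, waiting between retries, and persisting state so a workflow survives a machine failure.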

Read more…

R as a Programming Language

Moving beyond traditional tools makes data analysis faster and more powerful

Garrett Grolemund is an O’Reilly author and teaches classes on data analysis for RStudio.

We sat down to discuss why data scientists, statisticians, and programmers alike can use the R language to make data analysis easier and more powerful.

Key points from the full video interview (below) include:

  • R is a free, open-source language that has its roots in S-PLUS [Discussed at the 0:27 mark]
  • What does it mean for R to be a programming language versus just a data analysis tool? [Discussed at the 1:00 mark]
  • R comes with many useful data analysis methods already implemented, so you don’t have to start from scratch. [Discussed at the 4:23 mark]
  • R is a mix of functional and object-oriented programming that is optimal for handling data structures that data analysts expect (e.g. vectors) [Discussed at the 6:08 mark]
  • A discussion of using R in conjunction with other languages like Python, along with packages that help with this [Discussed at the 7:30 mark]
  • Getting started using R isn’t really any harder than using a calculator [Discussed at the 9:28 mark]

You can view the entire interview in the following video.

Four short links: 4 February 2013

Enlightened Tinkering, In-Browser Tor Proxy, Dark Patterns, and Subjective Data

  1. Hands on Learning (HuffPo) — Unfortunately, engaged and enlightened tinkering is disappearing from contemporary American childhood. (via BoingBoing)
  2. FlashProxy (Stanford) — a miniature proxy that runs in a web browser. It checks for clients that need access, then conveys data between them and a Tor relay. […] If your browser runs JavaScript and has support for WebSockets then while you are viewing this page your browser is a potential proxy available to help censored Internet users.
  3. Dark Patterns (Slideshare) — User interfaces to trick people. (via Beta Knowledge)
  4. Bill Gates is Naive: Data Are Not Objective (Math Babe) — examples at the end of biased models/data should be on the wall of everyone analyzing data. (via Karl Fisch)
Four short links: 4 January 2013

SSH/L Multiplexer, GitHub Bots, Test Your Assumptions, and Tech Trends

  1. sslh — ssh/ssl multiplexer.
  2. Github Says No to Bots (Wired) — what’s interesting is that bots augmenting photos is awesome in Flickr: take a photo of the sky and you’ll find your photo annotated with stars and whatnot. What can GitHub learn from Flickr?
  3. Four Assumptions of Multiple Regression That Researchers Should Always Test — “but I found the answer I wanted! What do you mean, it might be wrong?!”
  4. Tenth Grade Tech Trends (Medium) — if you want to know what will have mass success, talk to early adopters in the mass market. We alpha geeks aren’t that any more.
Four short links: 27 November 2012

Faking with Stats, Praising Coworkers, Medium Explained, and SIGGraph Trailer

  1. Statistical Misdirection Master Class — examples from Fox News. The further through the list you go, the more horrifying^Wedifying they are. Some are clearly classics from the literature, but some are (as far as I can tell) newly developed graphical “persuasion” techniques.
  2. Wall of Awesome — give your coworkers some love.
  3. Dave Winer on Medium — Dave hits some interesting points: Users can create new buckets or collections and call them anything they want. A bucket is analogous to a blog post. Then other people can post to it. That’s like a comment. But it doesn’t look like a comment. It’s got a place for a big image at the top. It looks much prettier than a comment, and much bigger. Looks are important here.
  4. SIGGraph Asia Trailer (YouTube) — resuiting Sims and rotating city blocks, at the end, were my favourite. (via Andy Baio)
Four short links: 10 October 2012

Intuitive Linear Algebra, Bayes Intro, State of Javascript, and Web App Builders

  1. An Intuitive Guide to Linear Algebra — “Here’s the linear algebra introduction I wish I had.” I wish I’d had it, too. (via Hacker News)
  2. Think Bayes — an introduction to Bayesian statistics using computational methods.
  3. The State of Javascript 2012 (Brendan Eich) — Javascript continues its march up and down the stack, simultaneously becoming an application language while becoming the bytecode for the world.
  4. Divshot — a startup turning mockups into web apps, built on top of the Bootstrap front-end framework. I feel momentum and a tipping point approaching, where building things on the web is about to get easier again (the way it did with Ruby on Rails). cf Jetstrap.
Four short links: 9 October 2012

ID-based Democracy, Web Documentation, American Telco Gouging, and Stats Cookbook

  1. Finland Crowdsourcing New Laws (GigaOm) — online referenda. The Finnish government enabled something called a “citizens’ initiative”, through which registered voters can come up with new laws – if they can get 50,000 of their fellow citizens to back them up within six months, then the Eduskunta (the Finnish parliament) is forced to vote on the proposal. Now this crowdsourced law-making system is about to go online through a platform called the Open Ministry. Petitions and online voting are notoriously prone to fraud, so it will be interesting to see how well the online identity system behind this holds up.
  2. WebPlatform — wiki of information about developing for the open web. Joint production of many of the $BIGCOs of the web and the W3C, so will be interesting to see, as it develops, whether it has the best aspects of each or the worst.
  3. Why Your Phone, Cable, Internet Bills Cost So Much (Yahoo) — “The companies essentially have a business model that is antithetical to economic growth,” he says. “Profits go up if they can provide slow Internet at super high prices.” Excellent piece!
  4. Probability and Statistics Cookbook (Matthias Vallentin) — The cookbook contains a succinct representation of various topics in probability theory and statistics. It provides a comprehensive reference reduced to the mathematical essence, rather than aiming for elaborate explanations. CC-BY-NC-SA licensed, LaTeX source on github.

Statwing simplifies data analysis

Quickly perform and interpret the results of routine Small Data analysis

With so much focus on Big Data, the needs of many analysts who work with Small Data tend to get ignored. The default tools for many of these users remain spreadsheets (1) and statistical packages that come with a lot of features and options. However, many analysts need only a very small subset of what these tools have to offer.

Enter Statwing, a software-as-a-service provider for routine statistical analysis. While the tool is still in the early stages, it can already do many basic “data analysis” tasks.

Consider the following example of a pivot table constructed in Excel: this required 8 mouse-clicks, if you do everything perfectly, and about 5 decisions (what variables to include, what metric to use, …)

The same task in Statwing required 4 mouse-clicks and 0 decisions! Plus it comes with visuals:

The lack of clutter and the addition of a simple “headline” (“Female tends to have much higher values for satisfaction than Male”) make the result much easier to interpret. The advanced tab contains detailed statistical analysis (in this case the p-value, counts, values). Many users get confused by the output produced by traditional statistical software. Let’s face it: many analysts have had little training in statistics. I welcome a tool that produces readily interpretable results.
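For readers curious what sits behind a headline plus a p-value, here is a rough sketch using only the Python standard library. It is not Statwing’s method, which the post does not describe: the permutation test is a stand-in chosen because it needs no extra packages, the function names are hypothetical, and the satisfaction scores are made up for illustration.

```python
import random
from statistics import mean


def permutation_p_value(a, b, n_resamples=2000, seed=0):
    """Two-sided permutation test for a difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_resamples):
        # Reassign the group labels at random and recompute the gap.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_resamples + 1)


def headline(label_a, a, label_b, b):
    """Return a one-line summary naming the group with the higher mean."""
    if mean(a) >= mean(b):
        higher, lower = label_a, label_b
    else:
        higher, lower = label_b, label_a
    return f"{higher} tends to have higher values than {lower}"


# Made-up satisfaction scores on a 1-5 scale, for illustration only.
female = [5, 4, 5, 4, 5, 4, 5, 5]
male = [2, 3, 2, 3, 1, 2, 3, 2]

summary = headline("Female", female, "Male", male)
p_value = permutation_p_value(female, male)
print(summary)
print(f"p-value (permutation test): {p_value:.4f}")
```

The point of the split is the same one the tool makes: the headline goes up front for everyone, and the p-value stays available for those who want the detail.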

The company hopes to replicate the above example across a wide variety of routine data analysis tasks. Their initial focus is on tools for (consumer) survey analysis, a potentially huge market given that online companies have made surveys so much easier to conduct. Users of Statwing pay a small monthly subscription, making it cheaper than most (2) statistical packages, and its intuitive UI lets analysts get their tasks done quickly. More importantly, Statwing may nurture aspiring data scientists in your organization.


(1) As this recent Strata presentation points out: Spreadsheets are the glue that keeps many organizations together.

(2) Open source tools like OpenOffice, R and Octave are free. So is the use of Google spreadsheets.

Digging into the UDID data

The UDID story has conflicting theories, so the only real thing we have to work with is the data.

Over the weekend the hacker group Antisec released one million UDID records that they claim to have obtained from an FBI laptop using a Java vulnerability. In reply the FBI stated:

The FBI is aware of published reports alleging that an FBI laptop was compromised and private data regarding Apple UDIDs was exposed. At this time there is no evidence indicating that an FBI laptop was compromised or that the FBI either sought or obtained this data.

Of course, that statement leaves a lot of leeway. It could be the agent’s personal laptop, and the data may well have been “property” of another agency. The wording doesn’t even explicitly rule out the possibility that this was an agency laptop; they just say that right now they don’t have any evidence to suggest that it was.

This limited data release doesn’t have much impact, but the possible release of the full dataset, which is claimed to include names, addresses, phone numbers and other identifying information, is far more worrying.

While some are almost dismissing the issue out of hand, the real issues here are: Where did the data originate? Which devices did it come from and what kind of users does this data represent? Is this data from a cross-section of the population, or a specifically targeted demographic? Does it originate within the law enforcement community, or from an external developer? What was the purpose of the data, and why was it collected?

With conflicting stories from all sides, the only thing we can believe is the data itself. The 40-character strings in the release at least look like UDID numbers, and anecdotally at least we have a third-party confirmation that this really is valid UDID data. We therefore have to proceed at this point as if this is real data. While there is a possibility that some, most, or all of the data is falsified, that’s looking unlikely from where we’re standing at the moment.
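A first sanity check on whether strings “look like” UDIDs can be sketched in a few lines of Python. The assumption here, which goes beyond what the text above states, is that a UDID is written as exactly 40 hexadecimal characters. The function names and sample strings are hypothetical, and none of the strings come from the released file.

```python
import re

# Assumption: a UDID is exactly 40 hexadecimal characters.
UDID_PATTERN = re.compile(r"[0-9a-fA-F]{40}")


def looks_like_udid(value):
    """Return True if the string has the shape of a 40-character hex UDID."""
    return UDID_PATTERN.fullmatch(value.strip()) is not None


def share_well_formed(records):
    """Return the fraction of records that pass the shape check."""
    records = list(records)
    if not records:
        return 0.0
    return sum(looks_like_udid(r) for r in records) / len(records)


# Made-up sample values, for illustration only.
samples = [
    "0123456789abcdef0123456789abcdef01234567",  # 40 hex characters
    "not-a-udid",                                # wrong length and alphabet
]
print(share_well_formed(samples))
```

Passing this check says only that a string has the right shape. It cannot show that the identifier belongs to a real device, which is why the third-party confirmation matters.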

Read more…

Four short links: 8 August 2012

Reading Minds, Satellites in the Cloud, Units for Risk, and Valuing Autism

  1. Reconstructing Visual Experiences (PDF) — early visual areas represent the information in movies. To demonstrate the power of our approach, we also constructed a Bayesian decoder by combining estimated encoding models with a sampled natural movie prior. The decoder provides remarkable reconstructions of the viewed movies. These results demonstrate that dynamic brain activity measured under naturalistic conditions can be decoded using current fMRI technology.
  2. Earth Engine — satellite imagery and API for coding against it, to do things like detecting deforestation, classifying land cover, estimating forest biomass and carbon, and mapping the world’s roadless areas.
  3. Microlives — a microlife is 30 minutes of your life expectancy. Here are some things that would, on average, cost a 30-year-old man 1 microlife: Smoking 2 cigarettes; Drinking 7 units of alcohol (e.g. 2 pints of strong beer); Each day of being 5 kg overweight. A chest X-ray will set a middle-aged person back around 2 microlives, while a whole body CT-scan would weigh in at around 180 microlives.
  4. Autistics Need Opportunities More Than Treatment — Laurent gave a powerful talk at Sci Foo: if the autistic brain is better at pattern matching, find jobs where that’s useful. Like, say, science. The autistic woman who was delivering mail became a research assistant in his lab and now has papers galore to her name for original research.
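The microlife figures in item 3 above convert to clock time with simple arithmetic. This is a small sketch: the 30-minute definition and the two example costs come from the item itself, while the function name is made up for illustration.

```python
MINUTES_PER_MICROLIFE = 30  # definition quoted in item 3


def microlives_to_hours(microlives):
    """Convert a cost in microlives to hours of life expectancy."""
    return microlives * MINUTES_PER_MICROLIFE / 60


chest_xray_hours = microlives_to_hours(2)    # 2 * 30 min = 1 hour
ct_scan_hours = microlives_to_hours(180)     # 180 * 30 min = 90 hours
print(chest_xray_hours, ct_scan_hours)
```

Put that way, the whole body CT-scan costs a little under four days of life expectancy on average, against one hour for the chest X-ray.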