"math" entries

Are your data normal? Hint: no.

One of the frequently-asked questions over at the statistics subreddit (reddit.com/r/statistics) is how to test whether a dataset is drawn from a particular distribution, most often the normal distribution.

There are standard tests for this sort of thing, many with double-barreled names like Anderson-Darling, Kolmogorov-Smirnov, Shapiro-Wilk, Ryan-Joiner, etc.

But these tests are almost never what you really want. When people ask these questions, what they really want to know (most of the time) is whether a particular distribution is a good model for a dataset. And that’s not a statistical test; it is a modeling decision.

All statistical analysis is based on models, and all models are based on simplifications. Models are only useful if they are simpler than the real world, which means you have to decide which aspects of the real world to include in the model, and which things you can leave out.

For example, the normal distribution is a good model for many physical quantities. The distribution of human height is approximately normal (see this previous blog post). But human heights are not normally distributed. For one thing, human heights are bounded within a narrow range, and the normal distribution goes to infinity in both directions. But even ignoring the non-physical tails (which have very low probability anyway), the distribution of human heights deviates in systematic ways from a normal distribution.

Think about learning Bayes using Python

An interview with Allen Downey, the author of Think Bayes

Allen Downey

When Mike first discussed Allen Downey’s Think Bayes book project with me, I remember nodding a lot. As the data editor, I spend a lot of time thinking about the different people within our Strata audience and how we can provide what I refer to “bridge resources”. We need to know and understand the environments that our users are the most comfortable in and provide them with the appropriate bridges in order to learn a new technique, language, tool, or …even math.  I’ve also been very clear that almost everyone will need to improve their math skills should they decide to pursue a career in data science. So when Mike mentioned that Allen’s approach was to teach math not using math…but using Python, I immediately indicated my support for the project. Once the book was written, I contacted Allen about an interview and he graciously took some time away from the start of the semester to answer a few questions about his approach, teaching, and writing.

How did the “Think” series come about? What led you to start the series?

Allen Downey: A lot of it comes from my experience teaching at Olin College. All of our students take a basic programming class in the first semester, and I discovered that I could use their programming skills as a pedagogic wedge. What I mean is if you know how to program, you can use that skill to learn everything else.

I started with Think Stats because statistics is an area that has really suffered from the mathematical approach. At a lot of colleges, students take a mathematical statistics class that really doesn’t prepare them to work with real data. By taking a computational approach I was able to explain things more clearly (at least I think so). And more importantly, the computational approach lets students dive in and work with real data right away.

At this point there are four books in the series and I’m working on the fifth. Think Python covers Python programming–it’s the prerequisite for all the other books. But once you’ve got basic Python skills, you can read the others in any order.

How signals, geometry, and topology are influencing data science

Areas concerned with shapes, invariants, and dynamics, in high-dimensions, are proving useful in data analysis

I’ve been noticing unlikely areas of mathematics pop-up in data analysis. While signal processing is a natural fit, topology, differential and algebraic geometry aren’t exactly areas you associate with data science. But upon further reflection perhaps it shouldn’t be so surprising that areas that deal in shapes, invariants, and dynamics, in high-dimensions, would have something to contribute to the analysis of large data sets. Without further ado, here are a few examples that stood out for me. (If you know of other examples of recent applications of math in data analysis, please share them in the comments.)

Compressed Sensing
Compressed sensing is a signal processing technique which makes efficient data collection possible. As an example using compressed sensing images can be reconstructed from small amounts of data. Idealized Sampling is used to collect information to measure the most important components. By vastly decreasing the number of measurements to be collected, less data needs to stored, and one reduces the amount of time and energy1 needed to collect signals. Already there have been applications in medical imaging and mobile phones.

The problem is you don’t know ahead of time which signals/components are important. A series of numerical experiments led Emanuel Candes to believe that random samples may be the answer. The theoretical foundation as to why a random set of signals would work, where laid down in a series of papers by Candes and Fields Medalist Terence Tao2.

Four short links: 24 May 2013

1. UbiquitySears Holdings has formed a new unit to market space from former Sears and Kmart retail stores as a home for data centers, disaster recovery space and wireless towers.
2. Google Abandons Open Standards for Instant Messaging (EFF) — it has to be a sign of the value to users of open standards that small companies embrace them and large companies reject them.
3. How Does Copyright Work in Space? (The Economist) — amazingly complex rights trail for the International Space Station-recorded cover of “Space Oddity”. Sample: Commander Hadfield and his son Evan spent several months hammering out details with Mr Bowie’s representatives, and with NASA, Russia’s space agency ROSCOSMOS and the CSA. That’s the SIMPLE HAPPY ENDING.
4. Great Lessons: Evan Weinberg’s “Do You Know Blue?” (Dan Meyer) — It’s a bridge from math to computer science. Students get a chance to write algorithms in a language understood by both mathematicians and the computer scientists. It’s analogous to the Netflix Prize for grown-up computer scientists.

Four short links: 13 May 2013

Exploiting Glass, Teaching Probability, Product Design, and Subgraph Matching

1. Exploiting a Bug in Google Glass — unbelievably detailed and yet easy-to-follow explanation of how the bug works, how the author found it, and how you can exploit it too. The second guide was slightly more technical, so when he returned a little later I asked him about the Debug Mode option. The reaction was interesting: he kind of looked at me, somewhat confused, and asked “wait, what version of the software does it report in Settings”? When I told him “XE4″ he clarified “XE4, not XE3″, which I verified. He had thought this feature had been removed from the production units.
2. Probability Through Problems — motivating problems to hook students on probability questions, structured to cover high-school probability material.
3. Connbox — love the section “The importance of legible products” where the physical UI interacts seamless with the digital device … it’s glorious. Three amazing videos.
4. The Index-Based Subgraph Matching Algorithm (ISMA): Fast Subgraph Enumeration in Large Networks Using Optimized Search Trees (PLoSONE) — The central question in all these fields is to understand behavior at the level of the whole system from the topology of interactions between its individual constituents. In this respect, the existence of network motifs, small subgraph patterns which occur more often in a network than expected by chance, has turned out to be one of the defining properties of real-world complex networks, in particular biological networks. […] An implementation of ISMA in Java is freely available.

Four short links: 25 March 2013

Analytics vs Learning, Reproducible Science, Ramping up Military Internet Attacks, and Compressed Sensing

1. Analytics for LearningSince doing good learning analytics is hard, we often do easy learning analytics and pretend that they are good instead. But pretending doesn’t make it so. (via Dan Meyer)
2. Reproducible Research — a list of links to related work about reproducible research, reproducible research papers, etc. (via Stijn Debrouwere)
3. Pentagon Deploying 100+ Cyber TeamsThe organization defending military networks — cyber protection forces — will comprise more than 60 teams, a Pentagon official said. The other two organizations — combat mission forces and national mission forces — will conduct offensive operations. I’ll repeat that: offensive operations.
4. Towards Deterministic Compressed Sensing (PDF) — instead of taking lots of data, compressing by throwing some away, can we only take a few samples and reconstruct the original from that? (more mathematically sound than my handwaving explanation). See also Compressed sensing and big data from the Practical Quant. (via Ben Lorica)

Four short links: 4 March 2013

Inside the Aaron Swartz Investigation, Multivariate Dataset Exploration, Augmediated Life, and Public Experience

1. Life Inside the Aaron Swartz Investigationdo hard things and risk failure. What else are we on this earth for?
2. crossfilter — open source (Apache 2) JavaScript library for exploring large multivariate datasets in the browser. Crossfilter supports extremely fast (<30ms) interaction with coordinated views, even with datasets containing a million or more records.
3. Steve Mann: My Augmediated Life (IEEE) — Until recently, most people tended to regard me and my work with mild curiosity and bemusement. Nobody really thought much about what this technology might mean for society at large. But increasingly, smartphone owners are using various sorts of augmented-reality apps. And just about all mobile-phone users have helped to make video and audio recording capabilities pervasive. Our laws and culture haven’t even caught up with that. Imagine if hundreds of thousands, maybe millions, of people had video cameras constantly poised on their heads. If that happens, my experiences should take on new relevance.
4. The Google Glass Feature No-One Is Talking AboutThe most important Google Glass experience is not the user experience – it’s the experience of everyone else. The experience of being a citizen, in public, is about to change.

Design matters more than math

Design compels. Math is proof. Both sides will defend their domains at Strata's next Great Debate.

At Strata Santa Clara later this month, we’re reprising what has become a tradition: Great Debates. These Oxford-style debates pit two teams against one another to argue a hot topic in the fields of big data, ubiquitous computing, and emerging interfaces.

Part of the fun is the scoring: attendees vote on whether they agree with the proposal before the debaters; and after both sides have said their piece, the audience votes again. Whoever moves the needle wins.

This year’s proposition — that design matters more than math — is sure to inspire some vigorous discussion. The argument for math is pretty strong. Math is proof. Given enough data — and today, we have plenty — we can know. “The right information in the right place just changes your life,” said Stewart Brand. Properly harnessed, the power of data analysis and modeling can fix cities, predict epidemics, and revitalize education. Abused, it can invade our lives, undermine economies, and steal elections. Surely the algorithms of big data matter!

But your life won’t change by itself. Bruce Mau defines design as “the human capacity to plan and produce desired outcomes.” Math informs; design compels. Without design, math can’t do its thing. Poorly designed experiments collect the wrong data. And if the data can’t be understood and acted upon, it may as well not have been crunched in the first place.

This is the question we’ll be putting to our debaters: Which matters more? A well-designed collection of flawed information — or an opaque, hard-to-parse, but unerringly accurate model? From mobile handsets to social policy, we need both good math and good design. Which is more critical? Read more…

Four short links: 7 February 2013

SCADA 0-Day, Complexity Course, ToS Tracking, and Custom Manufacturing Prostheses

1. Tridium Niagara (Wired) — A critical vulnerability discovered in an industrial control system used widely by the military, hospitals and others would allow attackers to remotely control electronic door locks, lighting systems, elevators, electricity and boiler systems, video surveillance cameras, alarms and other critical building facilities, say two security researchers. cf the SANS SCADA conference.
2. Santa Fe Institute Course: Introduction to Complexity — 11 week course on understanding complex systems: dynamics, chaos, fractals, information theory, self-organization, agent-based modeling, and networks. (via BoingBoing)
4. 3D Printing a Replacement Hand for a 5 Year Old Boy (Ars Technica) — the designs are on Thingiverse. For more, see their blog.