- Scientific Data Has Become So Complex, We Have to Invent New Math to Deal With It (Jennifer Ouellette) — Yale University mathematician Ronald Coifman says that what is really needed is the big data equivalent of a Newtonian revolution, on par with the 17th century invention of calculus, which he believes is already underway.
- Is Google Jumping the Shark? (Seth Godin) — Public companies almost inevitably seek to grow profits faster than expected, which means beyond the organic growth that comes from doing what made them great in the first place. In order to gain that profit, it’s typical to hire people and reward them for measuring and increasing profits, even at the expense of what the company originally set out to do. Eloquent redux.
- textteaser — open source text summarisation algorithm.
- Clipping Magic — Instantly create masks, cutouts, and clipping paths online.
New Math, Business Math, Summarising Text, Clipping Images
One of the chapters of Think Bayes is based on a class project two of my students worked on last semester. It presents “The Red Line Problem,” which is the problem of predicting the time until the next train arrives, based on the number of passengers on the platform.
Here’s the introduction:
In Boston, the Red Line is a subway that runs between Cambridge and Boston. When I was working in Cambridge I took the Red Line from Kendall Square to South Station and caught the commuter rail to Needham. During rush hour Red Line trains run every 7–8 minutes, on average.
When I arrived at the station, I could estimate the time until the next train based on the number of passengers on the platform. If there were only a few people, I inferred that I just missed a train and expected to wait about 7 minutes. If there were more passengers, I expected the train to arrive sooner. But if there were a large number of passengers, I suspected that trains were not running on schedule, so I would go back to the street level and get a taxi.
While I was waiting for trains, I thought about how Bayesian estimation could help predict my wait time and decide when I should give up and take a taxi. This chapter presents the analysis I came up with.
Sadly, this problem has been overtaken by history: the Red Line now provides real-time estimates for the arrival of the next train. But I think the analysis is interesting, and still applies for subway systems that don’t provide estimates.
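The inference described in the introduction can be sketched as a small grid-based Bayesian update. This is a simplified illustration, not the analysis from the chapter: it assumes trains arrive exactly every `gap` minutes and that passengers reach the platform at a steady `rate` per minute. Both values, and the function names, are made up for the example.

```python
import math


def poisson_pmf(k, mean):
    """Probability of exactly k arrivals when `mean` arrivals are expected."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


def expected_wait(passengers, gap=8.0, rate=2.0, steps=800):
    """Posterior mean wait in minutes, given the passenger count on the platform.

    gap  -- assumed fixed time between trains, in minutes (hypothetical value)
    rate -- assumed passenger arrivals per minute (hypothetical value)
    """
    # Candidate values for "minutes since the last train" (grid midpoints).
    elapsed = [(i + 0.5) * gap / steps for i in range(steps)]
    # The prior over elapsed time is uniform, so the posterior weight of each
    # candidate is proportional to the Poisson likelihood of the observed count.
    weights = [poisson_pmf(passengers, rate * x) for x in elapsed]
    total = sum(weights)
    mean_elapsed = sum(w * x for w, x in zip(weights, elapsed)) / total
    # With a fixed gap, the remaining wait is the gap minus the elapsed time.
    return gap - mean_elapsed


for count in (0, 5, 15):
    print(count, "passengers ->", round(expected_wait(count), 2), "minutes")
```

Under these assumptions the expected wait shrinks as the platform fills up. Because the gap is fixed, the sketch cannot represent the late-train case that prompts the taxi decision; that would need a distribution over gaps rather than a single value.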
One of the frequently-asked questions over at the statistics subreddit (reddit.com/r/statistics) is how to test whether a dataset is drawn from a particular distribution, most often the normal distribution.
There are standard tests for this sort of thing, many with double-barreled names like Anderson-Darling, Kolmogorov-Smirnov, Shapiro-Wilk, Ryan-Joiner, etc.
But these tests are almost never what you really want. When people ask these questions, what they really want to know (most of the time) is whether a particular distribution is a good model for a dataset. And that’s not a statistical test; it is a modeling decision.
All statistical analysis is based on models, and all models are based on simplifications. Models are only useful if they are simpler than the real world, which means you have to decide which aspects of the real world to include in the model, and which things you can leave out.
For example, the normal distribution is a good model for many physical quantities. The distribution of human height is approximately normal (see this previous blog post). But human heights are not normally distributed. For one thing, human heights are bounded within a narrow range, and the normal distribution goes to infinity in both directions. But even ignoring the non-physical tails (which have very low probability anyway), the distribution of human heights deviates in systematic ways from a normal distribution.
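One way to treat this as a modeling question is to measure how far the data's empirical CDF strays from the CDF of the fitted normal model, and then judge whether that discrepancy matters for the purpose at hand. The sketch below is an illustration under assumptions: `max_cdf_gap` is a hypothetical helper, and both samples are synthetic, generated only to show the contrast. The number it returns is the same distance the Kolmogorov-Smirnov test is built on, but here it is read as a measure of misfit rather than turned into a p-value.

```python
import math
from statistics import NormalDist, fmean, stdev


def max_cdf_gap(sample):
    """Largest vertical distance between the sample's empirical CDF and the
    CDF of a normal distribution with the same mean and standard deviation."""
    xs = sorted(sample)
    n = len(xs)
    model = NormalDist(fmean(xs), stdev(xs))
    gap = 0.0
    for i, x in enumerate(xs):
        p = model.cdf(x)
        # The empirical CDF steps from i/n up to (i+1)/n at x; check both sides.
        gap = max(gap, abs(p - i / n), abs((i + 1) / n - p))
    return gap


# Synthetic, illustrative samples (not real measurements).
ideal = NormalDist(170, 7)
bell = [ideal.inv_cdf((i + 0.5) / 200) for i in range(200)]  # evenly spaced normal quantiles
skewed = [math.exp(i / 50) for i in range(200)]              # strongly right-skewed values

print("bell-shaped sample gap:", round(max_cdf_gap(bell), 4))
print("skewed sample gap:", round(max_cdf_gap(skewed), 4))
```

A small gap says the normal model tracks the data closely across its range; whether it is close enough still depends on what the model will be used for.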
An interview with Allen Downey, the author of Think Bayes
When Mike first discussed Allen Downey’s Think Bayes book project with me, I remember nodding a lot. As the data editor, I spend a lot of time thinking about the different people within our Strata audience and how we can provide what I refer to as “bridge resources”. We need to know and understand the environments that our users are the most comfortable in and provide them with the appropriate bridges in order to learn a new technique, language, tool, or …even math. I’ve also been very clear that almost everyone will need to improve their math skills should they decide to pursue a career in data science. So when Mike mentioned that Allen’s approach was to teach math not using math…but using Python, I immediately indicated my support for the project. Once the book was written, I contacted Allen about an interview and he graciously took some time away from the start of the semester to answer a few questions about his approach, teaching, and writing.
How did the “Think” series come about? What led you to start the series?
Allen Downey: A lot of it comes from my experience teaching at Olin College. All of our students take a basic programming class in the first semester, and I discovered that I could use their programming skills as a pedagogic wedge. What I mean is if you know how to program, you can use that skill to learn everything else.
I started with Think Stats because statistics is an area that has really suffered from the mathematical approach. At a lot of colleges, students take a mathematical statistics class that really doesn’t prepare them to work with real data. By taking a computational approach I was able to explain things more clearly (at least I think so). And more importantly, the computational approach lets students dive in and work with real data right away.
At this point there are four books in the series and I’m working on the fifth. Think Python covers Python programming–it’s the prerequisite for all the other books. But once you’ve got basic Python skills, you can read the others in any order.
Areas concerned with shapes, invariants, and dynamics, in high-dimensions, are proving useful in data analysis
I’ve been noticing unlikely areas of mathematics pop up in data analysis. While signal processing is a natural fit, topology and differential and algebraic geometry aren’t exactly areas you associate with data science. But upon further reflection perhaps it shouldn’t be so surprising that areas that deal in shapes, invariants, and dynamics, in high dimensions, would have something to contribute to the analysis of large data sets. Without further ado, here are a few examples that stood out for me. (If you know of other examples of recent applications of math in data analysis, please share them in the comments.)
Compressed sensing is a signal processing technique that makes efficient data collection possible. For example, using compressed sensing, images can be reconstructed from small amounts of data. Idealized sampling is used to collect just the information needed to measure the most important components. By vastly decreasing the number of measurements to be collected, less data needs to be stored, and one reduces the amount of time and energy needed to collect signals. Already there have been applications in medical imaging and mobile phones.
The problem is you don’t know ahead of time which signals/components are important. A series of numerical experiments led Emanuel Candes to believe that random samples may be the answer. The theoretical foundations for why a random set of signals would work were laid down in a series of papers by Candes and Fields Medalist Terence Tao.
Repurposing Dead Retail Space, Open Standards, Space Copyright, and Bridging Lessons
- Ubiquity — Sears Holdings has formed a new unit to market space from former Sears and Kmart retail stores as a home for data centers, disaster recovery space and wireless towers.
- Google Abandons Open Standards for Instant Messaging (EFF) — it has to be a sign of the value to users of open standards that small companies embrace them and large companies reject them.
- How Does Copyright Work in Space? (The Economist) — amazingly complex rights trail for the International Space Station-recorded cover of “Space Oddity”. Sample: Commander Hadfield and his son Evan spent several months hammering out details with Mr Bowie’s representatives, and with NASA, Russia’s space agency ROSCOSMOS and the CSA. That’s the SIMPLE HAPPY ENDING.
- Great Lessons: Evan Weinberg’s “Do You Know Blue?” (Dan Meyer) — It’s a bridge from math to computer science. Students get a chance to write algorithms in a language understood by both mathematicians and the computer scientists. It’s analogous to the Netflix Prize for grown-up computer scientists.
Exploiting Glass, Teaching Probability, Product Design, and Subgraph Matching
- Exploiting a Bug in Google Glass — unbelievably detailed and yet easy-to-follow explanation of how the bug works, how the author found it, and how you can exploit it too. The second guide was slightly more technical, so when he returned a little later I asked him about the Debug Mode option. The reaction was interesting: he kind of looked at me, somewhat confused, and asked “wait, what version of the software does it report in Settings”? When I told him “XE4” he clarified “XE4, not XE3”, which I verified. He had thought this feature had been removed from the production units.
- Probability Through Problems — motivating problems to hook students on probability questions, structured to cover high-school probability material.
- Connbox — love the section “The importance of legible products” where the physical UI interacts seamlessly with the digital device … it’s glorious. Three amazing videos.
- The Index-Based Subgraph Matching Algorithm (ISMA): Fast Subgraph Enumeration in Large Networks Using Optimized Search Trees (PLoSONE) — The central question in all these fields is to understand behavior at the level of the whole system from the topology of interactions between its individual constituents. In this respect, the existence of network motifs, small subgraph patterns which occur more often in a network than expected by chance, has turned out to be one of the defining properties of real-world complex networks, in particular biological networks. […] An implementation of ISMA in Java is freely available.
Analytics vs Learning, Reproducible Science, Ramping up Military Internet Attacks, and Compressed Sensing
- Analytics for Learning — Since doing good learning analytics is hard, we often do easy learning analytics and pretend that they are good instead. But pretending doesn’t make it so. (via Dan Meyer)
- Reproducible Research — a list of links to related work about reproducible research, reproducible research papers, etc. (via Stijn Debrouwere)
- Pentagon Deploying 100+ Cyber Teams — The organization defending military networks — cyber protection forces — will comprise more than 60 teams, a Pentagon official said. The other two organizations — combat mission forces and national mission forces — will conduct offensive operations. I’ll repeat that: offensive operations.
- Towards Deterministic Compressed Sensing (PDF) — instead of taking lots of data, compressing by throwing some away, can we only take a few samples and reconstruct the original from that? (more mathematically sound than my handwaving explanation). See also Compressed sensing and big data from the Practical Quant. (via Ben Lorica)
Inside the Aaron Swartz Investigation, Multivariate Dataset Exploration, Augmediated Life, and Public Experience
- Life Inside the Aaron Swartz Investigation — do hard things and risk failure. What else are we on this earth for?
- Steve Mann: My Augmediated Life (IEEE) — Until recently, most people tended to regard me and my work with mild curiosity and bemusement. Nobody really thought much about what this technology might mean for society at large. But increasingly, smartphone owners are using various sorts of augmented-reality apps. And just about all mobile-phone users have helped to make video and audio recording capabilities pervasive. Our laws and culture haven’t even caught up with that. Imagine if hundreds of thousands, maybe millions, of people had video cameras constantly poised on their heads. If that happens, my experiences should take on new relevance.
- The Google Glass Feature No-One Is Talking About — The most important Google Glass experience is not the user experience – it’s the experience of everyone else. The experience of being a citizen, in public, is about to change.
Design compels. Math is proof. Both sides will defend their domains at Strata's next Great Debate.
At Strata Santa Clara later this month, we’re reprising what has become a tradition: Great Debates. These Oxford-style debates pit two teams against one another to argue a hot topic in the fields of big data, ubiquitous computing, and emerging interfaces.
Part of the fun is the scoring: attendees vote on whether they agree with the proposal before the debaters speak; and after both sides have said their piece, the audience votes again. Whoever moves the needle wins.
This year’s proposition — that design matters more than math — is sure to inspire some vigorous discussion. The argument for math is pretty strong. Math is proof. Given enough data — and today, we have plenty — we can know. “The right information in the right place just changes your life,” said Stewart Brand. Properly harnessed, the power of data analysis and modeling can fix cities, predict epidemics, and revitalize education. Abused, it can invade our lives, undermine economies, and steal elections. Surely the algorithms of big data matter!
But your life won’t change by itself. Bruce Mau defines design as “the human capacity to plan and produce desired outcomes.” Math informs; design compels. Without design, math can’t do its thing. Poorly designed experiments collect the wrong data. And if the data can’t be understood and acted upon, it may as well not have been crunched in the first place.
This is the question we’ll be putting to our debaters: Which matters more? A well-designed collection of flawed information — or an opaque, hard-to-parse, but unerringly accurate model? From mobile handsets to social policy, we need both good math and good design. Which is more critical?