- What Every Computer Scientist Should Know About Floating Point Arithmetic — in short, “it will hurt you.”
- Ori — a distributed file system built for offline operation and empowers the user with control over synchronization operations and conflict resolution. We provide history through light weight snapshots and allow users to verify the history has not been tampered with. Through the use of replication instances can be resilient and recover damaged data from other nodes.
- RoboEarth — a Cloud Robotics infrastructure, which includes everything needed to close the loop from robot to the cloud and back to the robot. RoboEarth’s World-Wide-Web style database stores knowledge generated by humans – and robots – in a machine-readable format. Data stored in the RoboEarth knowledge base include software components, maps for navigation (e.g., object locations, world models), task knowledge (e.g., action recipes, manipulation strategies), and object recognition models (e.g., images, object models).
- Mother — domestic sensors and an app with an appallingly presumptuous name. (Also, wasn’t “Mother” the name of the ship computer in Alien?) (via BoingBoing)
Lessons from the design community for developing data-driven applications
When you hear someone say, “that is a nice infographic” or “check out this sweet dashboard,” many people infer that they are “well-designed.” Creating accessible (or for the cynical, “pretty”) content is only part of what makes good design powerful. The design process is geared toward solving specific problems. This process has been formalized in many ways (e.g., IDEO’s Human Centered Design, Marc Hassenzahl’s User Experience Design, or Braden Kowitz’s Story-Centered Design), but the basic idea is that you have to explore the breadth of the possible before you can isolate truly innovative ideas. We, at Datascope Analytics, argue that the same is true of designing effective data science tools, dashboards, engines, etc — in order to design effective dashboards, you must know what is possible.
Zombie Drones, Algebra Through Code, Data Toolkit, and Crowdsourcing Antibiotic Discovery
- Skyjack — drone that takes over other drones. Welcome to the Malware of Things.
- Bootstrap World — a curricular module for students ages 12-16, which teaches algebraic and geometric concepts through computer programming. (via Esther Wojicki)
- Harvest — open source BSD-licensed toolkit for building web applications for integrating, discovering, and reporting data. Designed for biomedical data first. (via Mozilla Science Lab)
- Project ILIAD — crowdsourced antibiotic discovery.
Coding for Unreliability, AirBnB JS Style, Category Theory, and Text Processing
- Quantitative Reliability of Programs That Execute on Unreliable Hardware (MIT) — As MIT’s press release put it: Rely simply steps through the intermediate representation, folding the probability that each instruction will yield the right answer into an estimation of the overall variability of the program’s output. (via Pete Warden)
- Category Theory for Scientists (MIT Courseware) — Scooby snacks for rationalists.
- Textblob — Python open source text processing library with sentiment analysis, PoS tagging, term extraction, and more.
New Math, Business Math, Summarising Text, Clipping Images
- Scientific Data Has Become So Complex, We Have to Invent New Math to Deal With It (Jennifer Ouellette) — Yale University mathematician Ronald Coifman says that what is really needed is the big data equivalent of a Newtonian revolution, on par with the 17th century invention of calculus, which he believes is already underway.
- Is Google Jumping the Shark? (Seth Godin) — Public companies almost inevitably seek to grow profits faster than expected, which means beyond the organic growth that comes from doing what made them great in the first place. In order to gain that profit, it’s typical to hire people and reward them for measuring and increasing profits, even at the expense of what the company originally set out to do. Eloquent redux.
- textteaser — open source text summarisation algorithm.
- Clipping Magic — Instantly create masks, cutouts, and clipping paths online.
One of the chapters of Think Bayes is based on a class project two of my students worked on last semester. It presents “The Red Line Problem,” which is the problem of predicting the time until the next train arrives, based on the number of passengers on the platform.
Here’s the introduction:
In Boston, the Red Line is a subway that runs between Cambridge and Boston. When I was working in Cambridge I took the Red Line from Kendall Square to South Station and caught the commuter rail to Needham. During rush hour Red Line trains run every 7–8 minutes, on average.
When I arrived at the station, I could estimate the time until the next train based on the number of passengers on the platform. If there were only a few people, I inferred that I just missed a train and expected to wait about 7 minutes. If there were more passengers, I expected the train to arrive sooner. But if there were a large number of passengers, I suspected that trains were not running on schedule, so I would go back to the street level and get a taxi.
While I was waiting for trains, I thought about how Bayesian estimation could help predict my wait time and decide when I should give up and take a taxi. This chapter presents the analysis I came up with.
Sadly, this problem has been overtaken by history: the Red Line now provides real-time estimates for the arrival of the next train. But I think the analysis is interesting, and still applies for subway systems that don’t provide estimates.
One of the frequently-asked questions over at the statistics subreddit (reddit.com/r/statistics) is how to test whether a dataset is drawn from a particular distribution, most often the normal distribution.
There are standard tests for this sort of thing, many with double-barreled names like Anderson-Darling, Kolmogorov-Smirnov, Shapiro-Wilk, Ryan-Joiner, etc.
But these tests are almost never what you really want. When people ask these questions, what they really want to know (most of the time) is whether a particular distribution is a good model for a dataset. And that’s not a statistical test; it is a modeling decision.
All statistical analysis is based on models, and all models are based on simplifications. Models are only useful if they are simpler than the real world, which means you have to decide which aspects of the real world to include in the model, and which things you can leave out.
For example, the normal distribution is a good model for many physical quantities. The distribution of human height is approximately normal (see this previous blog post). But human heights are not normally distributed. For one thing, human heights are bounded within a narrow range, and the normal distribution goes to infinity in both directions. But even ignoring the non-physical tails (which have very low probability anyway), the distribution of human heights deviates in systematic ways from a normal distribution.
An interview with Allen Downey, the author of Think Bayes
When Mike first discussed Allen Downey’s Think Bayes book project with me, I remember nodding a lot. As the data editor, I spend a lot of time thinking about the different people within our Strata audience and how we can provide what I refer to “bridge resources”. We need to know and understand the environments that our users are the most comfortable in and provide them with the appropriate bridges in order to learn a new technique, language, tool, or …even math. I’ve also been very clear that almost everyone will need to improve their math skills should they decide to pursue a career in data science. So when Mike mentioned that Allen’s approach was to teach math not using math…but using Python, I immediately indicated my support for the project. Once the book was written, I contacted Allen about an interview and he graciously took some time away from the start of the semester to answer a few questions about his approach, teaching, and writing.
How did the “Think” series come about? What led you to start the series?
Allen Downey: A lot of it comes from my experience teaching at Olin College. All of our students take a basic programming class in the first semester, and I discovered that I could use their programming skills as a pedagogic wedge. What I mean is if you know how to program, you can use that skill to learn everything else.
I started with Think Stats because statistics is an area that has really suffered from the mathematical approach. At a lot of colleges, students take a mathematical statistics class that really doesn’t prepare them to work with real data. By taking a computational approach I was able to explain things more clearly (at least I think so). And more importantly, the computational approach lets students dive in and work with real data right away.
At this point there are four books in the series and I’m working on the fifth. Think Python covers Python programming–it’s the prerequisite for all the other books. But once you’ve got basic Python skills, you can read the others in any order.