- The Uncanny Valley of Speech Recognition (Zach Holman) — I’m reminded of driving up US-280 in 2003 or so with @raelity, a Kiwi and a South African trying every permutation of American accent from Kentucky to Yosemite Sam in order to get TellMe to stop giving us the weather for zipcode 10000. It didn’t recognise the swearing either. (Caution: features similarly strong language.)
- TuPAQ: An Efficient Planner for Large-scale Predictive Analytic Queries (PDF) — an integrated PAQ [Predictive Analytic Queries] planning architecture that combines advanced model search techniques, bandit resource allocation via runtime algorithm introspection, and physical optimization via batching. The resulting system, TUPAQ, solves the PAQ planning problem with comparable accuracy to exhaustive strategies but an order of magnitude faster, and can scale to models trained on terabytes of data across hundreds of machines.
- p2pvc — point-to-point video chat. In an 80×25 terminal window.
- Sortable — nifty UI library.
Liza Kindred on the evolving role of data in fashion and the growing relationship between tech and fashion companies.
In this podcast episode, I talk with Liza Kindred, founder of Third Wave Fashion and author of the new free report “Fashioning Data: How fashion industry leaders innovate with data and what you can learn from what they know.” Kindred addresses the evolving role data and analytics are playing in the fashion industry, and the emerging connections between technology and fashion companies. “One of the things that fashion is doing better than maybe any other industry,” Kindred says, “is facilitating conversations with users.”
Gathering and analyzing user data creates opportunities for the fashion and tech industries alike. One example of this is the trend toward customization. Read more…
A practical example of how anomaly detection makes complex data problems easier to solve.
As new tools for distributed storage and analysis of big data are becoming more stable and widely known, there is a growing need for discovering best practices for analytics at this scale. One of the areas of widespread interest that crosses many verticals is anomaly detection.
At its best, anomaly detection is used to find unusual, rarely occurring events or data for which little is known in advance. Examples include changes in sensor data reported for a variety of parameters, suspicious behavior on secure websites, or unexpected changes in web traffic. In some cases, the data patterns being examined are simple and regular and, thus, fairly easy to model.
Anomaly detection approaches start with some essential but sometimes overlooked ideas about anomalies:
- Anomalies are defined not by their own characteristics but in contrast to what is normal.
- Before you can spot an anomaly, you first have to figure out what “normal” actually is.
This need to first discover what is considered “normal” may seem obvious, but it is not always obvious how to do it, especially in situations with complicated patterns of behavior. Best results are achieved when you use statistical methods to build an adaptive model of events in the system you are analyzing as a first step toward discovering anomalous behavior. Read more…
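As a rough illustration of the "model normal first" idea, here is a minimal Python sketch. It assumes a single numeric stream and uses an exponentially weighted running mean and variance as the adaptive model; this is one simple statistical choice and not necessarily the technique the full article describes. The class name `AdaptiveBaseline` and its parameters (`alpha`, `threshold`, `warmup`) are hypothetical, chosen only for this example.

```python
import math


class AdaptiveBaseline:
    """Hypothetical helper: learns what "normal" looks like for one numeric
    stream, then flags values that fall far outside it."""

    def __init__(self, alpha=0.1, threshold=3.0, warmup=10):
        self.alpha = alpha          # how quickly the baseline adapts (0..1)
        self.threshold = threshold  # allowed deviation, in standard deviations
        self.warmup = warmup        # observations to absorb before flagging
        self.mean = None
        self.var = 0.0
        self.count = 0

    def update(self, x):
        """Return True if x looks anomalous against the baseline so far."""
        if self.mean is None:
            # The first observation only seeds the model.
            self.mean = float(x)
            self.count = 1
            return False

        deviation = x - self.mean
        std = math.sqrt(self.var)
        is_anomaly = (
            self.count >= self.warmup
            and std > 0
            and abs(deviation) > self.threshold * std
        )

        if not is_anomaly:
            # Assumption: only normal-looking points update the baseline, so a
            # single spike does not drag the model toward itself. The cost is
            # that a genuine, lasting shift in level is never absorbed.
            self.mean += self.alpha * deviation
            self.var = (1 - self.alpha) * (self.var + self.alpha * deviation ** 2)

        self.count += 1
        return is_anomaly


if __name__ == "__main__":
    detector = AdaptiveBaseline()
    # Made-up sensor readings: a regular pattern followed by one spike.
    readings = [10 + 0.1 * ((i % 5) - 2) for i in range(30)] + [25.0, 10.0]
    for value in readings:
        if detector.update(value):
            print("anomaly:", value)
```

The detector never defines what an anomaly looks like. It only tracks what is normal and reports departures from it, which mirrors the two points in the list above. Real systems with complicated patterns of behavior would need a richer model than a single mean and variance.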
Collecting actionable data is a challenge for today’s data tools.
One of the problems dragging down the US health care system is that nobody trusts one another. Most of us, as individuals, place faith in our personal health care providers, which may or may not be warranted. But on a larger scale we’re all suspicious of each other:
- Doctors don’t trust patients, who aren’t forthcoming with all the bad habits they indulge in and often fail to follow the most basic instructions, such as to take their medications.
- The payers–which include insurers, many government agencies, and increasingly the whole patient population as our deductibles and other out-of-pocket expenses ascend–don’t trust the doctors, who waste an estimated 20% or more of all health expenditures, including thirty billion dollars or more in fraud each year.
- The public distrusts the pharmaceutical companies (although we still follow the advice in their advertisements and ask our doctors for the latest pill) and is starting to distrust clinical researchers as we hear about conflicts of interest and difficulties replicating results.
- Nobody trusts the federal government, which pursues two (contradictory) goals of lowering health care costs and stimulating employment.
Yet everyone has beneficent goals and good ideas for improving health care. Doctors want to feel effective, patients want to stay well (even if that desire doesn’t always translate into action), the Department of Health and Human Services champions very lofty goals for data exchange and quality improvement, clinical researchers put their work above family and comfort, and even private insurance companies are moving to “fee for value” programs that ensure coordinated patient care.
- SAMOA — Yahoo!’s distributed streaming machine learning (ML) framework that contains a programming abstraction for distributed streaming ML algorithms. (via Introducing SAMOA)
- madlib — an open-source library for scalable in-database analytics. It provides data-parallel implementations of mathematical, statistical and machine-learning methods for structured and unstructured data.
- Data Portraits: Connecting People of Opposing Views — Yahoo! Labs research to break the filter bubble. Connect people who disagree on issue X (e.g., abortion) but who agree on issue Y (e.g., Latin American interventionism), and present the differences and similarities visually (they used wordclouds). Our results suggest that organic visualisation may revert the negative effects of providing potentially sensitive content. (via MIT Technology Review)
- Disguise Detection — using Raspberry Pi, Arduino, and Python.