- US Providers Must Divulge from Offshore Servers (Gigaom) — A U.S. magistrate judge ruled that U.S. cloud vendors must fork over customer data even if that data resides in data centers outside the country. (via Alistair Croll)
- Inside Google’s Self-Driving Car (Atlantic Cities) — Urmson says the value of maps is one of the key insights that emerged from the DARPA challenges. They give the car a baseline expectation of its environment; they’re the difference between the car opening its eyes in a completely new place and having some prior idea what’s going on around it. This is a long and interesting piece on the experience and the creator’s concerns around the self-driving cars. Still looking for the comprehensive piece on the subject.
- Recent Robotics-Related IPOs — not all the exits are to Google.
- How One Woman Hid Her Pregnancy From Big Data (Mashable) — “I really couldn’t have done it without Tor, because Tor was really the only way to manage totally untraceable browsing. I know it’s gotten a bad reputation for Bitcoin trading and buying drugs online, but I used it for BabyCenter.com.”
A practical example of how anomaly detection makes complex data problems easier to solve.
As new tools for distributed storage and analysis of big data are becoming more stable and widely known, there is a growing need for discovering best practices for analytics at this scale. One of the areas of widespread interest that crosses many verticals is anomaly detection.
At its best, anomaly detection is used to find unusual, rarely occurring events or data for which little is known in advance. Examples include changes in sensor data reported for a variety of parameters, suspicious behavior on secure websites, or unexpected changes in web traffic. In some cases, the data patterns being examined are simple and regular and, thus, fairly easy to model.
Anomaly detection approaches start with some essential but sometimes overlooked ideas about anomalies:
- Anomalies are defined not by their own characteristics but in contrast to what is normal.
- Before you can spot an anomaly, you first have to figure out what “normal” actually is.
This need to first discover what is considered “normal” may seem obvious, but it is not always obvious how to do it, especially in situations with complicated patterns of behavior. Best results are achieved when you use statistical methods to build an adaptive model of events in the system you are analyzing as a first step toward discovering anomalous behavior.
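As a rough illustration of "model normal first, then look for deviations," here is a minimal Python sketch. It is not the method from the post: it assumes a single numeric stream and swaps in a simple exponentially weighted mean and variance as the adaptive model. The class name, the `alpha`, `threshold`, and `warmup` parameters, and the sample readings are all hypothetical choices made for the example.

```python
import math


class EwmaAnomalyDetector:
    """Flags values far from an adaptive (exponentially weighted) estimate of normal.

    Illustrative sketch only: alpha, threshold, and warmup are assumed values.
    """

    def __init__(self, alpha=0.1, threshold=3.0, warmup=10):
        self.alpha = alpha          # how quickly the model adapts to new data
        self.threshold = threshold  # deviations (in std units) that count as anomalous
        self.warmup = warmup        # observations to see before flagging anything
        self.mean = None
        self.var = 0.0
        self.count = 0

    def update(self, x):
        """Return True if x looks anomalous; otherwise fold x into the model."""
        if self.mean is None:
            # The first observation seeds the model of "normal".
            self.mean = x
            self.count = 1
            return False

        deviation = x - self.mean
        std = math.sqrt(self.var)
        is_anomaly = (
            self.count >= self.warmup
            and std > 0
            and abs(deviation) > self.threshold * std
        )

        if not is_anomaly:
            # Only normal-looking points update the model, so a single spike
            # does not inflate the variance estimate.
            self.mean += self.alpha * deviation
            self.var = (1 - self.alpha) * (self.var + self.alpha * deviation ** 2)

        self.count += 1
        return is_anomaly


def find_anomalies(readings, **kwargs):
    """Return the indices of readings that the detector flags."""
    detector = EwmaAnomalyDetector(**kwargs)
    return [i for i, x in enumerate(readings) if detector.update(x)]


# Hypothetical sensor readings with one obvious spike at index 12.
readings = [10.0, 10.2, 9.9, 10.1, 9.8, 10.0, 10.3, 9.7,
            10.1, 9.9, 10.0, 10.2, 25.0, 10.1]
print(find_anomalies(readings))  # prints [12]
```

The spike is flagged only because the detector has already learned what typical readings look like. Because the model keeps adapting, slow drift in the readings is absorbed into "normal" instead of raising alarms. Data with seasonality or more complicated patterns would need a richer model than this.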
Collecting actionable data is a challenge for today's data tools
One of the problems dragging down the US health care system is that nobody trusts one another. Most of us, as individuals, place faith in our personal health care providers, which may or may not be warranted. But on a larger scale we’re all suspicious of each other:
- Doctors don’t trust patients, who aren’t forthcoming with all the bad habits they indulge in and often fail to follow the most basic instructions, such as to take their medications.
- The payers–which include insurers, many government agencies, and increasingly the whole patient population as our deductibles and other out-of-pocket expenses ascend–don’t trust the doctors, who waste an estimated 20% or more of all health expenditures, including some thirty or more billion dollars of fraud each year.
- The public distrusts the pharmaceutical companies (although we still follow their advice on advertisements and ask our doctors for the latest pill) and is starting to distrust clinical researchers as we hear about conflicts of interest and difficulties replicating results.
- Nobody trusts the federal government, which pursues two (contradictory) goals of lowering health care costs and stimulating employment.
Yet everyone has beneficent goals and good ideas for improving health care. Doctors want to feel effective, patients want to stay well (even if that desire doesn’t always translate into action), the Department of Health and Human Services champions very lofty goals for data exchange and quality improvement, clinical researchers put their work above family and comfort, and even private insurance companies are trying to move to “fee for value” programs that ensure coordinated patient care.
- SAMOA — Yahoo!’s distributed streaming machine learning (ML) framework that contains a programming abstraction for distributed streaming ML algorithms. (via Introducing SAMOA)
- madlib — an open-source library for scalable in-database analytics. It provides data-parallel implementations of mathematical, statistical and machine-learning methods for structured and unstructured data.
- Data Portraits: Connecting People of Opposing Views — Yahoo! Labs research to break the filter bubble. Connect people who disagree on issue X (e.g., abortion) but who agree on issue Y (e.g., Latin American interventionism), and present the differences and similarities visually (they used wordclouds). Our results suggest that organic visualisation may revert the negative effects of providing potentially sensitive content. (via MIT Technology Review)
- Disguise Detection — using Raspberry Pi, Arduino, and Python.
Tutorials for designers, data scientists, data engineers, and managers
As the Program Development Director for Strata Santa Clara 2014, I am pleased to announce that the tutorial session descriptions are now live. We’re pleased to offer several day-long immersions including the popular Data Driven Business Day and Hardcore Data Science tracks. We curated these topics as we wanted to appeal to a broad range of attendees including business users and managers, designers, data analysts/scientists, and data engineers. In the coming months we’ll have a series of guest posts from many of the instructors and communities behind the tutorials.
Analytics for Business Users
We’re offering a series of data-intensive tutorials for non-programmers. John Foreman will use spreadsheets to demonstrate how data science techniques work step-by-step – a topic that should appeal to those tasked with advanced business analysis. Grammar of Graphics author, SYSTAT creator, and noted statistician Leland Wilkinson will teach an introductory course on analytics using an innovative expert system he helped build.
Data Science essentials
Scalding – a Scala API for Cascading – is one of the most popular open source projects in the Hadoop ecosystem. Vitaly Gordon will lead a hands-on tutorial on how to use Scalding to put together effective data processing workflows. Data analysts have long lamented the amount of time they spend on data wrangling. But what if you had access to tools and best practices that would make data wrangling less tedious? That’s exactly the tutorial that distinguished professors and Trifacta co-founders Joe Hellerstein and Jeff Heer are offering.
The co-founders of Datascope Analytics are offering a glimpse into how they help clients identify the appropriate problem or opportunity to focus on by using design thinking (see the recent Datascope/IDEO post on Design Thinking and Data Science). We’re also happy to reprise the popular (Strata Santa Clara 2013) d3.js tutorial by Scott Murray.
Archimedes advances evidence-based medicine to foster model-based medicine
This posting is by guest author Tuan Dinh, who will speak about this topic at the Strata Rx conference.
Legendary Silicon Valley investor Vinod Khosla caused quite a stir last year when he predicted at Strata Rx that “Dr. Algorithm”–artificial intelligence driven by large data sets and computational power–would replace doctors in the not-too-distant future. At that point, he said, technology will be cheaper, more accurate and objective, and will ultimately do a better job than the average human doctor at delivering routine diagnoses with standard treatments.
I not only support Khosla’s provocative prophecy, I’ll add one of my own: that Dr. Algorithm (aka Dr. A) will “come to life” in three to five years, by the time today’s first-year med school students are pulling 30-hour shifts as new interns. But what will it take to build the brain of Dr. A? And how can we teach Dr. A to account for increasingly complex medical inputs, such as laboratory test results, genomic/genetic information, family and personal history, co-morbidities and patient preferences, so he can make optimal clinical decisions for living, breathing patients?