- US Providers Must Divulge from Offshore Servers (Gigaom) — A U.S. magistrate judge ruled that U.S. cloud vendors must fork over customer data even if that data resides in data centers outside the country. (via Alistair Croll)
- Inside Google’s Self-Driving Car (Atlantic Cities) — Urmson says the value of maps is one of the key insights that emerged from the DARPA challenges. They give the car a baseline expectation of its environment; they’re the difference between the car opening its eyes in a completely new place and having some prior idea what’s going on around it. This is a long and interesting piece on the experience and the creator’s concerns around the self-driving cars. Still looking for the comprehensive piece on the subject.
- Recent Robotics-Related IPOs — not all the exits are to Google.
- How One Woman Hid Her Pregnancy From Big Data (Mashable) — “I really couldn’t have done it without Tor, because Tor was really the only way to manage totally untraceable browsing. I know it’s gotten a bad reputation for Bitcoin trading and buying drugs online, but I used it for BabyCenter.com.”
ENTRIES TAGGED "analytics"
A practical example of how anomaly detection makes complex data problems easier to solve.
As new tools for distributed storage and analysis of big data are becoming more stable and widely known, there is a growing need for discovering best practices for analytics at this scale. One of the areas of widespread interest that crosses many verticals is anomaly detection.
At its best, anomaly detection is used to find unusual, rarely occurring events or data for which little is known in advance. Examples include changes in sensor data reported for a variety of parameters, suspicious behavior on secure websites, or unexpected changes in web traffic. In some cases, the data patterns being examined are simple and regular and, thus, fairly easy to model.
Anomaly detection approaches start with some essential but sometimes overlooked ideas about anomalies:
- Anomalies are defined not by their own characteristics but in contrast to what is normal.
- Before you can spot an anomaly, you first have to figure out what “normal” actually is.
This need to first discover what is considered “normal” may seem obvious, but it is not always obvious how to do it, especially in situations with complicated patterns of behavior. Best results are achieved when you use statistical methods to build an adaptive model of events in the system you are analyzing as a first step toward discovering anomalous behavior.
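To make the idea of an adaptive model of "normal" concrete, here is a minimal sketch in Python. It is not the method from the original post: it assumes a single numeric stream and swaps in one simple statistical technique, an exponentially weighted running mean and variance, flagging a value when it falls more than a few standard deviations from the current baseline. The names `AdaptiveBaseline` and `find_anomalies`, and the `alpha`, `threshold`, and `warmup` parameters, are hypothetical and chosen only for illustration.

```python
import math


class AdaptiveBaseline:
    """Running estimate of "normal": exponentially weighted mean and variance.

    Hypothetical helper for illustration; parameter defaults are assumptions.
    """

    def __init__(self, alpha=0.1, threshold=3.0, warmup=10):
        self.alpha = alpha          # how quickly the model adapts to new data
        self.threshold = threshold  # deviation (in std units) that counts as anomalous
        self.warmup = warmup        # observations to see before flagging anything
        self.mean = 0.0
        self.var = 0.0
        self.count = 0

    def observe(self, x):
        """Score x against the current model, then fold x into the model."""
        self.count += 1
        if self.count == 1:
            # The first observation only seeds the baseline.
            self.mean = x
            return False
        diff = x - self.mean
        std = math.sqrt(self.var)
        # An anomaly is defined relative to the learned baseline, not by x alone.
        # A zero-variance baseline never flags anything in this sketch.
        is_anomaly = (
            self.count > self.warmup
            and std > 0
            and abs(diff) > self.threshold * std
        )
        # Exponentially weighted update of mean and variance.
        incr = self.alpha * diff
        self.mean += incr
        self.var = (1 - self.alpha) * (self.var + diff * incr)
        return is_anomaly


def find_anomalies(values, **kwargs):
    """Return the indices of values flagged as anomalous."""
    model = AdaptiveBaseline(**kwargs)
    return [i for i, x in enumerate(values) if model.observe(x)]


# A simple, regular pattern with one spike in the middle.
readings = [10.0, 11.0] * 15 + [50.0] + [10.0, 11.0] * 5
print(find_anomalies(readings))
```

Because every observation is folded into the model, including the spike, the baseline stays adaptive but is less sensitive for a short while after a large anomaly. A sketch like this only suits the simple, regular patterns mentioned above; complicated patterns of behavior need a richer model of normal.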
Cloud Jurisdiction, Driverless Cars, Robotics IPOs, and Fitting a Catalytic Convertor to Your Data Exhaust
- 16 Interviewing Tips for User Studies — these apply to many situations beyond user interviews, too.
- The Backlash Against Big Data contd. (Mike Loukides) — Learn to be a data skeptic. That doesn’t mean becoming skeptical about the value of data; it means asking the hard questions that anyone claiming to be a data scientist should ask. Think carefully about the questions you’re asking, the data you have to work with, and the results that you’re getting. And learn that data is about enabling intelligent discussions, not about turning a crank and having the right answer pop out.
- The Science of Science Writing (American Scientist) — also applicable beyond the specific field for which it was written.
Collecting actionable data is a challenge for today's data tools
One of the problems dragging down the US health care system is that nobody trusts one another. Most of us, as individuals, place faith in our personal health care providers, which may or may not be warranted. But on a larger scale we’re all suspicious of each other:
- Doctors don’t trust patients, who aren’t forthcoming with all the bad habits they indulge in and often fail to follow the most basic instructions, such as to take their medications.
- The payers–which include insurers, many government agencies, and increasingly the whole patient population as our deductibles and other out-of-pocket expenses ascend–don’t trust the doctors, who waste an estimated 20% or more of all health expenditures, including some thirty or more billion dollars of fraud each year.
- The public distrusts the pharmaceutical companies (although we still follow their advice on advertisements and ask our doctors for the latest pill) and is starting to distrust clinical researchers as we hear about conflicts of interest and difficulties replicating results.
- Nobody trusts the federal government, which pursues two (contradictory) goals of lowering health care costs and stimulating employment.
Yet everyone has beneficent goals and good ideas for improving health care. Doctors want to feel effective, patients want to stay well (even if that desire doesn’t always translate into action), the Department of Health and Human Services champions very lofty goals for data exchange and quality improvement, clinical researchers put their work above family and comfort, and even private insurance companies are trying to move to “fee for value” programs that ensure coordinated patient care.
Library Box, Data-Driven Racial Profiling, Internet of Washing Machines, and Nokia's IoT R&D
- Librarybox 2.0 — fork of PirateBox for the TP-Link MR 3020, customized for educational, library, and other needs. Wifi hotspot with free and anonymous file sharing. v2 adds mesh networking and more. (via BoingBoing)
- Chicago PD’s Using Big Data to Justify Racial Profiling (Cory Doctorow) — The CPD refuses to share the names of the people on its secret watchlist, nor will it disclose the algorithm that put it there. [...] Asserting that you’re doing science but you can’t explain how you’re doing it is a nonsense on its face. Spot on.
- Cloudwash (BERG) — very good mockup of how and why your washing machine might be connected to the net and bound to your mobile phone. No face on it, though. They’re losing their touch.
- What’s Left of Nokia to Bet on Internet of Things (MIT Technology Review) — With the devices division gone, the Advanced Technologies business will cut licensing deals and perform advanced R&D with partners, with around 600 people around the globe, mainly in Silicon Valley and Finland. Hopefully will not devolve into being a patent troll. [...] “We are now talking about the idea of a programmable world. [...] If you believe in such a vision, as I do, then a lot of our technological assets will help in the future evolution of this world: global connectivity, our expertise in radio connectivity, materials, imaging and sensing technologies.”
Real Time Exploratory Analytics, Algorithmic Agendas, Disassembly Engine, and Future of Employment
- Druid — open source clustered data store (not key-value store) for real-time exploratory analytics on large datasets.
- It’s Time to Engineer Some Filter Failure (Jon Udell) — Our filters have become so successful that we fail to notice: We don’t control them, They have agendas, and They distort our connections to people and ideas. That idea that algorithms have agendas is worth emphasising. Reality doesn’t have an agenda, but the deployer of a similarity metric has decided what features to look for, what metric they’re optimising, and what to do with the similarity data. These are all choices with an agenda.
- Capstone — open source multi-architecture disassembly engine.
- The Future of Employment (PDF) — We note that this prediction implies a truncation in the current trend towards labour market polarization, with growing employment in high and low-wage occupations, accompanied by a hollowing-out of middle-income jobs. Rather than reducing the demand for middle-income occupations, which has been the pattern over the past decades, our model predicts that computerisation will mainly substitute for low-skill and low-wage jobs in the near future. By contrast, high-skill and high-wage occupations are the least susceptible to computer capital. (via The Atlantic)
Data Pipeline, Data Driven Education, Crowdsourced Proofreading, and 3D Printed Shoes
- Suro (Github) — Netflix data pipeline service for large volumes of event data. (via Ben Lorica)
- NIPS Workshop on Data Driven Education — lots of research papers around machine learning, MOOC data, etc.
- Proofist — crowdsourced proofreading game.
- 3D-Printed Shoes (YouTube) — LeWeb talk from the founder of the company, Continuum Fashion. (via Brady Forrest)
- SAMOA — Yahoo!’s distributed streaming machine learning (ML) framework that contains a programming abstraction for distributed streaming ML algorithms. (via Introducing SAMOA)
- madlib — an open-source library for scalable in-database analytics. It provides data-parallel implementations of mathematical, statistical and machine-learning methods for structured and unstructured data.
- Data Portraits: Connecting People of Opposing Views — Yahoo! Labs research to break the filter bubble. Connect people who disagree on issue X (e.g., abortion) but who agree on issue Y (e.g., Latin American interventionism), and present the differences and similarities visually (they used wordclouds). Our results suggest that organic visualisation may revert the negative effects of providing potentially sensitive content. (via MIT Technology Review)
- Disguise Detection — using Raspberry Pi, Arduino, and Python.
Tutorials for designers, data scientists, data engineers, and managers
As the Program Development Director for Strata Santa Clara 2014, I am pleased to announce that the tutorial session descriptions are now live. We’re pleased to offer several day-long immersions including the popular Data Driven Business Day and Hardcore Data Science tracks. We curated these topics as we wanted to appeal to a broad range of attendees including business users and managers, designers, data analysts/scientists, and data engineers. In the coming months we’ll have a series of guest posts from many of the instructors and communities behind the tutorials.
Analytics for Business Users
We’re offering a series of data-intensive tutorials for non-programmers. John Foreman will use spreadsheets to demonstrate how data science techniques work step-by-step – a topic that should appeal to those tasked with advanced business analysis. Grammar of Graphics author, SYSTAT creator, and noted statistician Leland Wilkinson will teach an introductory course on analytics using an innovative expert system he helped build.
Data Science essentials
Scalding – a Scala API for Cascading – is one of the most popular open source projects in the Hadoop ecosystem. Vitaly Gordon will lead a hands-on tutorial on how to use Scalding to put together effective data processing workflows. Data analysts have long lamented the amount of time they spend on data wrangling. But what if you had access to tools and best practices that would make data wrangling less tedious? That’s exactly the tutorial that distinguished professors and Trifacta co-founders Joe Hellerstein and Jeff Heer are offering.
The co-founders of Datascope Analytics are offering a glimpse into how they help clients identify the appropriate problem or opportunity to focus on by using design thinking (see the recent Datascope/IDEO post on Design Thinking and Data Science). We’re also happy to reprise the popular (Strata Santa Clara 2013) d3.js tutorial by Scott Murray.
Archimedes advances evidence-based medicine to foster model-based medicine
This posting is by guest author Tuan Dinh, who will speak about this topic at the Strata Rx conference.
Legendary Silicon Valley investor Vinod Khosla caused quite a stir last year when he predicted at Strata Rx that “Dr. Algorithm”–artificial intelligence driven by large data sets and computational power–would replace doctors in the not-too-distant future. At that point, he said, technology will be cheaper, more accurate and objective, and will ultimately do a better job than the average human doctor at delivering routine diagnoses with standard treatments.
I not only support Khosla’s provocative prophecy, I’ll add one of my own: that Dr. Algorithm (aka Dr. A) will “come to life” in three to five years, by the time today’s first-year med school students are pulling 30-hour shifts as new interns. But what will it take to build the brain of Dr. A? And how can we teach Dr. A to account for increasingly complex medical inputs, such as laboratory test results, genomic/genetic information, family and personal history, co-morbidities and patient preferences, so he can make optimal clinical decisions for living, breathing patients?