- Eight Docker Development Patterns (Vidar Hokstad) — patterns for creating repeatable builds that result in as-static-as-possible server environments.
- How to Make More Published Research True (PLOSmedicine) — overview of efforts, and research on those efforts, to raise the proportion of published research which is true.
- Gearpump — Intel’s “actor-driven streaming framework”. Initial benchmarks show that we can process 2 million messages/second (100 bytes per message) with latency around 30ms on a cluster of 4 nodes.
- Foundations of Data Science (PDF) — These notes are a first draft of a book being written by Hopcroft and Kannan [of Microsoft Research] and in many places are incomplete. However, the notes are in good enough shape to prepare lectures for a modern theoretical course in computer science.
"data science" entries
Researchers and startups are building tools that enable feature discovery.
Why do data scientists spend so much time on data wrangling and data preparation? In many cases it’s because they want access to the best variables with which to build their models. These variables are known as features in machine-learning parlance. For many data applications, feature engineering and feature selection are just as important as (if not more important than) choice of algorithm:
Good features allow a simple model to beat a complex model.
(to paraphrase Alon Halevy, Peter Norvig, and Fernando Pereira)
The terminology can be a bit confusing, but to put things in context one can simplify the data science pipeline to highlight the importance of features:
Feature Engineering or the Creation of New Features
A simple example to keep in mind is text mining. One starts with raw text (documents) and extracted features could be individual words or phrases. In this setting, a feature could indicate the frequency of a specific word or phrase. Features are then used to classify and cluster documents, or extract topics associated with the raw text. The process usually involves the creation of new features (feature engineering) and identifying the most essential ones (feature selection).
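The text-mining example can be sketched in a few lines of Python. This is a minimal illustration, not a recommended pipeline: the sample documents and the function names (`build_features`, `select_features`, `to_matrix`) are made up for the example, and the selection rule (keep words that occur in at least two documents) is a deliberately crude stand-in for real feature selection, which would normally score features against the prediction target.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def build_features(documents):
    """Feature engineering: map each raw document to word-frequency features."""
    return [Counter(tokenize(doc)) for doc in documents]


def select_features(feature_rows, min_docs=2, max_features=5):
    """Feature selection (assumed rule): keep words found in at least
    min_docs documents, ranked by document frequency, then alphabetically."""
    doc_freq = Counter()
    for row in feature_rows:
        doc_freq.update(row.keys())  # count each word once per document
    candidates = [(word, n) for word, n in doc_freq.items() if n >= min_docs]
    candidates.sort(key=lambda item: (-item[1], item[0]))
    return [word for word, _ in candidates[:max_features]]


def to_matrix(feature_rows, vocabulary):
    """Build one row of word counts per document, one column per kept word."""
    return [[row[word] for word in vocabulary] for row in feature_rows]


# Hypothetical sample documents, used only for illustration.
documents = [
    "Spark makes large scale data processing simple",
    "Feature engineering matters for data science",
    "Good features beat complex models in data science",
]

rows = build_features(documents)
vocabulary = select_features(rows)
matrix = to_matrix(rows, vocabulary)
print(vocabulary)
print(matrix)
```

The resulting matrix is what a classifier or clustering algorithm would consume; the raw text itself never reaches the model.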
New report covers areas of innovation and their difficulties
O’Reilly recently released a report I wrote called The Information Technology Fix for Health: Barriers and Pathways to the Use of Information Technology for Better Health Care. Along with our book Hacking Healthcare, I hope this report helps programmers who are curious about Health IT see what they need to learn and what they in turn can contribute to the field.
Computers in health are a potentially lucrative domain, to be sure, given a health care system through which $2.8 trillion, or $8,915 per person, passes each year in the US alone. Interest by venture capitalists ebbs and flows, but the impetus to creative technological hacking is strong, as shown by the large number of challenges run by governments, pharmaceutical companies, insurers, and others.
Some things you should consider doing include:
- Join open source projects
- Numerous projects to collect and process health data are being conducted as free software; find one that raises your heartbeat and contribute. For instance, the most respected health care system in the country, VistA from the Department of Veterans Affairs, has new leadership in OSEHRA, which is trying to create a community of vendors and volunteers. You don’t need to understand the oddities of the MUMPS language on which VistA is based to contribute, although I believe some knowledge of the underlying database would be useful. But there are plenty of other projects too, such as the OpenMRS electronic record system and the projects that cooperate under the aegis of Open Health Tools.
Ignore the hype. Learn to be a data skeptic.
Yawn. Yet another article trashing “big data,” this time an op-ed in the Times. This one is better than most, and ends with the truism that data isn’t a silver bullet. It certainly isn’t.
I’ll spare you all the links (most of which are much less insightful than the Times piece), but the backlash against “big data” is clearly in full swing. I wrote about this more than a year ago, in my piece on data skepticism: data is heading into the trough of a hype curve, driven by overly aggressive marketing, promises that can’t be kept, and spurious claims that, if you have enough data, correlation is as good as causation. It isn’t; it never was; it never will be. The paradox of data is that the more data you have, the more spurious correlations will show up. Good data scientists understand that. Poor ones don’t.
It’s very easy to say that “big data is dead” while you’re using Google Maps to navigate downtown Boston. It’s easy to say that “big data is dead” while Google Now or Siri is telling you that you need to leave 20 minutes early for an appointment because of traffic. And it’s easy to say that “big data is dead” while you’re using Google, or Bing, or DuckDuckGo to find material to help you write an article claiming that big data is dead.
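The point about spurious correlations is easy to check with a small simulation. The sketch below reads "more data" as "more variables measured", which is an assumption about the intended meaning. It draws a random target and up to 1,000 random features that are independent of it by construction, then counts how many features nonetheless correlate with the target above an arbitrary cutoff. The sample size, cutoff, and seed are illustrative choices.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def count_spurious(target, features, threshold):
    """Count features whose absolute correlation with the target exceeds threshold."""
    return sum(1 for f in features if abs(pearson(f, target)) > threshold)


rng = random.Random(0)  # fixed seed so repeated runs give the same counts
n_samples = 20          # few observations per variable
threshold = 0.5         # arbitrary cutoff for a "strong" correlation

# Every series is pure noise, so any correlation found here is spurious.
target = [rng.gauss(0, 1) for _ in range(n_samples)]
features = [[rng.gauss(0, 1) for _ in range(n_samples)] for _ in range(1000)]

counts = {k: count_spurious(target, features[:k], threshold) for k in (10, 100, 1000)}
for k, found in counts.items():
    print(f"{k:>5} random features -> {found} exceed |r| > {threshold}")
```

Because each larger feature set contains the smaller ones, the count can only stay flat or grow as more variables are added, even though none of them has any real relationship to the target.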
Focusing attention on the present lets organizations pursue existing opportunities as opposed to projected ones
Slow and Unaware
It was 2005. The war in Iraq was raging. Many of us in the national security R&D community were developing responses to the deadliest threat facing U.S. soldiers: the improvised explosive device (IED). From the perspective of the U.S. military, the unthinkable was happening each and every day. The world’s most technologically advanced military was being dealt significant blows by insurgents making crude weapons from limited resources. How was this even possible?
The war exposed the limits of our unwavering faith in technology. We depended heavily on technology to provide us the advantage in an environment we did not understand. When that failed, we were slow to learn. Meanwhile the losses continued. We were being disrupted by a patient, persistent organization that rapidly experimented and adapted to conditions on the ground.
To regain the advantage, we needed to start by asking different questions. We needed to shift our focus from the devices that were destroying U.S. armored vehicles to the people responsible for building and deploying the weapons. This motivated new approaches to collect data that could expose elements of the insurgent network.
New organizations and modes of operation were also required to act swiftly when discoveries were made. By integrating intelligence and special operations capabilities into a single organization with crisp objectives and responsive leadership, the U.S. dramatically accelerated its ability to disrupt insurgent operations. Rapid orientation and action were key in this dynamic environment where opportunities persisted for an often unknown and very limited period of time.
This story holds important and underappreciated lessons that apply to the challenges numerous organizations face today. The ability to collect, store, and process large volumes of data doesn’t confer advantage by default. It’s still common to fixate on the wrong questions and fail to recover quickly when mistakes are made. To accelerate organizational learning with data, we need to think carefully about our objectives and have realistic expectations about what insights we can derive from measurement and analysis.
More than algorithms, companies gain access to models that incorporate ideas generated by teams of data scientists
Data scientists were among the earliest and most enthusiastic users of crowdsourcing services. Lukas Biewald noted in a recent talk that one of the reasons he started CrowdFlower was that as a data scientist he got frustrated with having to create training sets for many of the problems he faced. More recently, companies have been experimenting with active learning (humans take care of uncertain cases, models handle the routine ones). Along those lines, Adam Marcus described in detail how Locu uses crowdsourcing services to perform structured extraction (converting semi/unstructured data into structured data).
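The division of labor described above (models handle routine cases, humans handle uncertain ones) can be sketched as a confidence-based router. This is an illustrative sketch only, not how CrowdFlower or Locu implement it: `toy_model`, its keyword list, and the 0.9 threshold are hypothetical, and a full active-learning loop would also feed the human-provided labels back into model training.

```python
def route_predictions(items, predict, threshold=0.9):
    """Split items into auto-labeled results and a queue for human review.

    predict(item) must return a (label, confidence) pair, where confidence
    is a number between 0 and 1.
    """
    auto_labeled, needs_review = [], []
    for item in items:
        label, confidence = predict(item)
        if confidence >= threshold:
            auto_labeled.append((item, label))  # routine case: trust the model
        else:
            needs_review.append(item)           # uncertain case: ask a person
    return auto_labeled, needs_review


def toy_model(text):
    """Hypothetical stand-in for a trained classifier (keyword counting)."""
    hits = sum(word in text.lower() for word in ("pizza", "menu", "burger"))
    if hits == 0:
        return "other", 0.95
    if hits == 1:
        return "restaurant", 0.60  # a single weak signal: low confidence
    return "restaurant", 0.97


items = [
    "Pizza and burger menu",
    "Quarterly tax filing",
    "New menu of services",
]
auto_labeled, needs_review = route_predictions(items, toy_model)
print("auto:", auto_labeled)
print("review:", needs_review)
```

Raising the threshold sends more items to people and fewer to the model, which is the main cost and quality trade-off in this kind of setup.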
Another area where crowdsourcing is popping up is feature engineering and feature discovery. Experienced data scientists will attest that generating features is as important as (if not more important than) choice of algorithm. Startup CrowdAnalytix uses public/open data sets to help companies enhance their analytic models. The company has access to several thousand data scientists spread across 50 countries and counts a major social network among its customers. Its current focus is on providing “enterprise risk quantification services to Fortune 1000 companies”.
CrowdAnalytix breaks up projects into two phases: feature engineering and modeling. During the feature engineering phase, data scientists are presented with a problem (dependent variable(s)) and are asked to propose features (predictors) and brief explanations for why they might prove useful. A panel of judges evaluates features based on the accompanying evidence and explanations. Typically 100+ teams enter this phase of the project, and 30+ teams propose reasonable features.
It's easier to "discover" features with tools that have broad coverage of the data science workflow
Interface languages: Python, R, SQL (and Scala)
This is a great time to be a data scientist or data engineer who relies on Python or R. For starters, there are developer tools that simplify setup and package installation and provide user interfaces designed to boost productivity (RStudio, Continuum, Enthought, Sense).
Increasingly, Python and R users can write the same code and run it against many different execution engines. Over time the interface languages will remain constant but the execution engines will evolve or even get replaced. Specifically, there are now many tools that target Python and R users interested in implementations of algorithms that scale to large data sets (e.g., GraphLab, wise.io, Adatao, H2O, Skytree, Revolution R). Interfaces for popular engines like Hadoop and Apache Spark are also available: PySpark users can access algorithms in MLlib, and SparkR users can use existing R packages.
In addition, many of these new frameworks go out of their way to ease the transition for Python and R users. wise.io “… bindings follow the Scikit-Learn conventions”, and as I noted in a recent post, with SFrames and Notebooks GraphLab, Inc. built components that are easy for Python users to learn.
Hardcore Data Science speakers provided many practical suggestions and tips
One of the most popular offerings at Strata Santa Clara was Hardcore Data Science day. Over the next few weeks we hope to profile some of the speakers who presented, and make the video of the talks available as a bundle. In the meantime here are some notes and highlights from a day packed with great talks.
We’ve come to think of analytics as being composed primarily of data and algorithms. Once data has been collected, “wrangled”, and stored, algorithms are unleashed to unlock its value. Longtime machine-learning researcher Alice Zheng of GraphLab reminded attendees that data structures are critical to scaling machine-learning algorithms. Unfortunately there is a disconnect between machine-learning research and implementation (so much so that some recent advances in large-scale ML are “rediscoveries” of known data structures):
While there are many data structures that arise in computer science, Alice devoted her talk to two data structures that are widely used in machine learning: