- Suro (Github) — Netflix data pipeline service for large volumes of event data. (via Ben Lorica)
- NIPS Workshop on Data Driven Education — lots of research papers around machine learning, MOOC data, etc.
- Proofist — crowdsourced proofreading game.
- 3D-Printed Shoes (YouTube) — LeWeb talk from founder of the company, Continuum Fashion). (via Brady Forrest)
- SAMOA — Yahoo!’s distributed streaming machine learning (ML) framework that contains a programming abstraction for distributed streaming ML algorithms. (via Introducing SAMOA)
- madlib — an open-source library for scalable in-database analytics. It provides data-parallel implementations of mathematical, statistical and machine-learning methods for structured and unstructured data.
- Data Portraits: Connecting People of Opposing Views — Yahoo! Labs research to break the filter bubble. Connect people who disagree on issue X (e.g., abortion) but who agree on issue Y (e.g., Latin American interventionism), and present the differences and similarities visually (they used wordclouds). Our results suggest that organic visualisation may revert the negative effects of providing potentially sensitive content. (via MIT Technology Review)
- Disguise Detection — using Raspberry Pi, Arduino, and Python.
Tutorials for designers, data scientists, data engineers, and managers
As the Program Development Director for Strata Santa Clara 2014, I am pleased to announce that the tutorial session descriptions are now live. We’re pleased to offer several day-long immersions including the popular Data Driven Business Day and Hardcore Data Science tracks. We curated these topics as we wanted to appeal to a broad range of attendees including business users and managers, designers, data analysts/scientists, and data engineers. In the coming months we’ll have a series of guest posts from many of the instructors and communities behind the tutorials.
Analytics for Business Users
We’re offering a series of data intensive tutorials for non-programmers. John Foreman will use spreadsheets to demonstrate how data science techniques work step-by-step – a topic that should appeal to those tasked with advanced business analysis. Grammar of Graphics author, SYSTAT creator, and noted Statistician Leland Wilkinson, will teach an introductory course on analytics using an innovative expert system he helped build.
Data Science essentials
Scalding – a Scala API for Cascading – is one of the most popular open source projects in the Hadoop ecosystem. Vitaly Gordon will lead a hands-on tutorial on how to use Scalding to put together effective data processing workflows. Data analysts have long lamented the amount of time they spend on data wrangling. But what if you had access to tools and best practices that would make data wrangling less tedious? That’s exactly the tutorial that distinguished Professors and Trifacta co-founders, Joe Hellerstein and Jeff Heer, are offering.
The co-founders of Datascope Analytics are offering a glimpse into how they help clients identify the appropriate problem or opportunity to focus on by using design thinking (see the recent Datascope/IDEO post on Design Thinking and Data Science). We’re also happy to reprise the popular (Strata Santa Clara 2013) d3.js tutorial by Scott Murray.
Archimedes advances evidence-based medicine to foster model-based medicine
This posting is by guest author Tuan Dinh, who will speak about this topic at the Strata Rx conference.
Legendary Silicon Valley investor Vinod Khosla caused quite a stir last year when he predicted at Strata Rx that “Dr. Algorithm”–artificial intelligence driven by large data sets and computational power–would replace doctors in the not-too-distant future. At that point, he said, technology will be cheaper, more accurate and objective, and will ultimately do a better job than the average human doctor at delivering routine diagnoses with standard treatments.
I not only support Khosla’s provocative prophecy, I’ll add one of my own: that Dr. Algorithm (aka Dr. A) will “come to life” in three to five years, by the time today’s first-year med school students are pulling 30-hour shifts as new interns. But what will it take to build the brain of Dr. A? And how can we teach Dr. A to account for increasingly complex medical inputs, such as laboratory tests results, genomic/genetic information, family and personal history, co-morbidities and patient preferences, so he can make optimal clinical decisions for living, breathing patients?
Evolution from a research tool to a platform for patient engagement
Bruce Springer of OneHealth will speak about this topic at the Strata Rx conference. This article was written by Patrick Bane of OneHealth in coordination with Bruce Springer.
According to a recent study performed by the Jesse Brown VA Medical Center and University of Illinois at Chicago, patient-centered care has demonstrated positive outcomes on patients’ health, patients’ self-report of health, and reduced healthcare utilization. The study’s results are consistent with previous research that the patient-centered care model improves the quality of care while simultaneously lowering the cost of care.
OneHealth’s behavior change platform extends the patient-centered model by connecting members anytime, anywhere through mobile and web applications. Member generate data in their daily lives, outside of a clinical setting, which creates a much richer dataset of behaviors that are required to understand the patients’ condition(s), and their readiness to change. Members freely choose what to do and their choices actively generate data in five classes of information:
A video interview with Colin Hill
Last month, Strata Rx Program Chair Colin Hill, of GNS Healthcare, sat down with Dr. Dennis Ausiello, Jackson Professor of Clinical Medicine at the Harvard Medical School, Co-Director at CATCH, Pfizer Board of Directors Member, and Former Chief of Medicine at the Massachusetts General Hospital (MGH), for a fireside chat at a private reception hosted by GNS. Their insightful conversation covered a range of topics that all touched on or intersected with the need to create smaller and more precise cohorts, as well as the need to focus on phenotypic data as much as we do on genotypic data.
The full video appears below.
A tool for outreach to patients produces unexpected benefits
The traditional, office-based model for health care is episodic. The provider-patient relationship exists almost completely within the walls of the exam room, with little or no follow-up between visits. Data is primarily episodic as well, based on blood pressure reading done at a specific time or surveys administered there and then, with little collected out of the office. And even the existing data collection tools—paper diaries or clunky meters—are focused more on storing data that on connecting the patient and provider through that data in real time.
There is no way to get in touch when, for instance, a patient’s blood sugar starts varying wildly or pain levels change. The provider often depends on the patient reaching out to them. And even when a provider does put into place an outreach protocol, it is usually very crude, based on a general approach to managing a population as opposed to an understanding of a patient. The end result is a system that, while doing its best within a difficult setting, is by default reactive instead of proactive.
Business analytics projects: Using decisions as a basis to prioritize and identify requirements
Most normal people don’t look at data sets just for fun. They study views of the data to make decisions about what to do, be it a decision to take some specific action or a decision to do nothing at all. The main purpose of business analytics projects is to develop systems that turn large and often highly complex data sets into meaningful information from which decisions can be made.
The decisions that people make using business analytics systems can be strategic, operational, or tactical. For example, an executive might look at his sales team’s global performance dashboard to decide who to promote (tactical), which products need different marketing strategies (operational), or which products to target by markets (strategic). Generally speaking, all software systems that include an analytics component should enable users to make decisions that improve organizational performance in some dimension.
Modern data processing tools, many of them open source, allow more clinical studies at lower costs
This guest posting was written by Yadid Ayzenberg (@YadidAyzenberg on Twitter). Yadid is a PhD student in the Affective Computing Group at the MIT Media Lab. He has designed and implemented cloud platforms for the aggregation, processing and visualization of bio-physiological sensor data. Yadid will speak on this topic at the Strata Rx conference.
A few weeks ago, I learned that the Framingham Heart Study would lose $4 million (a full 40 percent of its funding) from the federal government due to automatic spending cuts. This seminal study, begun in 1948, set out to identify the contributing factors to Cardiovascular Disease (CVD) by following a group of 5,209 men and woman and tracking their life style habits, performing regular physical examinations and lab tests. This study was responsible for finding the major risk factors for CVD, such as high blood pressure and lack of exercise. The costs associated with such large-scale clinical studies are prohibitive, making them accessible only to organizations with sufficient financial resources or through government funding.
Long a development tool, TestFlightApp wants to move into analytics
For most iOS developers, TestFlightApp has become the go-to tool when they want to distribute a development build to testers. For those not familiar with the site, you can register applications, and then upload IPA files signed with either a development or AdHoc profile, either manually or using a desktop app that integrates directly into XCode.
Once uploaded, your testers can be automatically notified via email that there is a new version of the app available, and download it directly onto their device without having to use iTunes. It can even capture device IDs for new users (or new devices for existing users), and export them for use in the Apple developer portal.
You can also add code to have the running app check in with TestFlight. You can add “checkpoint” flags, ask survey questions (“why did you come to this page”), and have console logs and crash reports automatically uploaded to the site.
The problem is, once you’re ready to ship a production version, you have traditionally had to turn everything off and make sure that the Test Flight library was not linked in to the app. This has meant that there was no way to capture crash data from customers running the app. But now that’s changing.
Recently, TestFlightApp announced that it was now OK to leave the library in copies of your app uploaded to the App Store, and to have the app check in with TestFlight. Why the change? Probably because it is needed to support FlightPath, their new analytics tool. FlightPath seems to want to be the Google Analytics of mobile, allowing developers to see how customers use their app and offering demographic data.
FlightPath is likely to be the path that TestFlightApp takes to start monetizing their service. TestFlightApp has always been free, but there has been no pronouncement about whether FlightPath will follow that same model. It is currently in an open beta, so we may have to wait and see what the pricing model for the final product is. Of course, by then, all those beta users will have become hooked.
One major caution for people intending to keep TestFlight in their production code, watch out for leakage of private data! Many test builds spit out tons of information to the console. At times, I’ve had everything going back and forth to a server dumping itself onto the log. If you don’t disable that in the shipping code, you could be accidentally capturing all sorts of sensitive data, including credit cards, HIPPA restricted information, etc. So make sure that you have compiled out (or disabled) anything like that in the production build (which you can test with an AdHoc profile.)