"stratablog" entries

Big data and privacy: an uneasy face-off for government to face

MIT workshop kicks off Obama campaign on privacy

Thrust into controversy by Edward Snowden’s first revelations last year, President Obama belatedly welcomed a “conversation” about privacy. As cynical as you may feel about US spying, that conversation with the federal government has now begun. In particular, the first of three public workshops took place Monday at MIT.

Given the locale, a focus on the technical aspects of privacy was appropriate for this discussion. Speakers cheered about the value of data (invoking the “big data” buzzword often), delineated the trade-offs between accumulating useful data and preserving privacy, and introduced technologies that could analyze encrypted data without revealing facts about individuals. Two more workshops will be held in other cities, one focusing on ethics and the other on law.

Read more…

Machine Data at Strata: “BigData++”

By David Andrzejewski of SumoLogic

Photo Courtesy of David Andrzejewski

Photo Courtesy of David Andrzejewski

A few weeks ago I had the pleasure of hosting the machine data track of talks at Strata Santa Clara. Like “big data”, the phrase “machine data” is associated with multiple (sometimes conflicting) definitions, ­two prominent ones come from Curt Monash and Daniel Abadi. The focus of the machine data track is on data which is generated and/or collected automatically by machines. This includes software logs and sensor measurements from systems as varied as mobile phones, airplane engines, and data centers. The concept is closely related to the “internet of things”, which refers to the trend of increasing connectivity and instrumentation in existing devices, like home thermostats.

More data, more problems
This data can be useful for the early detection of operational problems or the discovery of opportunities for improved efficiency. However, the de­coupling of data generation and collection from human action means that the volume of machine data can grow at machine scales (i.e., Moore’s Law), an issue raised by both Monash and Abadi. This explosive growth rate amplifies existing challenges associated with “big data”. ­ In particular two common motifs among the talks at Strata were the difficulties around:

  1. mechanics: the technical details of data collection, storage, and analysis
  2. semantics: extracting understandable and actionable information from the data deluge

Read more…

Change is hard. Adherence is harder.

How do we motivate sustained behavior change when the external motivation disappears—like it's supposed to?

If you’ve ever tried to count calories, go on a diet, start a new exercise program, change your sleep patterns, spend less time sitting, or make any other type of positive health change, then you know how difficult it is to form new habits. New habits usually require a bit of willpower to get going, and we all know that that’s a scarce resource. (Or at least, a limited one.)

Change is hard. But the real challenge comes after you’ve got a new routine going—because now you’ve got to keep it going, even though your original motivations to change may no longer apply. Why keep dieting when you no longer need to lose weight? We’ve all had the idea at some point that we really should reward ourselves for that five-pound weight loss with a cupcake, right?

Read more…

An Invitation to Practical Machine Learning

PracticalMachineLearning_covDoes it make sense for me to have a car? If so, which one is the best choice for my needs: a gasoline, hybrid, or electric?  And should I buy or lease?

In order to make an effective decision, I need to understand key issues about the design, performance, and cost of cars, regardless of whether or not I actually know how to build one myself. The same is true for people deciding if machine learning is a good choice for their business goals or project.  Will the payoff be worth the effort?  What machine learning approach is most likely to produce valuable results for your particular situation? What size team with what expertise is necessary to be able to develop, deploy, and maintain your machine learning system?

Given the complex and previously esoteric nature of machine learning as a field – the sometimes daunting array of learning algorithms and the math needed to understand and employ them – many people feel the topic is one best left only to the few.

Read more…

Interface Languages and Feature Discovery

It's easier to "discover" features with tools that have broad coverage of the data science workflow

Here are a few more observations based on conversations I had during the just concluded Strata Santa Clara conference.

Interface languages: Python, R, SQL (and Scala)
This is a great time to be a data scientist or data engineer who relies on Python or R. For starters there are developer tools that simplify setup, package installation, and provide user interfaces designed to boost productivity (RStudio, Continuum, Enthought, Sense).

Increasingly, Python and R users can write the same code and run it against many different execution1 engines. Over time the interface languages will remain constant but the execution engines will evolve or even get replaced. Specifically there are now many tools that target Python and R users interested in implementations of algorithms that scale to large data sets (e.g., GraphLab, wise.io, Adatao, H20, Skytree, Revolution R). Interfaces for popular engines like Hadoop and Apache Spark are also available – PySpark users can access algorithms in MLlib, SparkR users can use existing R packages.

In addition many of these new frameworks go out of their way to ease the transition for Python and R users. wise.io “… bindings follow the Scikit-Learn conventions”, and as I noted in a recent post, with SFrames and Notebooks GraphLab, Inc. built components2 that are easy for Python users to learn.

Read more…

Internet of Things in celebration and provocation at MIT

IoTFest reveals exemplary applications as well as challenges

Last Saturday’s IoT Festival at MIT became a meeting ground for people connecting the physical world. Embedded systems developers, security experts, data scientists, and artists all joined in this event. Although it was called a festival, it had a typical conference format with speakers, slides, and question periods. Hallway discussions were intense.

However you define the Internet of Things (O’Reilly has its own take on it, in our Solid blog site and conference), a lot stands in the way of its promise. And these hurdles are more social than technical.

Read more…

Healthcare Lessons from the Data Sages at Strata

Other industries can show health care the way

This article was written with Ellen M. Martin.

Most healthcare clinicians don’t often think about donating or sharing data. Yet, after hearing Stephen Friend of Sage Bionetworks talk about involving citizens and patients in the field of genetic research at StrataRx 2012, I was curious to learn more.

McKinsey points out the 300 billion dollars in potential savings from using open data in healthcare, while a recent IBM Institute of Business Value study showed the need for corporate data collaboration.

Also, during my own research for Big Data in Healthcare: Hype and Hope, the resounding request from all the participants I interviewed was to “find more data streams to analyze.”

Read more…

Extending GraphLab to tables

The popular graph analytics framework extends its coverage of the data science workflow

GraphLab’s SFrame, an interesting and somewhat under-the-radar tool was unveiled1 at Strata Santa Clara. It is a disk-based, flat table representation that extends GraphLab to tabular data. With the addition of SFrame, users can leverage GraphLab’s many algorithms on data stored as either graphs or tables. More importantly SFrame increases GraphLab’s coverage of the data science workflow: it allows users with terabyte-sized datasets to clean their data and create new features directly within GraphLab (SFrame performance can scale linearly with the number of available cores).

The beta version of SFrame can read data from local disk, HDFS, S3 or a URL, and save to a human-readable .csv or a more efficient native format. Once an SFrame is created and saved to disk no reprocessing of the data is needed. Below is Python code that illustrates how to read a .csv file into SFrame, create a new data feature and save it to disk on S3:

Read more…

Health hackathon brings to life agile solutions for unmet needs

Winners of the Blue Button Innovation Challenge

I think the main achievement of hackathons can be measured not by what apps are developed–reportedly, few are commercialized and maintained–but by people who find each other. The Blue Button Innovation Challenge brought together a lot of professionals who had never met before, and many formed teams that created really fun and useful apps that make you think, “Why hasn’t anyone done this yet?”

Read more…

Bridging the gap between research and implementation

Hardcore Data Science speakers provided many practical suggestions and tips

One of the most popular offerings at Strata Santa Clara was Hardcore Data Science day. Over the next few weeks we hope to profile some of the speakers who presented, and make the video of the talks available as a bundle. In the meantime here are some notes and highlights from a day packed with great talks.

Data Structures
We’ve come to think of analytics as being comprised primarily of data and algorithms. Once data has been collected, “wrangled”, and stored, algorithms are unleashed to unlock its value. Longtime machine-learning researcher Alice Zheng of GraphLab, reminded attendees that data structures are critical to scaling machine-learning algorithms. Unfortunately there is a disconnect between machine-learning research and implementation (so much so, that some recent advances in large-scale ML are “rediscoveries” of known data structures):

Data and Algorithms: The Disconnect

While there are many data structures that arise in computer science, Alice devoted her talk to two data structures1 that are widely used in machine-learning:

Read more…