Bridging the gap between research and implementation

Hardcore Data Science speakers provided many practical suggestions and tips

One of the most popular offerings at Strata Santa Clara was Hardcore Data Science day. Over the next few weeks we hope to profile some of the speakers who presented, and to make videos of the talks available as a bundle. In the meantime, here are some notes and highlights from a day packed with great talks.

Data Structures
We’ve come to think of analytics as consisting primarily of data and algorithms. Once data has been collected, “wrangled”, and stored, algorithms are unleashed to unlock its value. Longtime machine-learning researcher Alice Zheng of GraphLab reminded attendees that data structures are critical to scaling machine-learning algorithms. Unfortunately, there is a disconnect between machine-learning research and implementation (so much so that some recent advances in large-scale ML are “rediscoveries” of known data structures):

Data and Algorithms: The Disconnect

While there are many data structures that arise in computer science, Alice devoted her talk to two data structures (1) that are widely used in machine learning:

Data Structures: Graphs and Flat Tables
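
To make the relationship between the two structures concrete, here is a minimal sketch in plain Python. It is not taken from Alice's talk and does not use GraphLab's API; the edge list, the node names, and the helper functions `to_adjacency` and `degree_table` are all hypothetical. It assumes an undirected graph stored as a flat table with one row per edge, converts that table into an adjacency structure, and then summarizes the graph back into a flat table of node degrees.

```python
from collections import defaultdict

# Hypothetical flat table: one row per undirected edge (source, destination).
edges = [
    ("alice", "bob"),
    ("alice", "carol"),
    ("bob", "carol"),
    ("carol", "dave"),
]


def to_adjacency(edge_rows):
    """Turn a flat edge table into a graph stored as node -> set of neighbors."""
    adjacency = defaultdict(set)
    for src, dst in edge_rows:
        # Undirected graph: record the edge in both directions.
        adjacency[src].add(dst)
        adjacency[dst].add(src)
    return dict(adjacency)


def degree_table(adjacency):
    """Summarize the graph as a flat table of (node, degree) rows, sorted by node."""
    return sorted((node, len(neighbors)) for node, neighbors in adjacency.items())


graph = to_adjacency(edges)
for node, degree in degree_table(graph):
    print(node, degree)
```

The edge table is convenient for row-wise operations such as filtering and joining, while the adjacency form makes neighbor lookups cheap. Which one suits a given algorithm depends on its access pattern.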

Practical considerations: implementation, interpretability, accuracy, scalability
Josh Bloom and Henrik Brink of wise.io talked about the importance of treating machine-learning deployment on par with the deployment of other critical software tools. This means maintaining reproducible workflows while balancing competing decision factors (e.g., fast vs. accurate vs. interpretable).

ML Algorithms tradeoff space

Kira Radinsky gave a detailed overview of the interesting analytic tools that power SalesPredict. She reiterated the importance of explainability to the users of data science products. Those users also need to be able to “act” on results produced by algorithms. Kira described a tool (built using social network analysis) to help users connect with prospects produced by SalesPredict. She closed her presentation by hinting at an interesting system that can be used to trace connections (2) and correlations, and gave an illustration (3) of how companies and salespeople may use such a system. Built primarily using web documents and 30 million newspaper articles, the SalesPredict causality graph currently consists of 300 million nodes and 1 billion edges (stored in a small cluster).

SalesPredict Causality Graph

Other bits
Ted Dunning gave an interesting and highly rated overview of anomaly detection, which included tips and hints on how to build systems that combine real-time and long-time data processing and analysis.

Real-time and Longtime

Strata is an event that draws (academic) researchers and practitioners, and hopefully spurs conversations between the two communities. Magda Balazinska presented three interesting academic research projects designed to make big data management systems easier. Judging from the Q&A and the Twitter stream, these academic projects drew lots of interest from the audience.

Oscar Boykin ended the day by describing the use of algebra (monoids) and hashes in data processing and data mining systems already in production at Twitter and other companies. Hashed data can be represented as sets, maps, and other objects that are familiar to programmers. And since hashed sketches are monoids, they are efficient to use in map/reduce or streaming applications.

Hashes and Monoids
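
To illustrate why the monoid property matters, here is a minimal sketch in plain Python of one well-known hashed sketch, the Count-Min sketch. It is not the code presented in the talk and not the API of any production library; the class `CountMinSketch`, the helper `sketch_partition`, the default `width` and `depth`, and the sample partitions are all assumptions made for illustration. The point is that sketches of the same shape have an identity element (an empty sketch) and an associative merge (cell-wise addition), so each partition can be summarized independently and the partial results combined in any grouping.

```python
import hashlib
from functools import reduce


class CountMinSketch:
    """Approximate frequency counter (illustrative, not a production implementation)."""

    def __init__(self, width=256, depth=4, table=None):
        self.width = width
        self.depth = depth
        # An all-zero table is the identity element of the merge operation.
        self.table = table if table is not None else [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # Derive one hash function per row by salting the item with the row number.
        digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count
        return self

    def estimate(self, item):
        # Collisions can only inflate a cell, so the minimum over rows
        # is an upper-biased estimate that never undercounts.
        return min(self.table[row][self._index(item, row)] for row in range(self.depth))

    def merge(self, other):
        """Associative combine: cell-wise addition of two same-shaped sketches."""
        if (self.width, self.depth) != (other.width, other.depth):
            raise ValueError("sketches must have the same width and depth")
        merged = [
            [a + b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(self.table, other.table)
        ]
        return CountMinSketch(self.width, self.depth, merged)


def sketch_partition(items):
    """The 'map' step: summarize one partition of the data on its own."""
    sketch = CountMinSketch()
    for item in items:
        sketch.add(item)
    return sketch


# Hypothetical data split across three workers.
partitions = [["a", "b", "a"], ["a", "c"], ["b", "a"]]
partials = [sketch_partition(p) for p in partitions]

# The 'reduce' step: fold the partial sketches together, starting from the identity.
total = reduce(CountMinSketch.merge, partials, CountMinSketch())
print(total.estimate("a"))
```

Because the merge is associative, the same fold works whether the partial sketches arrive from a batch job or one at a time from a stream, and the merged result is identical to a sketch built in a single pass over all of the data.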

(1) Previewed at Strata, GraphLab’s SFrame is a scalable, tabular data structure that lets users perform many data wrangling tasks.
(2) A big data tool, reminiscent of the software James Burke used in his Strata keynote presentation.
(3) She described how the Causality Graph uncovered how the tsunami impacted the iPad’s supply chain and sales.
