"data science" entries
Solving problems with data necessitates a diversity of thought.
There’s a lot of hype around “Big Data” these days. Don’t believe us? None other than the venerable Harvard Business Review named “data scientist” the “Sexiest Job of the 21st Century” only 13 years into it. Seriously. Some of these accolades are deserved. It’s decidedly cheaper to store data now than it is to analyze it, which is considerably different than 10 or 20 years ago. Other aspects, however, are less deserved.
In isolation, big data and data scientists don’t hold some magic formula that’s going to save the world, radically transform businesses, or eliminate poverty. The act of solving problems is decidedly different than amassing a data set the size of200 trillion Moby Dicks or setting a team of nerds loose on the data. Problem solving not only requires a high-level conceptual understanding of the challenge, but also a deep understanding of the nuances of a challenge, how those nuances affect businesses, governments, and societies, and—don’t forget—the creativity to address these challenges.
In our experience, solving problems with data necessitates a diversity of thought and an approach that balances number crunching with thoughtful design to solve targeted problems. Ironically, we don’t believe this means that it’s important to have an army of PhDs with deep knowledge on every topic under the sun.
Rather, we find it’s important to have multi-disciplinary teams of curious, thoughtful, and motivated learners with a broad range of interests who aren’t afraid to immerse themselves in a totally ambiguous topic. With this common vision, IDEO and Datascope Analytics decided to embark on an experiment and integrate our teams to collaborate on a few big data projects over the last year. We thought we’d share a few things here we’ve learned along the way.
Behind the scenes with Datascope Analytics.
During a trip to Chicago for a conference on R, I had a chance to cowork at the Datascope Analytics (DsA) office. While I had worked with co-founders Mike and Dean before, this was my first time coworking at their office. It was an eye-opening experience. Why? The culture. I saw how this team of data scientists with different backgrounds connected with each other as they worked, collaborated, and joked around. I also observed how intensely present everyone was…whether they were joking or working. I completely understand how much work and commitment it takes to facilitate such a creative and collaborative environment.
Over the next few months, this initial coworking experience led to many conversations with Dean and Mike about building data science teams, Strata, design, and data both in Chicago and the SF Bay Area. I also got to know a few of the other team members such as Aaron, Bo, Gabe, and Irmak. Admittedly, the more I got to know the team, the more intensely curious I became about the human-centered design “ideation” workshops that they hold for clients. According to Aaron, the workshops “combine elements from human-centered design to diverge and converge on valuable and viable ideas, solutions, strategies for our clients. We start by creating an environment that spurs creativity and encourages wild ideas. After developing many different ideas, we cull them down and focus on the ones that are viable to add life and meaning.”
Deep Neural Nets excel at perception tasks. What’s changed since the 1980s? Access to more data and faster computation tools
This past week I had the good fortune of attending two great talks1 on Deep Learning, given by Googlers Ilya Sutskever and Jeff Dean. Much of the excitement surrounding Deep Learning stems from impressive results in a variety of perception tasks, including speech recognition (Google voice search) and visual object recognition (G+ image search).
Data scientists seek to generate information and patterns from raw data. In practice this usually means learning a complicated function for handling a specified task (classify, cluster, predict, etc.). One approach to machine learning mimics how the brain works: starting with basic building blocks (neurons), it approximates complex functions by finding optimal arrangements of neurons (artificial neural networks).
One of the most cited papers in the field showed that any continuous function can be approximated, to arbitrary precision, by a neural network with a single hidden layer. This led some to think that neural networks with single hidden layers would do well on most machine-learning tasks. However this universal approximation property came at a steep cost: the requisite (single hidden layer) neural networks were exponentially inefficient to construct (you needed a neuron for every possible input). For a while neural networks took a backseat to more efficient and scalable techniques like SVM and Random Forest.
At the most basic level, stream mining is about generating summaries that can be used to answer fundamental questions
A series of open source, distributed stream processing frameworks have become essential components in many big data technology stacks. Apache Storm remains the most popular, but promising new tools like Spark Streaming and Apache Samza are going to have their share of users. These tools excel at data processing and are also used for data mining – in many cases users have to write a bit of code1 to do stream mining. The good news is that easy-to-use stream mining libraries will likely emerge in the near future.
High volume data streams (data that arrive continuously) arise in many settings, including IT operations, sensors, and social media. What can one learn by looking at data one piece (or a few pieces) at a time? Can techniques that look at smaller representations of data streams be used to unlock their value? In this post, I’ll briefly summarize a recent overview given by stream mining pioneer Graham Cormode.
Massive amounts of data arriving at high velocity pose a challenge to data miners. At the most basic level, stream mining is about generating summaries that can be used to answer fundamental questions:
Properly constructed summaries are useful for highlighting emerging patterns, trends, and anomalies. Common summaries (frequency moments in stream mining parlance) include a list of distinct items, recently trending items, heavy hitters (items that have appeared frequently), and the top k (most popular) items.
One of the chapters of Think Bayes is based on a class project two of my students worked on last semester. It presents “The Red Line Problem,” which is the problem of predicting the time until the next train arrives, based on the number of passengers on the platform.
Here’s the introduction:
In Boston, the Red Line is a subway that runs between Cambridge and Boston. When I was working in Cambridge I took the Red Line from Kendall Square to South Station and caught the commuter rail to Needham. During rush hour Red Line trains run every 7–8 minutes, on average.
When I arrived at the station, I could estimate the time until the next train based on the number of passengers on the platform. If there were only a few people, I inferred that I just missed a train and expected to wait about 7 minutes. If there were more passengers, I expected the train to arrive sooner. But if there were a large number of passengers, I suspected that trains were not running on schedule, so I would go back to the street level and get a taxi.
While I was waiting for trains, I thought about how Bayesian estimation could help predict my wait time and decide when I should give up and take a taxi. This chapter presents the analysis I came up with.
Sadly, this problem has been overtaken by history: the Red Line now provides real-time estimates for the arrival of the next train. But I think the analysis is interesting, and still applies for subway systems that don’t provide estimates.
Organize solutions into clusters and “force multiply” feedback provided by instructors
One of the hardest things about teaching a large class is grading exams and homework assignments. In my teaching days a “large class” was only in the few hundreds (still a challenge for the TAs and instructor). But in the age of MOOCs, classes with a few (hundred) thousand students aren’t unusual.
Researchers at Stanford recently combed through over one million homework submissions from a large MOOC class offered in 2011. Students in the machine-learning course submitted programming code for assignments that consisted of several small programs (the typical submission was about 16 lines of code). While over 120,000 enrolled only about 10,000 students completed all homework assignments (about 25,000 submitted at least one assignment).
The researchers were interested in figuring out ways to ease the burden of grading the large volume of homework submissions. The premise was that by sufficiently organizing the “space of possible solutions”, instructors would provide feedback to a few submissions, and their feedback could then be propagated to the rest.
Accuracy, simplicity, speed, and interpretability are some of the factors that need to be considered
For companies in the early stages of grappling with big data, the analytic lifecycle (model building, deployment, maintenance) can be daunting. In earlier posts I highlighted some new tools that simplify aspects of the analytic lifecycle, including the early phases of model building. But while tools are allowing companies to offload routine analytic tasks to business analysts, experienced modelers are still needed to fine-tune and optimize, mission-critical algorithms.
Model Selection: Accuracy and other considerations
Accuracy1 is the main objective and a lot of effort goes towards raising it. But in practice tradeoffs have to be made, and other considerations play a role in model selection. Speed (to train/score) is important if the model is to be used in production. Interpretability is critical if a model has to be explained for transparency2 reasons (“black-boxes” are always an option, but are opaque by definition). Simplicity is important for practical reasons: if a model has “too many knobs to tune” and optimizations have to be done manually, it might be too involved to build and maintain it in production3.
Chances are a model that’s fast, easy to explain (interpretable), and easy to tune (simple), is less4 accurate. Experienced model builders are valuable precisely because they’ve weighed these tradeoffs across many domains and settings. Unfortunately not many companies have the experts that can identify, build, deploy, and maintain models at scale. (An example from Google illustrates the kinds of issues that can come up.)
A new startup will accelerate the maturation of the Berkeley Data Analytics Stack
Key technologists behind the Berkeley Data Analytics Stack (BDAS) have launched a company that will build software – centered around Apache Spark and Shark – for analyzing big data. Details of their product and strategy are sparse, as the company is operating in stealth mode. But through conversations with the founders of Databricks, I’ve learned that they’ll be building general purpose analytic tools that can leverage HDFS, YARN, as well as other components of BDAS.
It will be interesting to see how the team transitions to the corporate world. Their Series A funding round of $14M is being led by Andreessen Horowitz. The board will be composed of Ben Horowitz, Scott Shenker, Matei Zaharia, and Ion Stoica.
A general purpose stream processing framework from the team behind Kafka and new techniques for computing approximate quantiles
Largely unknown outside data engineering circles, Apache Kafka is one of the more popular open source, distributed computing projects. Many data engineers I speak with either already use it or are planning to do so. It is a distributed message broker used to store1 and send data streams. Kafka was developed by Linkedin were it remains a vital component of their Big Data ecosystem: many critical online and offline data flows rely on feeds supplied by Kafka servers.
Apache Samza: a distributed stream processing framework
Behind Kafka’s success as an open source project is a team of savvy engineers who have spent2 the last three years making it a rock solid system. The developers behind Kafka realized early on that it was best to place the bulk of data processing (i.e., stream processing) in another system. Armed with specific use cases, work on Samza proceeded in earnest about a year ago. So while they examined existing streaming frameworks (such as Storm, S4, Spark Streaming), Linkedin engineers wanted a system that better fit their needs3 and requirements:
A distributed, near real-time system simplifies the collection, storage, and mining of massive amounts of event data
One of the keys to Twitter’s ability to process 500 millions tweets daily is a software development process that values monitoring and measurement. A recent post from the company’s Observability team detailed the software stack for monitoring the performance characteristics of software services, and alert teams when problems occur. The Observability stack collects 170 million individual metrics (time-series) every minute and serves up 200 million queries per day. Simple query tools are used to populate charts and dashboards (a typical user monitors about 47 charts).
The stack is about three years old1 and consists of instrumentation2 (data collection primarily via Finagle), storage (Apache Cassandra), a query language and execution engine3, visualization4, and basic analytics. Four distinct Cassandra clusters are used to serve different requirements (real-time, historical, aggregate, index). A lot of engineering work went into making these tools as simple to use as possible. The end result is that these different pieces provide a flexible and interactive framework for developers: insert a few lines of (instrumentation) code and start viewing charts within minutes5.