"data science" entries
We are in the early days of productivity technology in data science
Data analysts have long lamented the amount of time they spend on data wrangling. Rightfully so, as some estimates suggest they spend a majority of their time on it. The problem is compounded by the fact that these days, data scientists are encouraged to cast their nets wide, and investigate alternative (unstructured) data sources. The general perception is that data wrangling is the province of programmers and data scientists. Spend time around Excel users and you’ll learn that they do quite a bit of data wrangling too!
In my work I tend to write scripts and small programs to do data wrangling. That usually means some combination of SQL, Python, and Spark. I’ve played with Google Refine (now called OpenRefine) in the past, but I found the UI hard to get used to. Part of the problem may have been that I didn’t use the tool often enough to become comfortable.
For most users data wrangling still tends to mean a series of steps that usually involves different tools (e.g., you often need to draw charts to spot outliers and anomalies). As I’ve pointed out in previous posts, workflows that involve many different tools require a lot of context-switching, which in turn affects productivity and impedes reproducibility.
We are washing our data at the side of the river on stones. We are really in the early, early ages of productivity technology in data science.
Joe Hellerstein (Strata-NYC 2012), co-founder and CEO of Trifacta
Tutorials for designers, data scientists, data engineers, and managers
As the Program Development Director for Strata Santa Clara 2014, I am pleased to announce that the tutorial session descriptions are now live. We’re pleased to offer several day-long immersions including the popular Data Driven Business Day and Hardcore Data Science tracks. We curated these topics as we wanted to appeal to a broad range of attendees including business users and managers, designers, data analysts/scientists, and data engineers. In the coming months we’ll have a series of guest posts from many of the instructors and communities behind the tutorials.
Analytics for Business Users
We’re offering a series of data-intensive tutorials for non-programmers. John Foreman will use spreadsheets to demonstrate how data science techniques work step-by-step – a topic that should appeal to those tasked with advanced business analysis. Leland Wilkinson, author of The Grammar of Graphics, creator of SYSTAT, and noted statistician, will teach an introductory course on analytics using an innovative expert system he helped build.
Data Science essentials
Scalding – a Scala API for Cascading – is one of the most popular open source projects in the Hadoop ecosystem. Vitaly Gordon will lead a hands-on tutorial on how to use Scalding to put together effective data processing workflows. Data analysts have long lamented the amount of time they spend on data wrangling. But what if you had access to tools and best practices that would make data wrangling less tedious? That’s exactly the tutorial that distinguished Professors and Trifacta co-founders, Joe Hellerstein and Jeff Heer, are offering.
The co-founders of Datascope Analytics are offering a glimpse into how they help clients identify the appropriate problem or opportunity to focus on by using design thinking (see the recent Datascope/IDEO post on Design Thinking and Data Science). We’re also happy to reprise the popular (Strata Santa Clara 2013) d3.js tutorial by Scott Murray.
The inaugural Spark Summit will feature a wide variety of real-world applications
When an interesting piece of big data technology gets introduced, early adopters tend to focus on technical features and capabilities. Applications get built as companies develop confidence that it’s reliable and that it really scales to large data volumes. That seems to be where Spark is today. With over 90 contributors from 25 companies, it has one of the largest developer communities among big data projects (second only to Hadoop MapReduce).
I recently became an advisor to Databricks (a startup commercializing Spark) and a member of the program committee for the inaugural Spark Summit. As I pored over submissions to Spark’s first community gathering, I learned how companies have come to rely on Spark, Shark, and other components of the Berkeley Data Analytics Stack (BDAS). Spark is at that stage where companies are deploying it, and the upcoming Spark Summit in San Francisco will showcase many real-world applications. These applications cut across many domains including advertising, marketing, finance, and academic/scientific research, but can generally be grouped into the following categories:
Data processing workflows: ETL and Data Wrangling
Many companies rely on a wide variety of data sources for their analytic products. That means cleaning, transforming, and fusing (unstructured) external data with internal data sources. Many companies – particularly startups – use Spark for these types of data processing workflows. There are even companies that have created simple user interfaces that open up batch data processing tasks to non-programmers.
Tools for unlocking big data continue to get simpler
Here are a few observations based on conversations I had during the just concluded Strata NYC conference.
Interactive query analysis on Hadoop remains a hot area
A recent O’Reilly survey confirmed SQL is an important skill for data scientists. A year after the launch of Impala, quite a few attendees I spoke with remained interested in the progress of SQL-on-Hadoop solutions. A trio from Hortonworks gave an update on recent improvements and changes to Hive. In a sign that Impala is gaining traction, Greg Rahn’s talk on Practical Performance Tuning for Impala was one of the best attended sessions at the conference. Ditto for a sponsored session on Kognitio’s latest features.
Existing SQL-on-Hadoop solutions require that users define a schema – an additional step given that a lot of data is increasingly in key-value or JSON format. In his talk Hadapt co-founder Daniel Abadi highlighted a solution that lets users query complex data types (Hadapt reserializes complex data types to speed up joins). I expect other SQL-on-Hadoop solutions to also offer query support for complex data types in the near future.
Empowering business users
With its launch at the conference, ClearStory joins Platfora and Datameer in the business analytics space. Each company builds tools that let business users wade through large amounts of data, while emphasizing different areas. Platfora is for interactive visual analysis of massive data sets, while Datameer connects to many data sources (not just Hadoop), has started offering analytics, and can run on a laptop or cluster. Built primarily on the Berkeley stack (BDAS), ClearStory’s interesting platform encourages collaboration and simplifies data harmonization (fusing disparate data sources is a common bottleneck for business users). For organizations willing to tag and describe their data sets, Microsoft unveiled a tool that lets users query data using natural language (UK startup NeutrinoBI uses a similar “search interface”).
As companies continue to use crowdsourcing, demand for people who know how to manage projects remains steady
A little over four years ago, I attended the first Crowdsourcing meetup at the offices of Crowdflower (then called Dolores Labs). The crowdsourcing community has grown explosively since that initial gathering, and there are now conference tracks and conferences devoted to this important industry. At the recent CrowdConf, I found a community of professionals who specialize in managing a wide array of crowdsourcing projects.
Data scientists were early users of crowdsourcing services. I personally am most familiar with a common use case – the use of crowdsourcing to create labeled data sets for training machine-learning models. But as straightforward as it sounds, using crowdsourcing to generate training sets can be tricky – fortunately there are excellent papers and talks on this topic. At the most basic level, before embarking on a crowdsourcing project you should go through a simple checklist (among other things, make sure you have enough scale to justify engaging with a provider).
Beyond building training sets for machine-learning, more recently crowdsourcing is being used to enhance the results of machine-learning models: in active learning, humans take care of uncertain cases while models handle the routine ones. The use of ReCAPTCHA to digitize books is an example of this approach. On the flip side, analytics are being used to predict the outcome of crowd-based initiatives: researchers developed models to predict the success of Kickstarter campaigns 4 hours after their launch.
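The division of labor described above, where models handle routine cases and humans take the uncertain ones, can be sketched in a few lines of Python. This is only an illustrative sketch: the function name route_by_confidence, the 0.8 threshold, and the sample probabilities are assumptions made for the example, not part of any particular crowdsourcing service or library.

```python
def route_by_confidence(probabilities, threshold=0.8):
    """Split item indices into model-handled and human-review lists.

    `probabilities` holds one list of class probabilities per item, as
    produced by some classifier (hypothetical here). An item is routed
    to human reviewers when the model's top probability is below
    `threshold`.
    """
    auto, human = [], []
    for index, probs in enumerate(probabilities):
        if max(probs) >= threshold:
            auto.append(index)   # routine case: keep the model's label
        else:
            human.append(index)  # uncertain case: send to the crowd
    return auto, human


# Made-up predictions for three items and two classes.
predictions = [[0.95, 0.05], [0.55, 0.45], [0.10, 0.90]]
auto_items, human_items = route_by_confidence(predictions)
print(auto_items, human_items)
```

In practice the labels that come back from human reviewers would be added to the training set, so the threshold effectively trades labeling cost against model error.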
Solving problems with data necessitates a diversity of thought.
There’s a lot of hype around “Big Data” these days. Don’t believe us? None other than the venerable Harvard Business Review named “data scientist” the “Sexiest Job of the 21st Century” only 13 years into it. Seriously. Some of these accolades are deserved. It’s decidedly cheaper to store data now than it is to analyze it, which is considerably different than 10 or 20 years ago. Other aspects, however, are less deserved.
In isolation, big data and data scientists don’t hold some magic formula that’s going to save the world, radically transform businesses, or eliminate poverty. The act of solving problems is decidedly different than amassing a data set the size of 200 trillion Moby Dicks or setting a team of nerds loose on the data. Problem solving not only requires a high-level conceptual understanding of the challenge, but also a deep understanding of the nuances of a challenge, how those nuances affect businesses, governments, and societies, and—don’t forget—the creativity to address these challenges.
In our experience, solving problems with data necessitates a diversity of thought and an approach that balances number crunching with thoughtful design to solve targeted problems. Ironically, we don’t believe this means that it’s important to have an army of PhDs with deep knowledge on every topic under the sun.
Rather, we find it’s important to have multi-disciplinary teams of curious, thoughtful, and motivated learners with a broad range of interests who aren’t afraid to immerse themselves in a totally ambiguous topic. With this common vision, IDEO and Datascope Analytics decided to embark on an experiment and integrate our teams to collaborate on a few big data projects over the last year. We thought we’d share a few things here we’ve learned along the way.
Behind the scenes with Datascope Analytics.
During a trip to Chicago for a conference on R, I had a chance to cowork at the Datascope Analytics (DsA) office. While I had worked with co-founders Mike and Dean before, this was my first time coworking at their office. It was an eye-opening experience. Why? The culture. I saw how this team of data scientists with different backgrounds connected with each other as they worked, collaborated, and joked around. I also observed how intensely present everyone was…whether they were joking or working. I completely understand how much work and commitment it takes to facilitate such a creative and collaborative environment.
Over the next few months, this initial coworking experience led to many conversations with Dean and Mike about building data science teams, Strata, design, and data both in Chicago and the SF Bay Area. I also got to know a few of the other team members such as Aaron, Bo, Gabe, and Irmak. Admittedly, the more I got to know the team, the more intensely curious I became about the human-centered design “ideation” workshops that they hold for clients. According to Aaron, the workshops “combine elements from human-centered design to diverge and converge on valuable and viable ideas, solutions, strategies for our clients. We start by creating an environment that spurs creativity and encourages wild ideas. After developing many different ideas, we cull them down and focus on the ones that are viable to add life and meaning.”
Deep Neural Nets excel at perception tasks. What’s changed since the 1980s? Access to more data and faster computation tools
This past week I had the good fortune of attending two great talks on Deep Learning, given by Googlers Ilya Sutskever and Jeff Dean. Much of the excitement surrounding Deep Learning stems from impressive results in a variety of perception tasks, including speech recognition (Google voice search) and visual object recognition (G+ image search).
Data scientists seek to generate information and patterns from raw data. In practice this usually means learning a complicated function for handling a specified task (classify, cluster, predict, etc.). One approach to machine learning mimics how the brain works: starting with basic building blocks (neurons), it approximates complex functions by finding optimal arrangements of neurons (artificial neural networks).
One of the most cited papers in the field showed that any continuous function can be approximated, to arbitrary precision, by a neural network with a single hidden layer. This led some to think that neural networks with single hidden layers would do well on most machine-learning tasks. However, this universal approximation property came at a steep cost: the requisite (single hidden layer) neural networks were exponentially inefficient to construct (you needed a neuron for every possible input). For a while neural networks took a backseat to more efficient and scalable techniques like SVM and Random Forest.
At the most basic level, stream mining is about generating summaries that can be used to answer fundamental questions
A series of open source, distributed stream processing frameworks have become essential components in many big data technology stacks. Apache Storm remains the most popular, but promising new tools like Spark Streaming and Apache Samza are going to have their share of users. These tools excel at data processing and are also used for data mining – in many cases users have to write a bit of code to do stream mining. The good news is that easy-to-use stream mining libraries will likely emerge in the near future.
High volume data streams (data that arrive continuously) arise in many settings, including IT operations, sensors, and social media. What can one learn by looking at data one piece (or a few pieces) at a time? Can techniques that look at smaller representations of data streams be used to unlock their value? In this post, I’ll briefly summarize a recent overview given by stream mining pioneer Graham Cormode.
Massive amounts of data arriving at high velocity pose a challenge to data miners. At the most basic level, stream mining is about generating summaries that can be used to answer fundamental questions:
Properly constructed summaries are useful for highlighting emerging patterns, trends, and anomalies. Common summaries (frequency moments in stream mining parlance) include a list of distinct items, recently trending items, heavy hitters (items that have appeared frequently), and the top k (most popular) items.
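To make the idea of a summary concrete, here is a minimal Python sketch of one well-known heavy-hitters summary, the Misra-Gries algorithm. It is offered as an illustration of the general technique, not as the specific method covered in the overview; the function name misra_gries and the toy stream are assumptions made for the example. The summary keeps at most k - 1 counters no matter how long the stream is, and any item that occurs more than n/k times in a stream of length n is guaranteed to be among the surviving counters (the stored counts are underestimates).

```python
def misra_gries(stream, k):
    """Return candidate heavy hitters using at most k - 1 counters.

    Any item occurring more than len(stream) / k times is guaranteed
    to be a key in the returned dict; stored counts are lower bounds
    on the true frequencies.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement every counter and drop the
            # ones that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


# Toy stream: "a" makes up half of the 12 items.
events = ["a"] * 6 + ["b"] * 3 + ["c", "d", "e"]
summary = misra_gries(events, k=3)
print(summary)
```

Because the counts are only lower bounds, a second pass over the data (when one is possible) is the usual way to confirm which candidates are true heavy hitters.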
One of the chapters of Think Bayes is based on a class project two of my students worked on last semester. It presents “The Red Line Problem,” which is the problem of predicting the time until the next train arrives, based on the number of passengers on the platform.
Here’s the introduction:
In Boston, the Red Line is a subway that runs between Cambridge and Boston. When I was working in Cambridge I took the Red Line from Kendall Square to South Station and caught the commuter rail to Needham. During rush hour Red Line trains run every 7–8 minutes, on average.
When I arrived at the station, I could estimate the time until the next train based on the number of passengers on the platform. If there were only a few people, I inferred that I just missed a train and expected to wait about 7 minutes. If there were more passengers, I expected the train to arrive sooner. But if there were a large number of passengers, I suspected that trains were not running on schedule, so I would go back to the street level and get a taxi.
While I was waiting for trains, I thought about how Bayesian estimation could help predict my wait time and decide when I should give up and take a taxi. This chapter presents the analysis I came up with.
Sadly, this problem has been overtaken by history: the Red Line now provides real-time estimates for the arrival of the next train. But I think the analysis is interesting, and still applies for subway systems that don’t provide estimates.