"data scientists" entries
Rajiv Maheswaran talks about the tools and techniques required to analyze new kinds of sports data.
Many data scientists are comfortable working with structured operational data and unstructured text. Newer techniques like deep learning have opened up data types like images, video, and audio.
Other common data sources are garnering attention. With the rise of mobile phones equipped with GPS, I’m meeting many more data scientists at start-ups and large companies who specialize in spatio-temporal pattern recognition. Analyzing “moving dots” requires specialized tools and techniques. A few months ago, I sat down with Rajiv Maheswaran founder and CEO of Second Spectrum, a company that applies analytics to sports tracking data. Maheswaran talked about this new kind of data and the challenge of finding patterns:
“It’s interesting because it’s a new type of data problem. Everybody knows that big data machine learning has done a lot of stuff in structured data, in photos, in translation for language, but moving dots is a very new kind of data where you haven’t figured out the right feature set to be able to find patterns from. There’s no language of moving dots, at least not that computers understand. People understand it very well, but there’s no computational language of moving dots that are interacting. We wanted to build that up, mostly because data about moving dots is very, very new. It’s only in the last five years, between phones and GPS and new tracking technologies, that moving data has actually emerged.”
Schemas inevitably will change — Apache Avro offers an elegant solution.
When a team first starts to consider using Hadoop for data storage and processing, one of the first questions that comes up is: which file format should we use?
This is a reasonable question. HDFS, Hadoop’s data storage, is different from relational databases in that it does not impose any data format or schema. You can write any type of file to HDFS, and it’s up to you to process it later.
The usual first choice of file formats is either comma delimited text files, since these are easy to dump from many databases, or JSON format, often used for event data or data arriving from a REST API.
There are many benefits to this approach — text files are readable by humans and therefore easy to debug and troubleshoot. In addition, it is very easy to generate them from existing data sources and all applications in the Hadoop ecosystem will be able to process them. Read more…
What business leaders need to know about data and data analysis to drive their businesses forward.
Foster and Tom have a long history of applying data to practical business problems. Their book, which evolved into Data Science for Business, was different from all the other data science books I’ve seen. It wasn’t about tools: Hadoop and R are scarcely mentioned, if at all. It wasn’t about coding: business students don’t need to learn how to implement machine learning algorithms in Python. It is about business: specifically, it’s about the data analytic thinking that business people need to work with data effectively.
Data analytic thinking means knowing what questions to ask, how to ask those questions, and whether the answers you get make sense. Business leaders don’t (and shouldn’t) do the data analysis themselves. But in this data-driven age, it’s critically important for business leaders to understand how to work with the data scientists on their teams. Read more…
Training Aspiring Data Scientists in Chicago
As technology penetrates further into everyday life, we’re creating lots of data. Businesses are scrambling to find data scientists to make sense of all this data and turn it into better decisions.
Businesses aren’t alone. Data science could transform how governments and nonprofits tackle society’s problems. The problem is, most governments and nonprofits simply don’t know what’s possible yet. There are too few data scientists out there and too many spending their days optimizing ads instead of bettering lives. To make real impact with data, we need to work on high-impact projects that show these organizations the power of analytics. And we need to expose data scientists to the problems that really matter.
That’s exactly why we’re doing the Eric and Wendy Schmidt Data Science for Social Good summer fellowship at the University of Chicago. The program is led by Rayid Ghani, former chief data scientist for the 2012 Obama campaign, and is funded by Google Chairman Eric Schmidt.
We’ve brought three dozen aspiring data scientists from all over the world to Chicago to spend a summer working on data science projects with social impact. The fellows are working closely with governments and nonprofits (including the City of Chicago, the Chicago Transit Authority, and the Nurse-Family Partnership) to take on real-world problems in education, health, energy, transportation, and more. (To read up on our project, check out dssg.io/projects or to get involved, go to github.com/dssg.)
Data scientists are a hybrid group with computer science, statistics, machine learning, data mining, and database skills. These skills take years to learn and there’s no way to teach all of them during a few weeks. Instead of starting from scratch, we decided to start with students in computational and quantitative fields – folks that already have some of these skills and use them daily in an academic setting. And we gave them the opportunity to apply their abilities to solve real-world problems and to pick up the skills they’re missing.
Our tools should make common cases easy and safe, but that's not the reality today.
Recently, the Mathbabe (aka Cathy O’Neil) vented some frustration about the pitfalls in applying even simple machine learning (ML) methods like k-nearest neighbors. As data science is democratized, she worries that naive practitioners will shoot themselves in the foot because these tools can offer very misleading results. Maybe data science is best left to the pros? Mike Loukides picked up this thread, calling for healthy skepticism in our approach to data and implicitly cautioning against a “cargo cult” approach in which data collection and analysis methods are blindly copied from previous efforts without sufficient attempts to understand their potential biases and shortcomings.
Well, arguing against greater understanding of the methods we apply is like arguing against motherhood and apple pie, and Cathy and Mike are spot on in their diagnoses of the current situation. And yet …
There is so much value to be gained if we can put the power of learning, inference, and prediction methods into the hands of more developers and domain experts. But how can we avoid the pitfalls that Cathy and Mike are rightly concerned about? If a seemingly simple method like k-nearest neighbors classification is dangerous in unskilled hands (and it certainly is), then what hope is there? Well, I would argue that all ML methods are not created equal with regard to their safety. In fact, it is exactly some of the simplest (and most widely used) methods that are the most dangerous.
Why? Because these methods have lots of hidden assumptions. Well, maybe the assumptions aren’t so much hidden as nodded-at-but-rarely-questioned. A good analogy might be jumping to the sentencing phase of a criminal trial without first assessing guilt: asking “What is the punishment that best fits this crime?” before asking “Did the defendant actually commit a crime? And if so, which one?” As another example of a simple-yet-dangerous method, k-means clustering assumes a value for k, the number of clusters, even though there may not be a “good” way to divide the data into this many buckets. Maybe seven buckets provides a much more natural explanation than four. Or maybe the data, as observed, is truly undifferentiated and any effort to split it up will result in arbitrary and misleading distinctions. Shouldn’t our methods ask these more fundamental questions as well? Read more…
Arguments are the glue that connects data to decisions
Data is key to decision making. Yet we are rarely faced with a situation where things can be put in to such a clear logical form that we have no choice but to accept the force of evidence before us. In practice, we should always be weighing alternatives, looking for missed possibilities, and considering what else we need to figure out before we can proceed.
Arguments are the glue that connects data to decisions. And if we want good decisions to prevail, both as decision makers and as data scientists, we need to better understand how arguments function. We need to understand the best ways that arguments and data interact. The statistical tools we learn in classrooms are not sufficient alone to deal with the messiness of practical decision-making.
Examples of this fill the headlines. You can see evidence of rigid decision-making in how the American medical establishment decides what constitutes a valid study result. By custom and regulation, there is an official statistical breaking point for all studies. Below this point, a result will be acted upon. Above, it won’t be. Cut and dry, but dangerously brittle.
My obsession with data and user needs is now focused on the many paths toward data science.
Over Thanksgiving, Richie and Violet asked me if I preferred the iPhone or the Galaxy SIII. I have both. It is a long story. My response was, “It depends.” Richie, who would probably bleed Apple if you cut him, was very unsatisfied with my answer. Violet was more diplomatic. Yet, it does depend. It depends on what the user wants to use the device for.
I say, “It depends” a lot in my life.
Both in the personal life and the work life … well, because it really is all one life isn’t it? With my work over the past decade or so, I have been obsessive about being user-focused. I spend a lot of time thinking about whom a product, feature, or service is for and how they will use it. Not how I want them to use it — how they want to use it and what problem they are trying to solve with it.
Before I joined O’Reilly, I was obsessively focused on the audience for my data analysis. “C” level execs look for different kinds of insights than a director of engineering. A field sales rep looks for different insights than a software developer. Understanding more about who the user or audience was for a data project enabled me to map the insights to the user’s role, their priorities, and how they wanted to use the data. Because, you know what isn’t too great? When you spend a significant amount of time working on something that does not get used or is not what someone needed to help them in their job.
Big data is shaping diverse fields, showing that past predictions from data-driven natural sciences are now coming to pass.
I find myself having conversations recently with people from increasingly diverse fields, both at Columbia and in local startups, about how their work is becoming “data-informed” or “data-driven,” and about the challenges posed by applied computational statistics or big data.
A view from health and biology in the 1990s
In discussions with, as examples, New York City journalists, physicists, or even former students now working in advertising or social media analytics, I’ve been struck by how many of the technical challenges and lessons learned are reminiscent of those faced in the health and biology communities over the last 15 years, when these fields experienced their own data-driven revolutions and wrestled with many of the problems now faced by people in other fields of research or industry.
It was around then, as I was working on my PhD thesis, that sequencing technologies became sufficient to reveal the entire genomes of simple organisms and, not long thereafter, the first draft of the human genome. This advance in sequencing technologies made possible the “high throughput” quantification of, for example,
- the dynamic activity of all the genes in an organism; or
- the set of all protein-protein interactions in an organism; or even
- statistical comparative genomics revealing how small differences in genotype correlate with disease or other phenotypes.
These advances required formation of multidisciplinary collaborations, multi-departmental initiatives, advances in technologies for dealing with massive datasets, and advances in statistical and mathematical methods for making sense of copious natural data. Read more…