Why the term "data science" is flawed but useful

Counterpoints to four common data science criticisms.

Mention “data science” to a lot of the high-profile people you might think practice it and you’re likely to see rolling eyes and shaking heads. It has taken me a while, but I’ve learned to love the term, despite my doubts. The key reason is that the rest of the world understands roughly what I mean when I use it. After years of stumbling through long-winded explanations about what I do, I can now say “I’m a data scientist” and move on. It is still an incredibly hazy definition, but my former descriptions left people confused as well, so this approach is no worse and at least saves time.

With that in mind, here are the arguments I’ve heard against the term, and why I don’t think they should stop its adoption.

It’s not a real science

I just finished reading “The Philosophical Breakfast Club,” the story of four Victorian friends who created the modern structure of science, as well as inventing the word “scientist.” I grew up with the idea that physics, chemistry and biology were the only real sciences and every other subject using the term was just stealing their clothes (“Anything that needs science in the name is not a real science”). The book shows that from the beginning the label was never restricted to just the hard experimental sciences. It was chosen to promote a disciplined approach to reasoning that relied on data rather than the poorly-supported logical deductions many contemporaries favored. Data science fits comfortably in this more open tradition.

It’s an unnecessary label

To me, it’s obvious that there has been a massive change in the landscape over the last few years. Data and the tools to process it are suddenly abundant and cheap. Thousands of people are exploiting this change, making things that would have been impossible or impractical before now, using a whole new set of techniques. We need a term to describe this movement, so we can create job ads, conferences, training and books that reach the right people. Those goals might sound very mundane, but without an agreed-upon term we just can’t communicate.

The name doesn’t even make sense

As a friend said, “show me a science that doesn’t involve data.” I hate the name myself, but I also know it could be a lot worse. Just look at other fields that suffer under terms like “new archaeology” (now more than 50 years old) or “modernist art” (pushing a century). I learned from teenage bands that the naming process is the most divisive part of any new venture, so my philosophy has always been to take the name you’re given, and rely on time and hard work to give it the right associations. Apple and Microsoft (née Micro-soft) are terrible startup names by any objective measure, but they’ve earned their mindshare. People are calling what we’re doing “data science,” so let’s accept that and focus on moving the subject forward.

There’s no definition

This is probably the deepest objection, and the one with the most teeth. There is no widely accepted boundary for what’s inside and outside of data science’s scope. Is it just a faddish rebranding of statistics? I don’t think so, but I also don’t have a full definition. I believe that the recent abundance of data has sparked something new in the world, and when I look around I see people with shared characteristics who don’t fit into traditional categories. These people tend to work beyond the narrow specialties that dominate the corporate and institutional world, handling everything from finding the data, processing it at scale, visualizing it and writing it up as a story. They also seem to start by looking at what the data can tell them, and then picking interesting threads to follow, rather than the traditional scientist’s approach of choosing the problem first and then finding data to shed light on it. I don’t know what the eventual consensus will be on the limits of data science, but we’re starting to see some outlines emerge.

Time for the community to rally

I’m betting a lot on the persistence of the term. If I’m wrong, the Data Science Toolkit will end up sounding as dated as “surfing the information super-highway.” I think data science, as a phrase, is here to stay, though, whether we like it or not. That means we as a community can either step up and steer its future, or let others exploit its current name recognition and dilute it beyond usefulness. If we don’t rally around a workable definition to replace the current vagueness, we’ll have lost a powerful tool for explaining our work.



  • Alex Tolley

    I think your responses to objections are just hand-waving and generalities. What you might find is that “data science” has as much to do with science as “domestic science” (aka cooking).

    In the paragraph about the lack of definition, the description of the process could perhaps be characterized as “art”.

  • Hey, it gives a name to a group of specialist skills that were previously hard to define to those outside the field. Like the NoSQL term, which is constantly under debate, I wouldn’t stress too much about the detail, but it is useful for broad classification.

  • Alex, I’m happy to dig in deeper, but I’m not sure which arguments you’re disagreeing with. Maybe a starting point would be asking whether you consider economics to be a science, then going along the spectrum to sociology, or computer science?

    Tony, I think you’re right. Ultimately new labels are useful when they describe something real which doesn’t have an existing name, and data science qualifies.

  • Come on. Near as I can tell, it’s just another money spinner meme for Mr. O’Reilly. Web 2.0 redux. I’d rather he spend the money using real Rep-Kover, now that he appears to have deep sixed the faux version which came back from the last killing.

    Not to say there aren’t legitimate “data” based professions: math stats, operations research, relational database development, and the like. I expect to see data science ruined by procedural coders, just like everything else.

  • Oh, and a bit of parallel history. In the late ’60s, there were more guys who wanted to “do computers” than could survive an EE curriculum. The EE was how you got to work with computers. So, comp sci was invented for those not quite smart enough for EE. I see much the same with data sci: can’t cut math stats or OR??? Become a data scientist, and never have to crack Fisher or Wagner, or understand the central limit theorem. And so on. You get the picture.

  • MasterG

    @Robert Young I think a foundation in a quantitative discipline, along with information processing, is essential for data science.

    Also, I didn’t realize that comp sci is supposed to be a watered down version of electrical engineering. It seems that the emphasis is different in each and they both share common foundational courses in their respective curricula: calculus, linear algebra, discrete mathematics, and probability, to name a few. How is data structures and algorithms watered down?

  • I agree with Robert Young. I’ll join the chorus of the dissenters and say that Data Science is what people who don’t know Statistics call Statistics. Most “Data Scientists” state that Data Science lies at the intersection of domain knowledge, data management technology, and Statistics/Machine Learning. In my experience, most of them have a narrow and superficial knowledge of the third, a broad but not deep knowledge of the second, and a very hands-on knowledge of the third. Otherwise stated, Data Scientists are focused on ad hoc tools, not methods. They do get excited for Hadoop and Pig and Cascalog, but if you ask them to explain Linear Regression, they will not go beyond a freshman-level explanation. They vaguely understand Boosting, but don’t know its deep connection to Blackwell’s approachability theorem. Does this knowledge matter? I would say yes, a lot. In 2000 David Donoho called the XXI Century “The Century of Data”, and by this he meant creating *methods* for dealing with large high-dimensional data sets (N.B.: I suspect Donoho would call himself a “Statistician”). What are scarce nowadays are methods, not low-level technologies, and those who innovate and command the methods are called Statisticians (or, for an accident of history and departmental politics, Machine Learners).

    Conversely, you now have “data scientists” content to produce count data on 10Tb of data or a big social network graph. That’s like writing a SQL query at the TJ Watson labs in 1981: looks cool at the time, but it gets commodified very quickly.

    I agree that labels are social constructs. Statistics, for starters, is horrible. And so is Computer Science. However, a lot of dead and alive people we greatly respect called themselves Statisticians, and they defined the field and gave it credibility. I have some reservations about dumping the field of Fisher, Tukey, Efron and Stein in exchange for a frothy term coined by a bunch of hackers for marketing reasons.

  • When I select the URL http://www.datasciencetoolkit.org/ I just get a blank screen. This happens with both Firefox and IE.

    There is not even a title. I can see old cached views of your website via a Google search.

    Any ideas?

  • Sorry about that, there was a hiccup with the load balancer. It should be working again now.