Why the term "data science" is flawed but useful

Mention “data science” to a lot of the high-profile people you might think practice it and you’re likely to see rolling eyes and shaking heads. It has taken me a while, but I’ve learned to love the term, despite my doubts. The key reason is that the rest of the world understands roughly what I mean when I use it. After years of stumbling through long-winded explanations about what I do, I can now say “I’m a data scientist” and move on. It is still an incredibly hazy definition, but my former descriptions left people confused as well, so this approach is no worse and at least saves time.

With that in mind, here are the arguments I’ve heard against the term, and why I don’t think they should stop its adoption.

It’s not a real science

I just finished reading “The Philosophical Breakfast Club,” the story of four Victorian friends who created the modern structure of science and coined the word “scientist.” I grew up with the idea that physics, chemistry and biology were the only real sciences and every other subject using the term was just stealing their clothes (“Anything that needs science in the name is not a real science”). The book shows that from the beginning the label was never restricted to just the hard experimental sciences. It was chosen to promote a disciplined approach to reasoning that relied on data rather than the poorly supported logical deductions many contemporaries favored. Data science fits comfortably in this more open tradition.

It’s an unnecessary label

To me, it’s obvious that there has been a massive change in the landscape over the last few years. Data and the tools to process it are suddenly abundant and cheap. Thousands of people are exploiting this change, making things that would have been impossible or impractical before now, using a whole new set of techniques. We need a term to describe this movement, so we can create job ads, conferences, training and books that reach the right people. Those goals might sound very mundane, but without an agreed-upon term we just can’t communicate.

The name doesn’t even make sense

As a friend said, “show me a science that doesn’t involve data.” I hate the name myself, but I also know it could be a lot worse. Just look at other fields that suffer under terms like “new archaeology” (now more than 50 years old) or “modernist art” (pushing a century). I learned from teenage bands that the naming process is the most divisive part of any new venture, so my philosophy has always been to take the name you’re given, and rely on time and hard work to give it the right associations. Apple and Microsoft (née Micro-soft) are terrible startup names by any objective measure, but they’ve earned their mindshare. People are calling what we’re doing “data science,” so let’s accept that and focus on moving the subject forward.

There’s no definition

This is probably the deepest objection, and the one with the most teeth. There is no widely accepted boundary for what’s inside and outside of data science’s scope. Is it just a faddish rebranding of statistics? I don’t think so, but I also don’t have a full definition. I believe that the recent abundance of data has sparked something new in the world, and when I look around I see people with shared characteristics who don’t fit into traditional categories. These people tend to work beyond the narrow specialties that dominate the corporate and institutional world, handling everything from finding the data and processing it at scale to visualizing it and writing it up as a story. They also seem to start by looking at what the data can tell them, and then pick interesting threads to follow, rather than taking the traditional scientist’s approach of choosing the problem first and then finding data to shed light on it. I don’t know what the eventual consensus will be on the limits of data science, but we’re starting to see some outlines emerge.

Time for the community to rally

I’m betting a lot on the persistence of the term. If I’m wrong, the Data Science Toolkit will end up sounding as dated as “surfing the information super-highway.” I think data science, as a phrase, is here to stay, though, whether we like it or not. That means we as a community can either step up and steer its future, or let others exploit its current name recognition and dilute it beyond usefulness. If we don’t rally around a workable definition to replace the current vagueness, we’ll have lost a powerful tool for explaining our work.
