Data science democratized

With new tools arriving, data science may soon be in the hands of non-programmers.

I am not a data scientist. Nor am I a programmer. I’ve got an inclination toward technology, but my core skill set very much resides in the humanities domain.

I offer this biographical sketch up front because I think I have a lot in common with the people who work around and near tech spaces: academics, business users, entertainment professionals, editors, writers, producers, etc. The interesting thing about data science — and the reason why I’m glad Mike Loukides wrote “What is data science?” — is that vast stores of data have relevance to all sorts of folks, including people like me who lack a pure technical pedigree.

Data science’s democratizing moment will come when its associated tools can be picked up by tech-savvy non-programmers. I’m thinking of the HTML coders and the Excel power users: the people who aren’t full-fledged mechanics but are skilled enough to pop the hood and change their own oil.

I’m encouraged because that democratizing moment is close. I saw a demo recently that connects a web-based spreadsheet with huge data stores and cloud infrastructure. This type of system — and I’m sure there are many others in the pipeline — takes a process that once had immense technical and financial barriers and makes it almost as easy as phpMyAdmin. That’s an important step. Within a year or two, I expect to see further usability improvements in these tools. A data science dashboard that mimics Google Analytics can’t be far off.

During that demo, Datameer CTO Stefan Groschupf told me about a fun Twitter inquiry he instigated. Groschupf had previously gathered around 45 million tweets and fed them into EC2. Later, over the course of a two-beer evening, Groschupf poked at that data to see if any interesting patterns turned up when comparing two vastly different hashtags (#justinbieber vs. #teaparty). He used his company’s system to parse the data, then he fed results through a free visualization tool.

Here’s the #justinbieber cluster:

Justin Bieber hashtag visualization

And here’s the #teaparty cluster:

Teaparty hashtag visualization

As you can see, the #teaparty folks are far more connected than their distant #justinbieber cousins. That’s interesting, but not really surprising. The political world has more connective tissue than of-the-moment entertainment.
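For readers who want to put a number on "more connected," one common measure is graph density: the share of all possible pairs of users that are actually linked. The sketch below is only an illustration. It is not Datameer's method, the demo did not specify what the links in the visualizations represent, and the edge lists, user names, and the `density` helper are all made up for the example.

```python
def density(edges):
    """Fraction of possible user pairs that are actually linked (0.0 to 1.0)."""
    # Treat each link as undirected; drop self-links and duplicate pairs.
    links = {frozenset(pair) for pair in edges if pair[0] != pair[1]}
    # Only users who appear in at least one link are counted.
    users = {user for link in links for user in link}
    n = len(users)
    if n < 2:
        return 0.0
    possible_pairs = n * (n - 1) / 2
    return len(links) / possible_pairs


# Hypothetical toy data, NOT drawn from the tweets described above.
hashtag_a = [("ann", "bob"), ("ann", "cat"), ("bob", "cat"), ("cat", "dan")]
hashtag_b = [("eve", "fay"), ("gus", "hal")]

print(f"hashtag_a density: {density(hashtag_a):.2f}")
print(f"hashtag_b density: {density(hashtag_b):.2f}")
```

A higher density means a more tightly knit group. Density is only one possible reading of "connected," and the answer depends on what counts as a link (follows, retweets, mentions, or something else).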

But that specific conclusion isn’t what’s important here. Even if your end-point is inevitable, a data-driven conversation has more power and resonance than an anecdotal observation. Groschupf didn’t tell me the Tea Party movement is more connected. He showed me.

Significant implications emerge when you can bounce a question, even an innocuous one, against a huge storehouse of data. If someone like me can plug questions into a system and have it do the same kind of processing once reserved for a skilled minority, that will inspire me to ask a lot more questions. It’ll inspire a lot of other people to ask questions, too. And some of those questions might even be important.

That’s a big deal. People like me may never become full-fledged data scientists, but having access to easy-to-use data tools will get people thinking and exploring in all sorts of domains.




  • Alex Tolley

    “If someone like me can plug questions into a system and have it do the same kind of processing once reserved for a skilled minority, that will inspire me to ask a lot more questions.”

    I think it is bigger than that. Asking questions from data is analogous to viewing the world through the lens of science, rather than received wisdom from authority. That has the effect of changing how you think about the world – a qualitative change. Of course the questions need to be good (but we can share those) and the data and analysis need to be good too (and we can share that).

    BTW, whilst the justinbieber vs teaparty analysis may seem obvious to you, it isn’t to me. I’d like to see more examples to demonstrate the case that these situations are indeed different, which might lead to a classification of social networks, which in turn might lead to different questions about social systems.

  • Nailles McCarthy

    This is interesting. What exactly is this connectivity that it’s measuring?

    Although it seems to me this would only be measuring one kind of connectivity: the connectivity as it relates to Twitter and to a particular set of attributes within that domain (for instance, connectivity as it relates to those who post links, or those who post friends). BieberBuddies might be more connected than the partiers using qualifiers not made available by Twitter.

  • K. Black

    I support the movement for open data. However, I strongly disagree with the premise that a “data driven” conversation is a good thing. Good science starts with the construction of an hypothesis, constructing a well designed, responsible test of the hypothesis, followed by a premeditated, responsible analysis.

    The idea that people will be able to make responsible use of data that is easily gathered without a fundamental understanding of the limitations of sampling, statistical analysis, and graphical representation is plain wrong. This is essentially a large-scale data-mining exercise with no regard to the issues of type-I and type-II errors. Data can and occasionally does lie. The emergence of these kinds of tools highlights the intense need for a mathematically literate society.

  • Alex Tolley


    I agree with your basic comment. In addition, this is almost made worse by what I would call the fetishizing of “big data”. This big data meme was certainly popularized in a Wired article last year and really showed little understanding about what science is. Even scientists I have worked with show little appreciation for statistics and think that you can rely on data-mined results for interesting relationships when using the 5% probability bounds as significant.

    However, I would caution against throwing out the baby with the bath water. Some professions already use public data, e.g. economists. Given their track record it could hardly be much worse if more people could easily access government data and analyze it.

    Even when data is not of the best quality, you can still use it to try to put some boundaries around your thinking. For example, most people in the US think foreign aid spending as a % GDP is about 10-100x larger than it is. Even a cursory inspection of the data would rule large numbers out of a conversation and undermine the argument for cutting foreign aid and redirecting it to domestic spending.

  • Joe McCarthy

    I agree that the growing availability of data – and tools for querying and/or visualizing the data – empowers non-experts to ask and answer interesting questions.

    However, Nailles’ question about the meaning of the two graphs above also reveals the perils of broader accessibility: visualizations can be powerfully persuasive tools that may conceal important aspects of the underlying data (such as what the data specifically represents, e.g., patterns of follower links vs. retweets).

    It would be interesting to know what the Tea Party graph above represents, e.g., whether it reflects follower links or retweets (or something else). A recent New York Times/CBS News Poll: National Survey of Tea Party Supporters showed that 45% of Tea Party supporters say they are more likely to trust information received from other supporters of the Tea Party movement than from television or newspapers … suggesting that retweeting among the like-minded may play an important role in disseminating information (and/or misinformation).

  • Deepak

    Good science builds hypotheses based on observation. When you have a lot of data, data mining becomes the set of observations that lead to a hypothesis which you then have to create an experiment around. That is the mistake Chris Anderson made in his Wired article (forgetting that you still need testable hypotheses). The part he got right was that there is a lot of value in data mining.

    Atul Butte’s group at Stanford has done some interesting work around using existing data for some pretty good science.

  • alkyseltzer

    A two-beer evening? Wow – those guys really know how to party!

  • Alan Dulles

    The accuracy of the data set is not relevant, and the quality of the hypothesis is of little consequence. In our current world of psychically driven digital images, all that matters is the believability of the claims made via the result presented. Unfortunately, I fear that this will simply become a tool to enhance credibility, sellability and believability, whether it is merited or NOT.

  • Atul Butte

    Thanks for the plug in the comments.

    By absolute coincidence, I just wrote an article titled “Democratizing Integrative Biology” [PDF link here: ]

    IMHO, data is approaching commoditization, as are the computational tools, and the approaches for validation of generated hypotheses and findings. The real value remains in “turning the crank” quickly and for the right questions.

    — Atul