On the importance of imagination in data science

Amy Heineike

According to Amy Heineike, the Director of Mathematics at Quid, there’s nothing like having a fresh dataset in R and knowing how to use it. “You can add a few lines of code and discover all kinds of interesting information,” Heineike says. “One question leads to another, you get into a flow, and you can have an amazing exploration.”

Heineike started working with data several years ago at a consultancy in London, where “playing around” with data shed light on the impact of social networks on government policies. Part of her job was figuring out what types of data to use in order to find solutions to crucial problems, from public transportation to obesity. Her day-to-day work at Quid entails working with new data sets, prototyping analytics, and collaborating with an engineering team to improve data analysis and bring products into production.

At Strata Santa Clara, she spoke with me about the importance of imagination in data science, using visualizations as a tool, and how data teams can work better together.

Can you talk a bit about how the team at Quid uses maps and visualizations to explore data?

Amy Heineike: Because we are living so much of our lives online, more and more of our collective conversations are happening through blog posts, social media, news articles, web pages, or government filings which end up online. This includes lots of really messy, unstructured, interesting, rich material.

A lot of the tools that are commonly available to systematically evaluate content online make the process painful and difficult. Our challenge is to make visual maps of the data that you would otherwise have to consume by reading every single piece of it. Our maps aren’t geographic or spatial, they’re topical. It’s not latitude and longitude that you point to on the maps, it’s an idea.

It’s well-known that math is a crucial competence in the data science field. What other attributes do you think data scientists need to be effective?

Amy Heineike: I think it’s important that people in this kind of role care a lot about what they are building. They should have the imagination to think about the person who might use what they are building and what they might want to achieve.

In that sense, data scientists are a bit like product managers. Product managers figure out what features should go into a website or software tool. Data science is similar to that, but when you’re thinking analytically, the question is, “Can I really get data that can build this?” and “Does the data support whatever it is I want to build?” Being able to see how things fit together is really important.

It’s also the case that data is almost inevitably messy and hard to work with. And so learning how to look at data and understand the shape of it is important. Are there weird artifacts in the data? Or issues that you need to clean up? Are there strange things in the data that actually turn out to be the most interesting things?

I think the real magic comes from being able to realize that a product that you want to make, something that you want to build, could actually be made from either data that you have lying around, or data that you can fetch.

What tools do you use?

Amy Heineike: At Quid, we built a whole stack that starts off by pulling in big data sets that are going to be really important for answering big questions. The news, information about startup companies, basically anything we can grab from online and process. So we have a platform for sucking that in, and that’s using several different tools and making use of different APIs.

We then have a platform for storing this data and indexing it, so we make a lot of use of Elasticsearch internally at this point, to be able to access all the data.

Then we have the analytics engine and visualization tools. There are actually a lot of bits to that. We use Python extensively and we’ve been playing around with a couple of different technologies on the visualization side. I used to use R extensively, but not so much anymore, which makes me sad because it’s fun!

What capabilities are missing from the tools that you use? Are there instances where the tools that are available to you fall short of what you need them to do?

Amy Heineike: Even with tools that are relatively straightforward like R and Python, there is a pretty steep learning curve before you arrive at what’s possible. What this means is that you could specialize in using the tools, but then you don’t have much time to spend with the people who are using what you built. Or you spend a lot of time with people who are using what you built, and you don’t have enough time to master the tools. So, I think that’s one challenge.

At Quid, one of the reasons we like the idea of mapping and putting data in a format where people can come and explore it is that they don’t have to touch Python, they don’t have to worry about where the data came from, and they don’t have to clean it up. People are able to just participate and to ask a lot of questions.

This interview was edited and condensed.
