We make the software, you make the robots


Superpixels example from Andreas Mueller’s thesis paper (PDF), used with permission.

A few weeks ago, I had the pleasure of sitting down (virtually, over Skype) with Andreas Mueller, core developer and maintainer of the popular scikit-learn machine learning library. We had previously bonded over our shared goals of making useful machine learning software, so I jumped at the chance to interview him.

Mueller wears many hats at work. He is one of the key maintainers of the popular Python machine learning library scikit-learn. Holding a doctorate in computer vision from the University of Bonn in Germany, he currently works on open science at New York University’s Center for Data Science. He speaks at conferences around the world and has a fanbase of 5,000+ followers on Twitter and about as many reputation points on Stack Overflow. In other words, this man has got mad street cred. He started out doing pure math in academia, and has now achieved software developer cult idol status.

In person, he is unassuming, quiet of demeanor, and always wears a shy smile. He becomes animated when the conversation turns to software, and his eyes light up when someone mentions unit testing. This is pretty much how we met. Last December, he gave a talk about scikit-learn’s engineering principles at the Software Engineering for Machine Learning workshop at the International Conference on Neural Information Processing Systems (NIPS) in Montréal, Canada. The opening slide of his talk said:

Goal:
High quality, easy to use machine learning library.
Keep it usable, keep it maintainable.

With that, he got my attention. At the time, I was leading machine learning toolkit development at GraphLab (we’ve since changed our name to Dato). Our team spent weeks debating APIs and processes to make our software “easy to use,” “high quality,” and “maintainable.” Having gone through it firsthand, I knew how important yet difficult it was to achieve those simple-sounding goals. With every slide of his talk, I nodded along, silently saying “YES!” It was like finding a long-lost brother in the midst of a snowstorm.

In preparation for the interview, I had come up with a list of questions and shared them with him. But when the time came, we decided it’d be more fun to not follow the script.

Alice Zheng: Let’s start from the beginning. How did you pick computer vision as a field of study?

Andreas Mueller: That’s a funny story. My master’s degree was in pure math, with no numbers at all, just abstract algebra. When I finished that, I took a little break and went on holidays. I thought, hmm, what do I want to do now? Math is nice but it’s really hard. I want to do a Ph.D. in math, but it’s not very applied and doesn’t impact the real world. I can’t even explain what I’m doing to the other math students. There are about five people in the city that I can explain it to. So, it was a bit frustrating. Then I saw a news article in a German newspaper about a robot soccer team from my alma mater, the University of Bonn, that won the robot soccer world cup. I’ve always liked robots, so I thought, maybe I should build robots. So, I went to the robotics professor about a Ph.D. position. He told me, “Yeah, robots are fun but they break all the time. You’re a mathematician, and I don’t want you soldering, fixing the hardware, and changing the battery. You should work on the algorithms.” So he gave me some topics, and I ended up doing computer vision and machine learning.

AZ: Funny, that’s how I ended up doing machine learning, too. I visited a bunch of robotics labs and decided I didn’t want to spend five years soldering. So, algorithms it was. (Laugh)

So, what was your experience like with computer vision? I think of computer vision as an application area for machine learning. Were you more on the data-handling application side or the machine learning methods side?

AM: The research that I did was quite complicated on the modeling side. My goal was to solve the computer vision problem. I was given a list of data sets and chose to work on the image segmentation problem. Then I went deeper and deeper into the methods, which is how research works, in a way. My knowledge and my background are more on the machine learning side.

AZ: What kind of image segmentation methods did you work on?

AM: Conditional Random Fields based on superpixels.

AZ: What are superpixels?

AM: Superpixels became popular about eight years ago. It’s kind of an abstraction of an image. If you want to reason about an image, raw pixels are a bad unit because they are not really there in the world. So, you try to break up an image into parts that look similar, that you are very certain would belong to the same object. Whatever reasoning I want to do, I can do it on top of superpixels. You basically over-segment your image into hundreds or thousands of small pieces that are all compact, and then do your reasoning.

AZ: Is it done algorithmically using some kind of clustering algorithm?

AM: Yeah, basically it’s done using clustering. There’s a k-means algorithm and a mean shift algorithm. If you look in scikit-image, there are four of them, one of which I wrote.
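To make the clustering idea concrete, here is a minimal sketch that over-segments a small synthetic image by running k-means on each pixel's position and color. It is a simplified illustration, not one of the scikit-image algorithms Mueller mentions; the function name `kmeans_superpixels`, the `spatial_weight` parameter, and the toy image are all assumptions made for this example.

```python
import numpy as np
from sklearn.cluster import KMeans


def kmeans_superpixels(image, n_segments=8, spatial_weight=1.0, seed=0):
    """Over-segment an RGB image by clustering pixels on (row, col, color).

    Hypothetical helper for illustration only. A larger spatial_weight
    favors compact segments; a smaller one favors color similarity.
    """
    height, width, _ = image.shape
    rows, cols = np.mgrid[0:height, 0:width]
    # Scale positions to [0, 1) so they are comparable with colors in [0, 1].
    position = np.stack([rows / height, cols / width], axis=-1) * spatial_weight
    features = np.concatenate([position, image], axis=-1).reshape(-1, 5)
    kmeans = KMeans(n_clusters=n_segments, n_init=10, random_state=seed)
    labels = kmeans.fit_predict(features)
    return labels.reshape(height, width)


# Toy image: left half red, right half blue (values in [0, 1]).
image = np.zeros((32, 32, 3))
image[:, :16, 0] = 1.0
image[:, 16:, 2] = 1.0

segments = kmeans_superpixels(image, n_segments=8)
print(segments.shape, "segments found:", len(np.unique(segments)))
```

Each entry of `segments` is the superpixel label of the corresponding pixel, so later reasoning can work on a handful of regions instead of every raw pixel.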

AZ: It’s an interesting switch to go from academia to full-time software maintenance and support. Why did you make that choice, and how do you feel about it today?

AM: My first job after graduate school was at Amazon as a computer vision scientist. I developed software for a production system. When I started working on scikit-learn, I felt that this is the largest impact I could have. I was very familiar with the codebase, so it was easy for me to make contributions. It was a vitally important codebase. A lot of people use it, and I get a lot of good feedback from people who like it. I’m really happy because I feel that creating and maintaining this software package helps a lot of people.

AZ: How did you get started with scikit-learn?

AM: I kinda used it during my Ph.D. in computer science. I was using Python, and I was looking for tools to do easy prototyping for my research. Scikit-learn was basically the best one and the easiest one to use, so I started using it. At some point, they had a sprint at NIPS in Granada, Spain (2011). I did a couple of very simple pull requests and asked them, “Hey, can I come to this sprint? You said you had funding?” So, they flew me to Granada to attend NIPS. I participated in the sprint, and at the end of it they said, “Hey, our maintainer is finishing his position. We need a new maintainer. Do you want to do it?” And, uh …

AZ: You make it sound so easy. How did you manage to do that? Did you commit a whole bunch of stuff during the sprint?

AM: I did a couple of fixes before. But I think they liked that I took care of some detailed questions, did bug fixes and some new algorithms. I wasn’t just adding new stuff.

AZ: So, you are a full-stack maintainer who’s good with community support.

AM: Yeah. I wasn’t just answering questions on Stack Overflow. There were a lot of issues on GitHub, and I took care of many of the small annoyances that were affecting users.

AZ: Other than Stack Overflow, do you follow other online forums or blogs for machine learning?

AM: I use Twitter a lot. Andrej Karpathy (@karpathy) has amazing blog posts and interesting Tweets. Jake VanderPlas (@jakevdp) also writes blog posts. John Cook has some great stuff.

AZ: What would be your advice to someone who’s just starting out in data science? Any tips?

AM: It’s hard for me to say. I see myself more as a machine learning researcher than a data scientist. I’m not very good with Hadoop, which you might want to be if you are a data scientist. There are many different routes. Having a strong programming background is probably also helpful.

There are a lot of data science academies and week-long workshops. But a two-week training may not be enough to get you a job in industry. There are some master’s programs now. Soft plug: NYU has a master’s program for data science. It’s geared toward industry. People come in with varying backgrounds. They teach you programming, statistics, big data technologies, machine learning, and visualization. The idea is that you’ll be able to take a job in the industry as a data scientist after graduation.

AZ: Is it based on coursework or projects?

AM: Both. You get to do analysis on many data sets. You get a lot of practical experience. It seems more realistic to do this in two years rather than two weeks.

Also, some people start out as programmers and transition into data science on the job. It’s a good way to go if it’s available.

AZ: It’s interesting that you called yourself a machine learning researcher, not a data scientist. In your view, what’s the difference between data science and machine learning?

AM: (Laugh) You know, at work [the Center for Data Science at New York University], there is this big thing written on the wall: “What is data science to you?” We have researchers who target this question. Here’s something that I wrote down that many people seem to agree with: machine learning targets the methods, algorithms, and structured tasks. It tries to answer the questions “what is the task we’re trying to solve? How do we measure success?” Machine learning assumes that the data set is given along with labels.

Data scientists or data engineers, on the other hand, go to the database, extract the schema, and maybe even write the script to log the data in the first place. I feel that data science is more software engineering oriented. It’s also much closer to the data, whereas machine learning people think more about the methods.

AZ: That’s a great definition. Data science is more about software engineering, sits closer to the data, and machine learning concerns itself with the methods. It’s very concise.

In three sentences, can you describe your job?

AM: I work on open source tools to help researchers apply machine learning and do data analysis so they don’t have to worry about software too much. I want to educate people on how to use the tools in the right way, how to do proper documentation, testing, version control, how to build software that’s maintainable and do experiments that are repeatable.

AZ: What does a typical work day for you look like? What are the usual tasks?

AM: I usually start out by going through my emails. Many of the contributors to scikit-learn live in Europe. So, while I sleep in New York, there would be about a hundred notifications per day accumulating on GitHub. I’ll go through all of the notifications, catch up on what happens in the GitHub issues, make comments, do code reviews, etc. Then I would check in with my students who are working on projects. Then I’ll meet with other researchers to discuss their projects, or work on my own parts of the scikit-learn code. There are 300 open pull requests in scikit-learn, so I’ll work on those and do bug fixes. Then I go on Stack Overflow and do support for the scikit-learn tag.

AZ: How did you pick between Python and R when you started out? Why did you go with Python and scikit-learn?

AM: Actually, I wasn’t aware of R. It’s very community-specific. Some communities are very R-driven. But in the wider computer science community, people rarely talk about R. It’s mostly used by statisticians. Also, I’d already started doing stuff with Python. I wrote my own neural network, and my lab was using Python a lot. I liked Python, so I looked for tools in Python.

AZ: Interesting. What about Matlab? I remember that to be the dominant tool in computer vision.

AM: Yeah, I also used Matlab. It’s the reason I reimplemented the segmentation algorithms in scikit-image — they were only available in Matlab then. People in computer vision are switching from Matlab to Python, I think. Part of that is due to scikit-image. But it’s also because people in the industry don’t want to pay for Matlab licenses.

AZ: A lot of tools are being built in this space. How do you see scikit-learn in relation to all the other tools?

AM: My personal opinion is slightly different from that of some other developers. I think scikit-learn is best in a single machine environment. I find it to be very handy along with Pandas and IPython Notebook for iterating quickly on a single machine. If you have a lot of data, you can work on a big EC2 instance with a lot of space. You can even put it in production if you want to.

AZ: What does production mean to you?

AM: If you are a researcher, then maybe it means running experiments. If you are in industry, then it might be putting the trained model behind a Web server to make predictions.

If I have more than a terabyte of data, I probably wouldn’t pick scikit-learn. It’s very hard to get a terabyte of RAM, and it probably won’t be fun. You can do out-of-core computations with scikit-learn. But a terabyte is probably on the border of what your hard disk can handle.

AZ: What kind of out-of-core computation can you do with scikit-learn?

AM: You can do linear models, some clustering, some PCA. You can stream data over the network into a single machine. Scikit-learn has a partial_fit function that allows the model to update a little bit given new data. But you’d still need to write the rest of the infrastructure yourself.
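As a rough illustration of the streaming pattern described here, the sketch below feeds a linear classifier one batch at a time through `partial_fit`, which is a method on certain scikit-learn estimators such as `SGDClassifier`. The `make_batch` helper, the synthetic data, and the batch sizes are assumptions standing in for whatever infrastructure actually reads chunks from disk or the network.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
true_weights = np.array([2.0, -1.0, 0.5])  # hypothetical ground truth


def make_batch(n_rows):
    """Stand-in for reading one chunk of data from disk or the network."""
    X = rng.normal(size=(n_rows, 3))
    y = (X @ true_weights > 0).astype(int)
    return X, y


model = SGDClassifier(random_state=0)
all_classes = np.array([0, 1])  # partial_fit needs the full set of classes

# Only one batch is held in memory at a time; the model is updated a
# little with each call instead of being refit from scratch.
for _ in range(50):
    X_batch, y_batch = make_batch(200)
    model.partial_fit(X_batch, y_batch, classes=all_classes)

X_test, y_test = make_batch(1000)
accuracy = model.score(X_test, y_test)
print(f"held-out accuracy: {accuracy:.2f}")
```

The loop is the part a user would have to write themselves: scikit-learn updates the model, but reading, batching, and scheduling the data is left to the surrounding code.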

So, if you have a lot of data, you’d probably need to distribute it, either by writing your own from the ground up, or by using other libraries like GraphLab Create, MLlib, or H2O. They are all separate ecosystems of distributed algorithms, and they all look pretty cool. I haven’t looked into them that much because my data is not that big.

There’s probably a ways to go [for the other tools] in terms of usability and performance. If you look at the random forest benchmark against MLlib, scikit-learn still looks pretty good compared to distributed algorithms. But it’s a single data set. More benchmarks are probably needed.

AZ: It’s pretty easy to come up with benchmarks where either distributed or single machine comes up short. I think the key question is graceful degradation (or transition) from in-memory to out-of-core to distributed.

AM: Yeah, and I think large data requires very different architecture behind the scenes. Scikit-learn is optimized for smaller data. It makes sense to have different tools for different tasks. If they have the same interface, it’d be nice. But it’s not necessary.

AZ: What do you think this tool ecosystem needs more of or less of?

AM: Maybe less Machine Learning as a Service (laugh). There are already 500 of them. It’s very hard to see what they are doing, and it’s hard to understand where the value-add is. Your stuff (Dato Core) is open source, H2O is open source, Spark is open source. It’s very easy to see the algorithm development and the value-add. But for products that just put a GUI on top of scikit-learn, it’s unclear what happens underneath or how they contribute to the community.

We need more and better open source data exploration and visualization tools. There are a couple of cool visualization projects going on in the Python and JavaScript world, but there are no common solutions for visualization yet. Those that are available have their limitations: going from Python to d3 is hard, and d3 is too low level. Having easier ways to visualize data is nice.

Being able to do better data munging would also be nice. We have Pandas and SFrame and RDDs, but you still need to do a lot of coding. Data cleaning and munging should be more automatic and more easily accessible. For instance, someone showed me his data and described what he wanted to do. I pulled out my notebook, did some joins and groupby’s in Pandas, and solved his problem in 20 minutes. He said, “OMG, that would have taken me two months because I don’t know how to do this.” I think it would be great if he didn’t have to come to me or read [Wes McKinney’s] book.
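For readers who have not seen this kind of munging, here is a small sketch of a join followed by a groupby in pandas. The two tables, their column names, and the values are invented for illustration; they are not the data from the anecdote.

```python
import pandas as pd

# Hypothetical tables: one row per order, one row per customer.
orders = pd.DataFrame(
    {"customer_id": [1, 2, 1, 3], "amount": [20.0, 35.0, 15.0, 50.0]}
)
customers = pd.DataFrame(
    {"customer_id": [1, 2, 3], "region": ["east", "west", "east"]}
)

# Join: attach each customer's region to their orders.
joined = orders.merge(customers, on="customer_id", how="left")

# Groupby: total order amount per region.
totals = joined.groupby("region")["amount"].sum()
print(totals)
```

The whole task takes two calls, `merge` and `groupby`, which is quick once you know them and opaque if you do not.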

AZ: That’s a hard problem to solve. Raw data and use cases are so different. It’s hard to design automated tools that can do the matching automatically. And even if you built it, how would you tell people what it does without making them read a textbook?

AM: You’d need to write interfaces that allow people to describe what they need to do at an abstract level. It’s a super hard problem. A lot of people are trying to solve it. But maybe even more people should be trying to solve it because it’s so hard.

AZ: Earlier you talked about the need for more transparency and less blackbox software. To you, is this part of the appeal of open source software? What is the importance of open source?

AM: One reason why it’s important is to open up research to others. If you have research software in closed-source or for-profit libraries, then it’s very hard for others to reproduce the research. Basically, they have to buy the product so they can see that you did the right thing, or do the thing for themselves. Also, open source allows scientists to do their research without having to pay. If all the tools are free and open, then everybody has the same chance to do exploration and machine learning on their data, whether they are in sociology, biology, or computer science.

There are also benefits for commercial applications. But I’m coming from the open research direction.

AZ: Thank you, Andreas, for a fun interview! It’s great to hear about how you got started in computer vision and scikit-learn, and to find out your thoughts on data science versus machine learning and why you care so much about open source software.

AM: Thank you, Alice! It’s been a pleasure.


An interview with Andreas Mueller, on scikit-learn and usable machine learning software.
