Profile of the Data Journalist: The Long Form Developer

ProPublica developer Dan Nguyen is redefining how longform journalism is told through data.

Around the globe, the bond between data and journalism is growing stronger. In an age of big data, the growing importance of data journalism lies in the ability of its practitioners to provide context and clarity and, perhaps most important, to find truth in the expanding amount of digital content in the world.

To learn more about the people who are doing this work and, in some cases, building the newsroom stack for the 21st century, I conducted a series of email interviews during the 2012 NICAR Conference.

Dan Nguyen (@dancow) is an investigative developer/journalist based in Manhattan. Our interview follows.

Where do you work now? What is a day in your life like?

I’m a news app developer at ProPublica, where I’ve worked for about 3.5 years. It’s hard to say what our typical day is like. Ideally, I either have a project or am writing code to collect the data to determine whether a project is worth doing (or just doing old-fashioned reading of articles/papers that may spark ideas for things to look at). We’re a small operation so we have our hands on any of the daily news production as well, including helping the reporters put together online features for their more print-focused work.

How did you get started in data journalism? Did you get any special degrees or certificates?

I stumbled into data journalism because I had always been interested in being a journalist but double majored in journalism and computer engineering just in case the job market didn’t work out. Out of college, I got a good job as a traditional print reporter at a regional newspaper but was eventually asked to help with the newsroom’s online side. I got back into programming and started to realize there was a role for programming in important journalism.

Did you have any mentors? Who? What were the most important resources they shared with you?

The mix of programming and journalism is still relatively new, so I didn’t have any formal mentors in it. I was of course lucky that my boss at ProPublica, Scott Klein, had a great vision about the role of news applications in our investigative journalism. We were also fortunate to have Brian Boyer (now the news applications editor at the Tribune company) to work with us as we started doing news apps with Ruby on Rails, as he had come into journalism from being a professional developer.

What does your personal data journalism “stack” look like? What tools could you not live without?

In terms of day-to-day tools, I use RVM (Ruby Version Manager) to run multiple versions of Ruby, which is my all-purpose tool for doing any kind of batch task work, text processing/parsing, number crunching, and of course Ruby on Rails development. Git, of course, is essential, and I combine that with Dropbox to keep versioned copies of personal projects and data work. On top of that, my most frequently used tool is Google Refine, which takes the tedium out of exploring new data sets, especially if I have to clean them.

What data journalism project are you the most proud of working on or creating?

The project I’m most proud of is something I did before SOPA Opera, which was our Dollars for Docs project in 2010. It started off with just a blog post I wrote to teach other journalists how web scraping was useful. In this case, I scraped a website Pfizer used to disclose what it paid doctors to do promotional and consulting work. My colleagues noticed and said that we could do that for every company that had been disclosing payments. Because each company disclosed these payments in a variety of formats, including Flash containers and PDFs, few people had tried to analyze these disclosures in bulk, to see nationwide trends in these financial relationships.

A lot of the data work happened behind the scenes, including writing dozens of scrapers to cross-reference our database of payments with state medical board and med school listings. For the initial story, we teamed up with five other newsrooms, including NPR and the Boston Globe, which required programmatically creating a system in which we could coordinate data and research. With all the data we had, and the number of reporters and editors working on this outside of our walls, this wasn’t a project that would’ve succeeded by just sending Excel files back and forth.

The website we built from that data is our most visited project yet, as millions of people used it to look up their doctors. Afterwards, we shared our data with any news outlet that asked, so hundreds of independently reported stories came from our data. Among the results were that the drug companies and the med schools revisited their screening and conflict of interest policies.

So, in terms of impact, Dollars for Docs is the project I’m proudest of. But it shares something in common with SOPA Opera (which was mostly a solo project that took a couple weeks), in that both projects were based on already well-known and long-ago-publicized data. But with data journalism techniques, there are countless new angles to important issues, and countless new and interesting ways to tell their stories.
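For readers unfamiliar with the kind of scraping and cross-referencing Nguyen describes, the following is a minimal sketch, not ProPublica's code. It assumes a hypothetical disclosure page that lists payments in a plain HTML table and a hypothetical medical board listing. The markup, names, amounts, and helper names (`TableScraper`, `normalize_name`, `cross_reference`) are all invented for illustration. Nguyen names Ruby as his everyday tool; Python's standard library is used here only as an example.

```python
import re
from html.parser import HTMLParser

# Hypothetical markup standing in for one company's disclosure page.
SAMPLE_HTML = """
<table>
  <tr><th>Doctor</th><th>State</th><th>Amount</th></tr>
  <tr><td>Dr. Jane A. Doe</td><td>NY</td><td>$1,500</td></tr>
  <tr><td>John Roe, MD</td><td>CA</td><td>$250</td></tr>
</table>
"""

# Hypothetical state medical board listing (invented data).
BOARD_LISTING = [{"name": "Jane Doe", "state": "NY", "license": "active"}]


class TableScraper(HTMLParser):
    """Collect the text of every <td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr":
            if self._row:  # header rows hold only <th> cells, so they stay empty
                self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()


def parse_payments(html):
    """Turn the scraped rows into dicts with a numeric dollar amount."""
    scraper = TableScraper()
    scraper.feed(html)
    return [
        {"doctor": doctor, "state": state,
         "amount": int(amount.replace("$", "").replace(",", ""))}
        for doctor, state, amount in scraper.rows
    ]


def normalize_name(name):
    """Reduce a name to (first, last), dropping titles and single-letter initials."""
    tokens = re.findall(r"[a-z]+", name.lower())
    tokens = [t for t in tokens if t not in {"dr", "md"} and len(t) > 1]
    return (tokens[0], tokens[-1]) if tokens else ()


def cross_reference(payments, board):
    """Pair each payment with the board record sharing its name and state, or None."""
    index = {(normalize_name(r["name"]), r["state"]): r for r in board}
    return [(p, index.get((normalize_name(p["doctor"]), p["state"])))
            for p in payments]


payments = parse_payments(SAMPLE_HTML)
matches = cross_reference(payments, BOARD_LISTING)
```

Matching on a normalized name plus state is a deliberately crude choice for this sketch. Real disclosures arrive in many formats, as the interview notes, and name matching at scale would likely need fuzzier comparison and manual verification.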

Where do you turn to keep your skills updated or learn new things?

I check Hacker News and the programming subreddit constantly to see what new hacks, projects, and plugins the community is putting out. I also have a huge backlog of programming books on my Kindle, some of them free ones that were posted on HN.

Why are data journalism and “news apps” important, in the context of the contemporary digital environment for information?

I went into journalism because I wanted to be a longform writer in the tradition of the New Yorker. But I’m fortunate that I stumbled onto the path of using programming to do journalism; more and more, I’m seeing how important stories aren’t being done even though the data and information are out in broad daylight (as they were in D4D and SOPA Opera) because we have relatively few journalists with the skills or mindset to process and understand that data. Of course, doing this work doesn’t preclude me from presenting in a longform article; it just happens that programming also provides even more ways to present a story when narrative isn’t the only (or the ideal) way to do so.
