Profile of the Data Journalist: The Long Form Developer

ProPublica developer Dan Nguyen is redefining how longform journalism is told through data.

Around the globe, the bond between data and journalism is growing stronger. In an age of big data, the growing importance of data journalism lies in the ability of its practitioners to provide context and clarity and, perhaps most important, to find truth in the expanding amount of digital content in the world.

To learn more about the people who are doing this work and, in some cases, building the newsroom stack for the 21st century, I conducted a series of email interviews during the 2012 NICAR Conference.

Dan Nguyen (@dancow) is an investigative developer/journalist based in Manhattan. Our interview follows.

Where do you work now? What is a day in your life like?

I’m a news app developer at ProPublica, where I’ve worked for about 3.5 years. It’s hard to say what our typical day is like. Ideally, I either have a project underway or am writing code to collect the data needed to determine whether a project is worth doing (or I’m just doing old-fashioned reading of articles and papers that may spark ideas for things to look at). We’re a small operation, so we have our hands in the daily news production as well, including helping reporters put together online features for their more print-focused work.

How did you get started in data journalism? Did you get any special degrees or certificates?

I stumbled into data journalism because I had always been interested in being a journalist but double majored in journalism and computer engineering just in case the job market didn’t work out. Out of college, I got a good job as a traditional print reporter at a regional newspaper but was eventually asked to help with the newsroom’s online side. I got back into programming and started to realize there was a role for programming in important journalism.

Did you have any mentors? Who? What were the most important resources they shared with you?

The mix of programming and journalism is still relatively new, so I didn’t have any formal mentors in it. I was, of course, lucky that my boss at ProPublica, Scott Klein, had a great vision of the role of news applications in our investigative journalism. We were also fortunate to have Brian Boyer (now the news applications editor at the Tribune Company) working with us as we started doing news apps with Ruby on Rails, since he had come into journalism from a career as a professional developer.

What does your personal data journalism “stack” look like? What tools could you not live without?

In terms of day-to-day tools, I use RVM (Ruby Version Manager) to run multiple versions of Ruby, which is my all-purpose tool for doing any kind of batch task work, text processing/parsing, number crunching, and, of course, Ruby on Rails development. Git, of course, is essential, and I combine that with Dropbox to keep versioned copies of personal projects and data work. On top of that, my most frequently used tool is Google Refine, which takes the tedium out of exploring new data sets, especially if I have to clean them.
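
For readers unfamiliar with that kind of workflow, here is a minimal sketch of the sort of batch cleanup chore Ruby lends itself to. The file name and column names ("payments.csv", "doctor_name", "amount") are hypothetical, and Google Refine could do similar cleanup interactively; this is only an illustration of the technique, not anything from ProPublica's codebase.

```ruby
require "csv"

# Read a (hypothetical) raw payments file, tidy up names and amounts,
# and write a cleaned copy -- the kind of batch chore described above.
cleaned = CSV.read("payments.csv", headers: true).map do |row|
  {
    # collapse stray whitespace and normalize casing in the name field
    "doctor_name" => row["doctor_name"].to_s.strip.squeeze(" ").upcase,
    # strip currency symbols and commas so the amount reads as a number
    "amount"      => row["amount"].to_s.gsub(/[$,]/, "").to_f
  }
end

CSV.open("payments_clean.csv", "w") do |csv|
  csv << ["doctor_name", "amount"]
  cleaned.each { |row| csv << [row["doctor_name"], row["amount"]] }
end
```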

What data journalism project are you the most proud of working on or creating?

The project I’m most proud of is something I did before SOPA Opera, which was our Dollars for Docs project in 2010. It started off with just a blog post I wrote to teach other journalists how useful web scraping could be. In this case, I scraped a website Pfizer used to disclose what it paid doctors to do promotional and consulting work. My colleagues noticed and said that we could do that for every company that had been disclosing payments. Because each company disclosed these payments in a variety of formats, including Flash containers and PDFs, few people had tried to analyze these disclosures in bulk, to see nationwide trends in these financial relationships.
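
As an illustration of the scraping technique Nguyen describes (not the actual Dollars for Docs code), a bare-bones Ruby scraper might look something like the sketch below. The URL and CSS selectors are placeholders, since the real disclosure pages came in a variety of formats, including Flash and PDF.

```ruby
require "open-uri"
require "nokogiri"
require "csv"

# Placeholder URL -- not the real disclosure page.
url  = "https://example.com/payments-disclosure"
page = Nokogiri::HTML(URI.open(url).read)

CSV.open("disclosures.csv", "w") do |csv|
  csv << ["name", "payment"]
  # Assume each disclosure sits in a table row; these selectors are made up.
  page.css("table#payments tr").each do |row|
    cells = row.css("td").map { |td| td.text.strip }
    csv << cells unless cells.empty?
  end
end
```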

A lot of the data work happened behind the scenes, including writing dozens of scrapers to cross-reference our database of payments with state medical board and med school listings. For the initial story, we teamed up with five other newsrooms, including NPR and the Boston Globe, which required building a system in which we could coordinate data and research programmatically. With all the data we had, and the number of reporters and editors working on this outside of our walls, this wasn’t a project that would’ve succeeded by just sending Excel files back and forth.
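
The cross-referencing step can be sketched in a similarly hedged way. The snippet below assumes two hypothetical CSVs and matches on a crude normalized name key; the real project required far more careful handling of name variants and manual verification.

```ruby
require "csv"

# Reduce a full name to a crude matching key: LASTNAME plus first initial.
def name_key(name)
  parts = name.to_s.upcase.gsub(/[^A-Z ]/, "").split
  return nil if parts.empty?
  "#{parts.last} #{parts.first[0]}"
end

# Hypothetical faculty roster, grouped by the normalized key.
faculty = CSV.read("faculty.csv", headers: true)
             .group_by { |row| name_key(row["name"]) }

# Flag payment records whose doctor name matches a faculty listing.
CSV.read("payments.csv", headers: true).each do |payment|
  matches = faculty[name_key(payment["doctor_name"])] || []
  next if matches.empty?
  puts "#{payment['doctor_name']} may match #{matches.size} faculty listing(s)"
end
```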

The website we built from that data is our most visited project yet, as millions of people used it to look up their doctors. Afterwards, we shared the data with any news outlet that asked, and hundreds of independently reported stories came out of it. Among the results: drug companies and med schools revisited their screening and conflict-of-interest policies.

So, in terms of impact, Dollars for Docs is the project I’m proudest of. But it shares something in common with SOPA Opera (which was mostly a solo project that took a couple of weeks), in that both projects were based on already well-known and long-ago-publicized data. But with data journalism techniques, there are countless new angles to important issues, and countless new and interesting ways to tell their stories.

Where do you turn to keep your skills updated or learn new things?

I check Hacker News and the programming subreddit constantly to see what new hacks, projects, and plugins the community is putting out. I also have a huge backlog of programming books on my Kindle, some of them free ones that were posted on HN.

Why are data journalism and “news apps” important, in the context of the contemporary digital environment for information?

I went into journalism because I wanted to be a longform writer in the tradition of the New Yorker. But I’m fortunate that I stumbled onto the path of using programming to do journalism; more and more, I’m seeing how important stories aren’t being done, even though the data and information are out in broad daylight (as they were in D4D and SOPA Opera), because we have relatively few journalists with the skills or mindset to process and understand that data. Of course, doing this work doesn’t preclude me from presenting it in a longform article; it just happens that programming also provides even more ways to present a story when narrative isn’t the only (or the ideal) way to do so.

  • Jim Mulvaney

    Yes, investigative reporting requires more data skill, but it also requires the 5 Ws and H, and it requires the people skills of identifying sources who can point one toward promising areas for dissection. As a former member of one of the most famous investigative groups, Newsday’s Greene Team, I can tell you that the model (on which IRE etc. are based) is to use the combined skills of a number of professionals. I have substantial skills with data analysis (as a government official, I developed the prototype for examining civil rights compliance in federal ARRA contracts), but I combine them with traditional reporting. The ARRA project came about as a result of getting a tip (a guy called to say that all the construction workers on a major federal project were white), checking it out (I conducted surveillance; out of the 1,000-plus workers, there appeared to be maybe 50 folks of color), and then attacking the data. The state is still developing the project and I am no longer involved. But I still believe the best projects are the ones that start from an allegation rather than from a data cruncher’s “what if” sessions.

  • http://www.propublica.org Dan Nguyen

    I skipped mentioning them by name or in detail since this Q&A focused purely on the data-backed means of journalism, but there is no doubt the Dollars for Docs project would not have been successful without the lead reporters, Charlie Ornstein and Tracy Weber. In fact, it would not really have taken off at all, because Charlie and Tracy, with their experience reporting on health care, knew best what direction the project should take.

    Reporting from a database isn’t a replacement for traditional journalism; I’m arguing that it is a skill deserving of the same emphasis as interviewing/writing/FOIAing/etc. in the journalism curriculum. This is both because so much of our information is now distributed in digital format and because we now have the capability to better parse/analyze info of all kinds with the power of the tools and computing available to us.

    There are a good number of well-trained programmers out there, and yet relatively few of them go into public data analysis. And even fewer undertake civic projects because, as you say, it’s hard to produce strong work from “what if” sessions. They need the insights that come from covering a beat.

    At the same time, there are a great number of stories waiting to be done that are practically under the noses of adept journalists. But they won’t be done because of technical hurdles, some of which are extremely trivial.

    What’s the proper mix between journalist and programmer? I’m in the camp that hopes that basic programming becomes part of the school curriculum, just as basic Excel and word processing usually are. In the meantime, there needs to be better recognition that the means to collect/parse/analyze data are extremely valuable to the core of the journalism profession, rather than just a side skill.

    This isn’t just limited to the field of journalism, of course; the scientific, business, and legal fields, among many other research-intensive professions, would greatly benefit from better data and programming skills.

  • Patrick Mattimore

    Annually, since 2005, the College Board has been releasing “data” on the results of the previous year’s Advanced Placement (AP) exams. Every year the CB reports that more students passed the exams and that greater percentages of students passed the exams. Each state dutifully picks up the story and emphasizes its own results, which generally highlight increases in that state. The AP program has been expanding at a rate of about 8-10% per year for at least the last ten years, meaning that the College Board administered approximately twice as many exams in 2011 as it did in 2001.

    What the media are not reporting (and apparently don’t understand) is the fact that the passing percentages (based on test takers) actually decline by about .5% per year. The claimed percentage increase in passing exams is based on a state’s graduating seniors (whether they took an AP exam or not). Because newspapers don’t report the increasing failure percentages based on test takers, it is in schools’ interests to push more and more kids to take AP classes and tests, since failures on the tests are no different from non-participation.

    What’s more, the passing-percentage declines are even steeper for minorities, since the best students, who might have taken 4 or 5 AP classes in high school ten years ago, are now taking 7 or 8 and “helping” the overall passing percentage. The number of 1s on the exam (the lowest possible score) has doubled in the last ten years, indicating that unprepared students are replacing marginally prepared ones. Unfortunately, the College Board’s increased-success narrative is about the only one that gets into the news, because reporters are not taking the time to crunch the data.
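
    A tiny worked example, with invented numbers chosen only to illustrate the denominator effect at issue, shows how both claims can be true at once: as participation grows, the pass rate among test takers can fall even while passes per graduating senior (the figure usually reported) rise.

    ```ruby
    # Hypothetical cohort: numbers are invented purely for illustration.
    seniors = 100_000

    years = [
      { takers: 20_000, passes: 12_000 },  # hypothetical year 1
      { takers: 30_000, passes: 17_400 }   # hypothetical year 2
    ]

    years.each_with_index do |y, i|
      per_taker  = 100.0 * y[:passes] / y[:takers]   # rate among test takers
      per_senior = 100.0 * y[:passes] / seniors      # rate among all seniors
      puts "Year #{i + 1}: #{per_taker.round(1)}% of test takers passed; " \
           "#{per_senior.round(1)}% of all seniors passed an exam"
    end
    # Year 1: 60.0% of test takers passed; 12.0% of all seniors passed an exam
    # Year 2: 58.0% of test takers passed; 17.4% of all seniors passed an exam
    ```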

  • http://k12newsnetwork.com Cynthia

    I’d like to extend a challenge to data journalists to dig more deeply into the claims of “value-added measures” of teacher effectiveness as linked to student standardized test scores.

    That includes analyzing AP scores more critically (as mentioned above), but also filtering releases of VAM data through a more discerning look at what the data can reasonably be made to say.

    For example, Bruce Baker over at School Finance 101 has parsed the NYC school teacher data carefully and it seems his takeaway (or my understanding of it) is this:

    Journos need help going beneath shallow “error rate” critiques of released data. VAM may be only slightly more predictive when linking teachers to student math standardized test scores than to ELA scores (which is to say, still hardly at all). And even so, the largest data set contains 18,000 data points, some of which may be teachers counted twice; by contrast, there are 150,000 teachers in NYC, so the data that has been released may only apply to 10% of the teaching workforce in NYC.

    This isn’t about better data viz to explain VAM scores; it’s about being critical of the magical thinking invested in data in the first place, as if it weren’t possible to lie with statistics.

    “Data-driven” approaches to education policy are now being implemented with little public vetting and despite a very large body of evidence showing these approaches are unproven. (Evaluate the PE or art teacher based on math scores? What?) Please help in that effort; the public sorely needs reliable translation of what seemingly authoritative charts and spreadsheets have to say.