How I automated my writing career

A former author uses data and software to take the tedium out of some kinds of writing.

In 2001, I got an itch to write a book. Like many people, I naïvely thought, “I have a book or two in me,” as if writing a book were as easy as putting pen to paper. It turns out to be very time-consuming, and that’s after you’ve spent countless hours learning, researching and organizing your topic of choice. But I marched on and wrote or co-wrote 10 books in a five-year period. I’m a glutton for punishment.

My day job during that time was programming. I’ve been programming for 16 years. My whole career I’ve focused on automating the un-automatable — essentially making computers do things people never thought they could do. By the time I started on my 10th book, I got another kind of itch — I wanted to automate my writing career. I was getting bored with the tedium of writing books, and the money wasn’t that good.

But that’s absurd, right? How can a computer possibly write something coherent and informative, much less entertaining? The “how can a computer possibly do X?” questions are the ones I’ve spent my career trying to answer. So, I set out on a quest to create software that could write. It took more effort than writing 10 books put together, but after building a team of 12 people, we were able to use our software to generate more than 100,000 sports-related stories in a nine-month period.

Before I get into specifics with what our software produces, I think it’s worth highlighting some of the attributes that make software a great candidate to be a writer:

  • Software doesn’t get writer’s block, and it can work around the clock.
  • Software can’t unionize or file class-action lawsuits because we don’t pay enough (as many of the content farms have had to deal with).
  • Software doesn’t get bored and start wondering how to automate itself.
  • Software can be reprogrammed, refactored and improved — continuously.
  • Software can benefit from the input of multiple people. This is unlike traditional writing, which tends to be a solitary event (+1 if you count the editor).
  • Perhaps most importantly, software can access and analyze significantly more data than a single person (or even a group of people) can on their own.

Software isn’t a panacea, though. Not all content can be easily automated (yet). The type of content my company, Automated Insights, has automated is quantitatively oriented. That’s the trick. We’ve automated content by applying meaning to numbers, to data. Sports was the first category we tackled. Sports by their nature are very data heavy. By our internal estimates, 70% of all sports-related articles are analyzing numbers in one form or another.

Our technology combines a large database of structured data, a real-time feed of stats, a large database of phrases, and algorithms that tie it all together to produce articles from two to eight paragraphs in length. The algorithms look for interesting patterns in the data to determine what to write about.
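To make that pipeline concrete, here is a toy sketch in Python. It is not our production code. The field names, thresholds, team names and phrases are all invented for illustration. It only shows the general shape: structured data goes in, a pattern detector decides what is worth mentioning, and a phrase database supplies the wording.

```python
import random

# Hypothetical phrase database: several wordings per detected pattern.
PHRASES = {
    "blowout": [
        "{winner} routed {loser} {w_score}-{l_score}.",
        "{winner} cruised past {loser}, {w_score}-{l_score}.",
    ],
    "close_game": [
        "{winner} edged {loser} {w_score}-{l_score}.",
        "{winner} survived a scare from {loser}, winning {w_score}-{l_score}.",
    ],
    "routine_win": [
        "{winner} beat {loser} {w_score}-{l_score}.",
    ],
    "big_scorer": [
        "{top_scorer} poured in {top_points} points.",
        "{top_scorer} led all scorers with {top_points}.",
    ],
}


def find_patterns(game):
    """Return the names of patterns worth writing about (thresholds are made up)."""
    margin = abs(game["home_score"] - game["away_score"])
    patterns = []
    if margin >= 20:
        patterns.append("blowout")
    elif margin <= 3:
        patterns.append("close_game")
    else:
        patterns.append("routine_win")
    if game["top_points"] >= 30:
        patterns.append("big_scorer")
    return patterns


def write_recap(game, rng=None):
    """Turn one structured game record into a short recap string."""
    rng = rng or random.Random()
    # Assumes no ties, which holds for basketball.
    home_won = game["home_score"] > game["away_score"]
    facts = dict(game)
    facts["winner"] = game["home"] if home_won else game["away"]
    facts["loser"] = game["away"] if home_won else game["home"]
    facts["w_score"] = max(game["home_score"], game["away_score"])
    facts["l_score"] = min(game["home_score"], game["away_score"])
    # One sentence per detected pattern, with the wording chosen at random.
    sentences = [
        rng.choice(PHRASES[name]).format(**facts) for name in find_patterns(game)
    ]
    return " ".join(sentences)


# Hypothetical box score; every value here is invented.
game = {
    "home": "Home U", "away": "Visitor State",
    "home_score": 82, "away_score": 61,
    "top_scorer": "J. Doe", "top_points": 31,
}

print(write_recap(game, random.Random(0)))
```

Keeping the phrases in data rather than in code is what lets writers improve the voice without touching the pattern logic, and picking among several wordings keeps thousands of recaps from reading identically.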

In November of 2010, we launched the StatSheet Network, a collection of 345 websites (one for every Division-I NCAA Basketball team) that were fully automated. Check out my favorite team: UNC Tar Heels.

Automated game recap
Software mines data to construct short game recaps.

We included the typical kind of stats you’d expect on a basketball site, but also embedded visualizations and our fully automated articles. We automated 14 different types of stories, everything from game recaps and previews to players of the week and historical retrospectives. Recently, we launched similar sites for every MLB team (check out the Detroit Tigers site), and soon we are launching sites for every NFL and NCAA Football team.

Sports is only one of many different categories we are working on. We’ve also done work in finance, real estate and a few other data-intensive industries. However, don’t limit your thinking on what’s possible. We get a steady stream of requests from non-obvious industries, such as pharmaceutical clinical trials and even domain name registrars. Any area that has large datasets where people are trying to derive meaning from the data is a potential candidate for our technology.

Automation plus human, not automation versus human

Creating software that can write long-form narratives is very difficult, full of all sorts of interesting artificial intelligence, machine learning and natural language problems. But with the right mix of talent (and funding), we’ve been able to do it. It really does take a keen understanding of how software and the written word can work together.

I often hear it suggested that software-generated prose must be very bland and stilted. That’s only the case if the folks behind the software write bland and stilted prose. Software can be just as opinionated as any writer.

A common, and funny, question I get from journalists is: “When will you automate me out of a job?” I find the question humorous because built into it is the assumption that if our software can write the perfect story on a particular topic, then no one else should attempt to write about it. That’s just not going to happen. What’s happening instead is that media companies are using our software to help scale their businesses. Initially, that takes the form of generating stories on topics a media outlet didn’t have the resources to cover. In other cases, it means putting our stories through an editorial process that customizes the content to the specific needs of the publisher. You still need humans for that. There will be less of a need for folks to spend their time writing purely quantitative pieces, but that should be liberating. Now, they can focus on more qualitative, value-added commentary that humans are inherently good at. Quantitative stories can, and probably should, be mostly automated because computers are better at that.

Software will make hyperlocal content possible and even profitable. Many companies have tried to solve the “hyperlocal problem” with minimal success. It’s just too hard to scale content creation out to every town in the U.S. (or the world, for that matter). For certain categories (e.g. high school sports), software-generated content makes perfect sense. You’ll see automated content play a big role here in the coming years.

Software-generated books?

Because I’ve been so focused on running Automated Insights, I haven’t had time to write any new books recently. I suggested to a colleague that we should turn our software loose and have it write my next book. He looked at me and asked, “How can it possibly do that?” That’s what I like to hear.

But is a software-generated book even feasible? Our software can create eight paragraphs now, but is it possible to create eight chapters’ worth of content? The answer is “yes,” but not quite the same kind of technical books I used to write, at least right now. It would be easy for us to extend our technology to write even longer pieces. That’s not the issue. Our software is good at quantitative analysis using structured data.

The kind of books I used to write were not based on data and were qualitative in nature. I pulled from my experience and did supplemental research, made a judgment on the best way to perform a task, then documented it. We are in the early stages of building software that will do more qualitative analysis such as this, but that’s a much harder challenge. The main advantage of today’s usage of software writing is to automate repetitive types of content. This is less applicable for books.

In the near term, the writers at O’Reilly and elsewhere have nothing to worry about. But I wouldn’t count out automation in the long term.

Associated box score photo on home and category pages via Wikipedia.

  • Aaron

    This is totally fascinating.
    Thank you for sharing.

  • http://andirog.blogspot.com Anil Gupta

    Great article. Just curious how difficult would it be in your opinion to identify characters and storylines in a fiction using computers.

    I have been thinking about figuring out ways to automate summarizing fiction or non-fiction. What type of data mining and machine learning techniques would help here to at least have software generate a rough draft that can be polished by human?

    Thanks

  • A. Nonny Mouse

    Great, so you’ve taught a machine to play Buzzword Bingo – Sports Edition.

  • Rudolf Olah

    What’s wrong with unions? Seriously, most “content farms” pay a pittance, there’s a reason that writers demand more.

    I wouldn’t mind software generated content because then writers could focus on writing the types of books they want rather than what pays the bills.

  • http://celsius.ws celsius

    i’m fascinated by this emerging area, and specifically how automated insights is pushing it along.

    so, for those areas without data sets — when can we expect bots to start running phone interviews to build up fresh data? i’ve never been much into sports.

  • http://www.runfatboy.net Jim Jones

    I’ve used Mechanical Turk for iterative, story creation. The plots tend to evolve organically and are quite disjointed, but can be fun to read (depending on the creativity of the Turkers).

    Email me if you’re interested. jim.jones1@gmail.com

    Oh on a side note, I’m also the developer of the Ruby/Rails gem Turkee, that makes integrating Rails with Mechanical Turk a breeze.

    http://www.github.com/aantix/turkee

  • http://markos.gaivo.net/blog/ Marko Samastur

    “I wouldn’t mind software generated content because then writers could focus on writing the types of books they want rather than what pays the bills.”

    They write what pays the bills because what they want doesn’t. Why would you think that taking that away from them would free them to write books that still wouldn’t pay (much)?

  • http://avocadopress.com Puranjay

    Robbie

    Can work for tech writing.

    No computer can create a One Hundred Years of Solitude though. That’s art. That’s the realm of the human.

  • Mitchp

    This is better than Google’s email autopilot feature which was an April Fools joke. I want this.

    Also, Go Heels!
    I bet we win the championship again this year.

    UNC Class of 2008

  • http://www.clickandinc.com/blog Sarah

    This is incredible! I’m reminded of my local modern art museum’s exhibit where visitors could type back and forth with an artificial intelligence — I visited about a dozen times until the exhibit moved on. I’m no programmer, but I’m fascinated by this technology. Thanks to you and your team for helping to turn our future into one Arthur C Clarke would be proud of.

  • http://www.esoftcoder.com mike

    Very interesting. So you just specify the topic you want to write about and the software will come up with a new book in a couple of minutes? Would you tell the software how many pages you want, and so on?

  • Max Max

    Hi – my name’s Max max. I’m a student in economics and literature in Montreal. I first want to point out that this article has significant hilarity in the literary world. Have you ever read Gulliver’s travels? You’ve been preemptively warned 250 years ago.

    I can understand how writing software to sift data and translate it into language can make a lot of sense.

    In fact, you could almost envision this step as “article chart”, a new feature in Excel. Is it more than another way of expressing your quantified data in a way that is more stylistically accessible to an increasingly numbers-averse population?

    And sure, I agree with you that this may allow numerical data to be collected and turned into article form at low cost for increasingly local communities with their own sets of importances, encouraging the decentralization of information and the emergence of an increasingly organic and resilient society.

    However, your example seems to suggest more of the mechanical nature of sports articles and “online content writing” than the possibility of a machine ever replacing genuine human authorship. At its very best, this would be a lens, a sort of mould, that individual programmers could whittle for themselves and create a tool through which their individual voice could be leveraged over a wide range of domains, hopefully increasingly aware of the subtle nuances from one to the next. Is this going to encourage the quantification of unquantifiable variables? Is this going to magnify the perceived importance of the ones already circulating through the veins of our communicable culture, adding to the chaos?

    I entertain myself some days glancing over Elance proposals and craigslist writing gigs and see tons of requests for “write content for my website – paid by word, or 500 word articles, using specific keywords”.

    Are these very little more than acts of online money-grabbing from people who hope that by creating “content”, a large enough volume of people will have their information channeled through them to justify an ad revenue? This belies an even greater joke – these content creators are intentionally creating content that they claim (in order to sway readership) is helpful while at the same time intending for you to not be so interested in it (that it is somehow trivial) that you would avoid clicking on their ads.

    My point is that if you think you can create a system that replaces the necessity people feel of writing mindless reports on quantitative sports data in their socially destructive attempt at an employment that pisses on the most interesting technology humans have going for their survival right now, then by all means go ahead, it’s the fulfillment of an ancient myth. besides, then hopefully these people can start turning their minds towards more creative and humanly worthwhile ventures. I wonder how much food is wasted feeding these people.

    But if the result is that suddenly we have an exponential explosion of written noise to sift through as it mingles with our own consistent level of voices as every uprising local news source passes it off under the pretense of worthwhile human authorship, we’re going to find ourselves informationally dependent no longer on centralized media, but on decentralized sources creating a uniform product.

    it’s kind of like how all dubstep uses the same synth sounds but there’s so much of it.

    Ooo alright well my argument is falling apart in many directions here… but I’m really interested in this, curious enough to let you open the pandora’s box. (then again, I’m sure you’re also aware that the written products can be no more than the magnification of one specific rule for creativity, something we only have backwards evidence for).

    I’m worried more for the deeper issue of how we’ve monetized the internet as a society and what the impacts of this will be on our cultural economy.

    And if we’re really witnessing a digital renaissance, where corporate funds are finally going towards worthwhile artistic and cultural production, then will this project just be a diversion of these limited resources for artists towards the possibly artless owners of local content generators? Will you be creating media powerhouses out of machines?

  • http://jeffmcneill.com/blog/ Jeff McNeill

    Very interesting. Obviously there is a certain kind of writing which this applies to. However, it seems that the data would already have to be available for it. Still, very interesting…

  • Danielle

    I too find the journalists’ question funny, because if that was the case, why do we have several daily papers writing about the same news, with varied readership? It’s down to the journalist’s writing style. By automating content you are adding another writer’s style into the mix, not replacing anyone.

    I find this a fascinating piece and am intrigued to see further developments.

  • http://www.indianacarinsurancequoteonline.com/sr50-filing Dhanna Chung

    I like the concept of Software plus Human, it’s the best idea to keep an idea on how to work on automation. Software is really vital to lessen up writing efforts specially on marketing your post but I guess it all depends on how you are managing them, yes, manage software and don’t rely solely on it. thanks for a great post.

  • Bob

    I did significant work trying to get something like this to work in the past but didn’t have the vision to limit to data heavy content.

    Now that I see it working with data heavy content, I have to laugh. This software — which took hundreds/thousands of hours to create — churns out 8 paragraphs, and someone out there is working right now to create software that takes those 8 paragraphs, and condenses it down to 1.

    So much wasted time =p

  • http://www.elizabethmoon.com EMoon

    Back in the day, we used to play with getting an IBM 1401 to write “poetry” (sort of Beat stuff) [not of course our assigned work...but hey, a computer and a printer, how fun is that?] Moving on into modern times, there are some automated science sites, including one reporting earthquakes in California. (Same format for each post, variable data automatically input from seismograph.) There’s plenty of software designed to help people write anything from term papers to screenplays (plotting software, organizing software, etc.)

    Though I make my living writing (not in this field) I think I won’t panic just yet.

  • http://www.glamquotes.com/quote/i-am-who-i-am-quotes/ Jenny Blake

    This is funny! I also work as a writer, but somehow, I’m not nervous about this taking over. It is a very cool product though.