Programming Collective Intelligence

When Time Magazine picked “You” as their Person of the Year for 2006, they cemented the idea that Web 2.0 is about “user generated content”, and that Wikipedia, YouTube, and MySpace are the heart of the Web 2.0 revolution. The true story is much more complex than that. The content that users contribute explicitly to Web 2.0 sites is the small fraction that is visible above the surface. Eighty percent of what matters is below, in the dark matter of implicitly contributed data.

In many ways, the defining moment of the Web 2.0 revolution was Google’s invention of PageRank, the realization that every link on the World Wide Web was freighted with hidden meaning: a link is a vote about the importance of a site. Understanding those votes, and the relative importance of the sites that were voting, gave better search results than merely studying the web pages themselves. It was the breakthrough that launched Google on its path to becoming the most important tech company of the new century. PageRank is now one of hundreds of implicit factors that Google uses in deciding what search results to feature.
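To make the “link as a vote” idea concrete, here is a minimal sketch of the simplified, textbook form of PageRank, computed by repeated redistribution of rank along links. It is an illustration only, not Google’s production algorithm: the function name `pagerank`, the damping factor of 0.85, the fixed iteration count, and the handling of pages with no outgoing links are all assumptions made for this example.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Estimate page importance from a link graph.

    `links` maps each page to the list of pages it links to.
    All names and defaults here are illustrative assumptions.
    """
    # Collect every page that appears as a source or a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)

    # Start with every page equally important.
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page gets a small baseline share of rank.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its vote evenly among the pages it links to,
                # so a vote from an important page is worth more.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank
```

Given a small graph such as `{"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}`, page `c` ends up ranked highest because three pages vote for it, and `a` outranks `b` and `d` because its single vote comes from the highly ranked `c`.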

No one would characterize Google as a “user generated content” company, yet they are clearly at the very heart of Web 2.0. That’s why I prefer the phrase “harnessing collective intelligence” as the touchstone of the revolution. A link is user-generated content, but PageRank is a technique for extracting intelligence from that content. So are Flickr’s “interestingness” algorithm, Amazon’s “people who bought this product also bought…”, Last.fm’s algorithms for “similar artist radio”, eBay’s reputation system, and Google’s AdSense.
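As a rough illustration of how a “people who bought this also bought” feature can be derived from implicit data, the sketch below simply counts how often two items appear in the same order. This is a generic co-occurrence technique and makes no claim about how Amazon actually does it; the function names `also_bought` and `recommend` and the shape of the order data are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import permutations


def also_bought(orders):
    """Count, for each item, how often every other item shares an order with it.

    `orders` is an iterable of item lists, one list per purchase (assumed shape).
    """
    co_counts = defaultdict(Counter)
    for order in orders:
        # Use a set so an item repeated within one order is counted once.
        for item, other in permutations(set(order), 2):
            co_counts[item][other] += 1
    return co_counts


def recommend(co_counts, item, top_n=3):
    """Return up to `top_n` items most often bought together with `item`."""
    counts = co_counts.get(item, Counter())
    return [other for other, _ in counts.most_common(top_n)]
```

No user has to rate or review anything for this to work: the recommendations fall out of purchases that customers were making anyway. A real system would also need to correct for items that are simply popular with everyone.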

I defined Web 2.0 as “the design of systems that harness network effects to get better the more people use them.” Getting users to participate is the first step. Learning from those users and shaping your site based on what they do and pay attention to is the second step.

There has been an enormous amount of programming creativity applied to developing new techniques for extracting meaning (another word for “intelligence”) from data. And more creativity is needed. We’re still at the beginning of the collective intelligence revolution. But that’s no excuse for reinventing the wheel. So we decided to document the state of this emerging art.

Toby Segaran’s new book, Programming Collective Intelligence, teaches algorithms and techniques for extracting meaning from data, including user data. This is the programmer’s toolbox for Web 2.0. It’s no longer enough to know how to build a database-backed web site. If you want to succeed, you need to know how to mine the data that users are adding, both explicitly and as a side effect of their activity on your site.

There’s been a lot written about Web 2.0 since we first coined the term in 2004, but in many ways, Toby’s book is the first practical guide to programming Web 2.0 applications. (We won’t tell you how to be the next Google, but we’ll teach the basic techniques that are part of the price of entry. Better or more specialized algorithms are going to be the heart of each Web 2.0 company’s secret sauce.)

Take a look at the table of contents (or better yet, the book itself), and let me know what you think.