Aug 15

Tim O'Reilly

Programming Collective Intelligence

When Time Magazine picked "You" as their Person of the Year for 2006, they cemented the idea that Web 2.0 is about "user generated content" -- and that Wikipedia, YouTube, and MySpace are the heart of the Web 2.0 revolution. The true story is so much more complex than that. The content that users contribute explicitly to Web 2.0 sites is the small fraction that is visible above the surface. 80% of what matters is below, in the dark matter of implicitly-contributed data.

In many ways, the defining moment of the Web 2.0 revolution was Google's invention of PageRank, the realization that every link on the World Wide Web was freighted with hidden meaning: a link is a vote about the importance of a site. Understanding those votes, and the relative importance of the sites that were voting, gave better search results than merely studying the web pages themselves. It was the breakthrough that launched Google on its path to becoming the most important tech company of the new century. PageRank is now one of hundreds of implicit factors that Google uses in deciding what search results to feature.
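To make the "link as a vote" idea concrete, here is a minimal sketch of a simplified PageRank-style calculation in Python. It is only an illustration under stated assumptions: the four-page toy web, the damping factor of 0.85, the fixed iteration count, and the function name `pagerank` are all made up for this example, and none of it reflects Google's production system.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by repeated vote passing (power iteration).

    links maps each page to the list of pages it links to.
    The damping value and iteration count are illustrative assumptions.
    """
    # Collect every page that appears as a source or a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)

    # Start with every page equally important.
    ranks = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page gets a small baseline share of importance.
        new_ranks = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its vote evenly among the pages it links to,
                # so a vote from an important page counts for more.
                share = damping * ranks[page] / len(targets)
                for target in targets:
                    new_ranks[target] += share
            else:
                # A page with no outgoing links spreads its vote over all pages.
                share = damping * ranks[page] / n
                for other in pages:
                    new_ranks[other] += share
        ranks = new_ranks
    return ranks


# Hypothetical toy web: "c" is linked to by three pages, "d" by none.
toy_web = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}
ranks = pagerank(toy_web)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy graph, "c" collects the most votes and ends up ranked highest, while "d", which nobody links to, keeps only the baseline share. Page "a" ranks well despite having a single inbound link, because that link comes from the important page "c".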

No one would characterize Google as a "user generated content" company, yet they are clearly at the very heart of Web 2.0. That's why I prefer the phrase "harnessing collective intelligence" as the touchstone of the revolution. A link is user-generated content, but PageRank is a technique for extracting intelligence from that content. So are Flickr's "interestingness" algorithm, Amazon's "people who bought this product also bought...", Last.fm's algorithms for "similar artist radio", eBay's reputation system, and Google's AdSense.

I defined Web 2.0 as "the design of systems that harness network effects to get better the more people use them." Getting users to participate is the first step. Learning from those users and shaping your site based on what they do and pay attention to is the second step.

There has been an enormous amount of programming creativity applied to developing new techniques for extracting meaning (another word for "intelligence") from data. And more creativity is needed. We're still at the beginning of the collective intelligence revolution. But that's no excuse for re-inventing the wheel. So we decided to document the state of this emerging art.

Toby Segaran's new book, Programming Collective Intelligence, teaches algorithms and techniques for extracting meaning from data, including user data. This is the programmer's toolbox for Web 2.0. It's no longer enough to know how to build a database-backed web site. If you want to succeed, you need to know how to mine the data that users are adding, both explicitly and as a side-effect of their activity on your site.

There's been a lot written about Web 2.0 since we first coined the term in 2004, but in many ways, Toby's book is the first practical guide to programming Web 2.0 applications. (We won't tell you how to be the next Google, but we'll teach the basic techniques that are part of the price of entry. Better or more specialized algorithms are going to be the heart of each Web 2.0 company's secret sauce.)

Take a look at the table of contents (or better yet, the book itself), and let me know what you think.

tags: web 2.0  | comments: 27

Comments: 27

  Michael R. Bernstein [08.15.07 10:32 PM]

The ToC is actually making my fingers itch to play with some of this stuff!

I just skimmed through the sample chapter (Chapter 4: Searching and Ranking), and I am impressed with the quick but rigorous progression through the topic. Also, the code examples are good, being short, simple, practical, and illustrative, without being dumbed down. If it is representative of the rest of the chapters listed in the TOC (some of which tackle much more complex techniques) I'd say you've got a real winner here.

SQLite is a good choice (in fact an obvious choice) for an example RDB if you're going to use one in examples. I could quibble with whether it was really necessary for these examples to use a relational DB at all or whether persistent BTrees (or even just in-memory data structures) would have worked better pedagogically, but I'm sure even more folks would have complained over the absence of SQL, so... [shrug]

The sample chapter has a notice on it that I hadn't noticed on your sample chapters before which I found a bit curious:

This excerpt is protected by copyright law. It is your responsibility to obtain permissions necessary for any proposed use of this material. Please direct your inquiries to

Shouldn't that be "for any proposed use of this material beyond the bounds of Fair Use"?

  Josh Spaulding [08.15.07 10:36 PM]

Very very interesting. I'm really curious what Toby's views are on web 2.0

Tim, your definition of Web 2.0 is pretty accurate, yet many think differently.

Extracting meaningful data and analyzing it is really the heart of the increasing technology we have these days!

If you can't see how people/things react to an action, you won't know how to improve.

Great article!

  Léon van Berlo [08.15.07 11:17 PM]

Wow... nice table of contents.
I thought that decision trees and Bayesian statistics probably (yes, probably) wouldn't be in there, so I was positively surprised that they are!
Looks like a very cool and complete book. I'm going to buy it!
By the way... what does this comment do to the intelligence of this website... (think about it!)

  Aneesha [08.16.07 02:39 AM]

This is my must have book for the year. I look forward to reading the svm/kernels content.

  alex de jong [08.16.07 03:32 AM]

Great post. We've just finished deploying our reb me app on facebook, and figured that for us that is only the beginning, as so much of what we want to do is about mobilizing the collective intelligence of our users. We figured that for us to add meaningful features for our own community tracking and userdata mining would not be enough if we couldn't also rely on the social intelligence that is web2.0 (awful term, but there you go)

  geekr [08.16.07 04:53 AM]

Looks very promising! It would be interesting to implement these algorithms in Erlang which scales well on multi-core and distributed systems:

  James Schwahn [08.16.07 07:49 AM]

Great article, and I agree with you on many of your points. The algorithms that bring users content relating to their interests are web 2.0 to me. Ajax is web 2.0 in my eyes also, and many sites use it well and some not so well.

I'll be checking out that book.

  Ajeet Khurana [08.16.07 08:15 AM]

Sounds exciting. In a sense, web 2.0 is about simplification. But, what acts as simplification at the user end can involve some pretty nifty programming at the back end.

I look forward to reading this book.

Incidentally, though I agree with the context in which it was stated, it still sounds paradoxical -- "No one would characterize Google as a 'user generated content' company." True, Google is the company and the algorithm. But its content, the search results, is entirely user-created. The access to the content is what Google provides.

  Daniel Raffel [08.16.07 09:13 AM]

Wow, looks like a great read! Can't wait to add it to my own O'Reilly bookshelf. Congrats on publishing what looks to be another great, timely title. And, great work Toby!!

  Alex Tolley [08.16.07 10:09 AM]

The sample chapter on search looks quite good - a nice blend of domain specific coding and algorithm. If the rest of the book is similar, I would consider it a good addition to my library which contains a number of books on AI and ML, but which are general texts about the algorithms.

  Vasudev Ram [08.16.07 11:53 AM]

Sounds like an interesting book. Going to check it out ...

Vasudev Ram

  Paul [08.17.07 07:51 AM]

So would it be faster to order through O'Reilly or Amazon? They both come out to about the same price for 3 books...

  Tim O'Reilly [08.17.07 07:57 AM]

Paul -- good question. I would guess that it would be about the same either way, but I don't order books from O'Reilly :-) so I don't have actual experience. I do know that books from Amazon usually come more quickly than they say on their website.

  Paul [08.17.07 08:36 AM]

Thanks Tim, I went with Amazon since they said they had it in stock already ;)

Looking forward to the books!

  Allen Noren [08.17.07 10:34 AM]

Hello Paul,

Depending where you are, standard shipping from Amazon and O'Reilly are the same, as are the expedited services. Our buy 2, get 1 free w/ free shipping is a great value, and we appreciate direct orders.

--Allen Noren

  Anonymous [08.18.07 07:40 AM]

This looks like a great and useful book. It seems a shame, however, that there aren't sections on both privacy and data security. The book covers the sorts of tools that reward a little more thought and care early on, ideally during the software design phase.

Sure, privacy concerns are an issue for all software design and I wouldn't expect to see a privacy section in, say, a Nutshell book. But "Collective Intelligence" is about data mining, about figuring out where to draw the line between private customer data and monetized corporate assets. It's worth thinking about the difference between what you can do (technical), what you're permitted to do (legal), and what you ought to do (PR, long term v. short term profits, accepted norms, ethics, etc.)

I look forward to reading the book. Since it sounds like it doesn't matter to you one way or another, I'll order from Amazon. May as well drive the sales stats up there. :-)

  Tim O'Reilly [08.18.07 09:19 AM]

Anonymous --

Really good point about privacy and data security concerns. I will definitely forward your comment to the author. You're right that it's so easy to treat technical topics in isolation from policy and social concerns. Good catch.

Meanwhile -- we *do* love it when people order direct from us. We make a lot more money. But we also love it when people order from retailers. And you're right that for a publisher and author, driving up amazon rank can be worth a lot. But my guiding principle on this is one I wrote up in support of independent retailers a few years ago, Buy where you shop

  Michael R. Bernstein [08.18.07 04:56 PM]

Ah, that jogged something loose from my brain. One specific implementation topic that relates to privacy and security is 'translucent databases', i.e. hiding the detailed data (even from yourself, potentially) while still giving unrestricted ad-hoc query access to the aggregate. A chapter on that would be sweet.

  Anonymous [08.19.07 11:01 PM]

The estimated shipping time is October on Amazon and I can not find shipping date information when trying to order on

I do not mind paying more when I "buy where I shop", b/c it usually means I can get it right away. In this case I hesitate because it is not clear that I can get it right away (and the price difference is more than 30%).

  Allen Noren [08.20.07 08:47 AM]

Hello Michael,

The book is available from and should be from Amazon in the next couple days. This book qualifies for free shipping from, and you can also use our OPC10 code in our cart to buy two books and get the third free.

  Michael R. Bernstein [08.20.07 12:46 PM]

Hmm. It looks like Radar needs to deal more elegantly with anonymous posts. As it is, the indications that an anonymous post is NOT part of the previous named post are too subtle.

Using some variation on 'Anonymous Coward' as a placeholder for a missing name would solve this.

  Tim O'Reilly [08.21.07 03:58 PM]

For those of you wondering whether to buy from Amazon or directly from O'Reilly, I heard from our Amazon sales rep that Amazon is temporarily out of stock, and the book is in fact showing as "not yet published." She wrote in email:

"Since I've heard from several of you regarding Programming Collective Intelligence and the status on Amazon, I thought I better send out a quick little note to explain the "glitch." As most of you have seen, Amazon's detail page for Programming Collective Intelligence is now showing as a pre-order but just last week it was "available." ... Here's what happened as it's been explained to me. Apparently, Programming Collective Intelligence ran out of stock as quickly as it was received in. Because it ran out of stock so close to the expected pub date, the system threw it back into a pre-order status."

  Chris Andres [09.23.07 10:07 PM]

Having worked for a couple of different internet start-up companies which employed “Web 2.0” methodology, this book looks really interesting. The surge of user generated content has really changed the ways that new web companies operate and look to make money. I will be interested what this book has to say about user feedback loops and other algorithms for enhancing websites based on new technologies and Web 2.0 principles.

  John Erickson [10.17.07 10:54 AM]

This is a superbly practical book; I'm recommending it to everyone --- at least this week!

I received the book yesterday and started working through the Python-based examples in the evening. I've mostly skimmed the textual material, because much of it I am familiar with due to my own research. The text I have read is well-done and explains things MUCH more efficiently than the research papers I've trudged through (professors take note!)

My only complaint (so far) is that due to practical constraints, the reader shouldn't stray too far from the example data structures (such as "critics" in the "Making Recommendations" chapter) until they have verified the basic functionality and their understanding.

There are cases where one might tweak a structure and the code will seem broken but it ISN'T; it's merely the case that the examples have been set up to give good results on small datasets.

Finally, it wasn't clear where on the book website to download the example code from. I gave up and just banged it in...

  J.O. Urban [10.29.07 01:01 PM]

"harnessing collective intelligence" - Tim O'Reilly

I think you hit it right on the dot with that one. The leaders of tomorrow will be those who are able to integrate this passive network intelligence into their technology and online business processes. Those that develop systems which can harness this intelligence and exploit its apparent benefits will be the most successful, in my opinion.

  Niraj J [11.04.07 04:46 PM]

Enterprise context to collective intelligence and the network effect.

  Niraj J [11.09.07 09:00 AM]

Tim,

On your note about Buy where you shop

Your advice, while correct, will not be taken up by the masses. The reality is that a lot of people are going to use the service of browsing through the stores and then shop online.

The shopkeeper needs to figure out better ways of monetizing the shop experience. This is the same challenge YHOO is facing: while it is the most visited site, people do not click on the advertisements. Who should take the blame for this, the customer or YHOO? I think YHOO. Blaming the customer is not going to get you anywhere.

Maybe the bookstore shopkeeper should understand that if he made dollars from the book business, he is going to make dimes now as a result of online competition. Maybe he should start selling premium coffee in his store to make up the margins. Maybe he should keep a few samples of every book rather than stocking a lot, and offer the same discounted prices to customers who want to shop at his online store.

The same problem exists in open source world.
check out my thoughts on monetization of open source at
