O'Reilly Labs: Code Search and Content Stats

As part of our SafariU platform (which allows professors or trainers to build custom books using the entire corpus of books from Safari as a resource, mixing in their own materials at will), we’ve built a MarkLogic XQuery database containing the source for all of our books.

So Ryan Grimm and Andy Bruno started asking themselves what else they could do with all that content. A couple of their initial projects are up on our new O’Reilly Labs site. The first, Code Search, lets you search through the more than 2.6 million lines of example code from almost 700 O’Reilly books. You can limit your search to a particular book, a particular category (e.g., Perl or Java), or a particular author.

Documentation on the search syntax can be found on the Labs Wiki.

The second, Content Stats, is probably less immediately useful, except perhaps to content wonks, but it is even cooler. Want to know how many total pages there are in all O’Reilly books? (309,647) How many examples? (123,439) Do our Java books or our Perl books have more lines of code per page, on average? (Java) How many lines? (14.76 vs. 10.97 for Perl.) How many index entries are there in an average O’Reilly book? (1,783) The stats are linked to the search box, so changing the search recalculates the stats for just the books matching the search results. There’s also a cool tag cloud of the most commonly appearing technical terms across all O’Reilly books… and clicking on a term takes you to a listing of all the books containing the term. From there, you can click to a content statistics page for each book.
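For the curious, here is a rough sketch of how a figure like "lines of code per page" might be recomputed whenever the search narrows the set of books. This is not how the Labs site is implemented (that runs on the MarkLogic database); it is plain Python, and the `Book` record, the `code_lines_per_page` function, and the sample numbers are all hypothetical, invented purely for illustration. It also assumes the average is pooled (total lines divided by total pages) rather than a mean of per-book ratios.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Book:
    """Hypothetical per-book record; the field names are assumptions."""
    title: str
    category: str
    pages: int
    code_lines: int


def code_lines_per_page(books: list, category: Optional[str] = None) -> float:
    """Pooled lines of code per page over the books matching the filter.

    With category=None every book is included, mimicking an empty search.
    """
    matching = [b for b in books if category is None or b.category == category]
    total_pages = sum(b.pages for b in matching)
    if total_pages == 0:
        # No matching books (or no pages): avoid dividing by zero.
        return 0.0
    return sum(b.code_lines for b in matching) / total_pages


# Made-up sample data, NOT real O'Reilly figures.
BOOKS = [
    Book("Example Java Book A", "Java", 400, 6000),
    Book("Example Java Book B", "Java", 600, 8000),
    Book("Example Perl Book", "Perl", 500, 5500),
]
```

Calling `code_lines_per_page(BOOKS, "Java")` and then `code_lines_per_page(BOOKS, "Perl")` is the moral equivalent of changing the search and watching the stats update: the same calculation, run over a different subset of books.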

We’re noodling ideas on how to build some of this into Safari, as well as oreilly.com. We’d love your ideas on other applications of these tools.