O'Reilly Labs: Code Search and Content Stats
As part of our SafariU platform (which allows professors or trainers to build custom books using the entire corpus of books from Safari as a resource, mixing in their own materials at will), we've built a MarkLogic XQuery database containing the source for all of our books.
So Ryan Grimm and Andy Bruno started asking themselves what else they could do with all that content. A couple of their initial projects are up on our new O'Reilly Labs site. The first, Code Search, lets you search through the more than 2.6 million lines of example code from almost 700 O'Reilly books. You can limit your search to a particular book, a particular category (e.g. Perl, or Java), or a particular author.
Documentation on the search syntax can be found on the Labs Wiki.
The Content Stats is probably less immediately useful except perhaps to content wonks, but is even cooler. Want to know how many total pages there are in all O'Reilly books? (309,647) How many examples? (123,439) Do our Java books or our Perl books have more lines of code per page, on average? (Java) How many lines? (14.76 vs. 10.97 for Perl.) How many index entries are there in an average O'Reilly book? (1,783) The stats are linked to the search box, so changing the search refigures the stats for the books matching the search result. There's also a cool tag cloud of the most commonly appearing technical terms across all O'Reilly books... and clicking on a term takes you to a listing of all the books containing the term. From there, you can click to a content statistics page for each book.
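To make the "refigured stats" idea concrete, here is a minimal Python sketch of how a figure like lines of code per page might be recomputed for whichever books match a search. It is not how the Labs site is actually built (that runs on the MarkLogic database). The `Book` record, the `matching` and `lines_per_page` helpers, and all of the numbers are made up for illustration, and it assumes the average is total lines of code divided by total pages across the matching books.

```python
from dataclasses import dataclass


@dataclass
class Book:
    title: str
    category: str
    pages: int
    code_lines: int


# Made-up figures for illustration only; these are not real O'Reilly data.
BOOKS = [
    Book("Hypothetical Java Book A", "java", 400, 6000),
    Book("Hypothetical Java Book B", "java", 600, 9000),
    Book("Hypothetical Perl Book", "perl", 500, 5000),
]


def matching(books, category=None):
    """Return the books in a category, or every book when no category is given."""
    return [b for b in books if category is None or b.category == category]


def lines_per_page(books):
    """Total lines of code divided by total pages (assumed definition of the average)."""
    total_pages = sum(b.pages for b in books)
    if total_pages == 0:
        return 0.0
    return sum(b.code_lines for b in books) / total_pages


# Narrowing the "search" changes the set of books, and the stat follows.
for category in (None, "java", "perl"):
    subset = matching(BOOKS, category)
    print(category or "all", len(subset), round(lines_per_page(subset), 2))
```

The point is only that the statistic is a function of the current result set, so narrowing the search to one category, book, or author changes the numbers shown.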
We're noodling ideas on how to build some of this into Safari, as well as oreilly.com. We'd love your ideas on other applications of these tools.
Comments: 8
[08.25.06 09:31 AM]
hi tim,
nice experiment -- maybe, your competitors will figure out "THE SECRET" recipes for your successful books ?
BR,
~A
[08.25.06 09:51 AM]
Hey nice work guys! How about a consolidated index? Or uber-glossary.
I'd also like to see the O'Reilly content decorating other content, like maybe the Linux Documentation Project.
[08.25.06 06:54 PM]
Code search will be immensely helpful - please develop it - This is long overdue
[08.25.06 11:09 PM]
Love the code snippets!
Up to now, it has been a total pain to search for the right book in Safari. The book titles and pictures of the book covers from the search results are close to useless from a user's perspective - the result is just not specific enough. Even if I navigate into the book using the tree menu (which btw is dog slow), the preview, i.e. the intros in each section, conveys very little insight about the quality of the actual content. I end up using Amazon's customer reviews most of the time to determine which books I want to put into my bookshelf.
Code search is definitely a huge improvement over the existing book search in Safari. Thanks!
[08.28.06 12:39 AM]
Interestingly, Safari has been providing Code Search for several years, as well as Amazon editorial reviews and ranking. Are these features just not visible enough?
PS. I couldn't agree more with Tony's suggestions for an uber-index and uber-glossary! I think they're already on Safari's roadmap...
[09.12.06 09:21 AM]
This is a fantastically useful tool, and the fact that you don't have to have Safari to use it makes it even better.
The main weakness I've found so far is that there's no obvious way to limit by programming language. Unless there's something I'm missing, searching for C# code is like looking for a needle in a haystack.
[09.14.06 12:16 PM]
>>searching for C# code is like looking for a needle in a haystack.
not really, have you tried
"cat:csharp select",
"cat:csharp Array" in other words
cat:csharp SearchTerm
Michael Bernstein [08.25.06 08:43 AM]
Huh. This is pretty interesting stuff. The first thing I tried to find was a way to search by license. I believe at least some of the books are licensed under the GFDL or Creative Commons licenses, but couldn't figure out how to pull up that subset.