Tue, Apr 25, 2006

Tim O'Reilly

Open Text Mining Interface

Timo Hannay of Nature has published a bit more information on their Open Text Mining Interface (OTMI). I first saw this a few weeks ago when I was on a panel that Timo chaired at the Life Sciences Expo in Boston. It immediately struck me as "slap-your-forehead brilliantly obvious." A lot of people want full-text access to journal articles (or books, for that matter). For example, a search engine that wants to understand which gene sequences are described in an article needs the full text, not just the abstract. But authors (and publishers) aren't keen on exposing the full text to all readers. OTMI solves this problem. It's an XML format that expresses the full text as word vectors plus "snippets" (an alphabetically ordered sentence list) for programs to run full-text search against. This way, a program can data-mine the full text of the article, but a human can't "read" it sequentially.
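To make the idea concrete, here is a minimal Python sketch of the two ingredients described above: a word vector (word counts) and a list of sentences sorted alphabetically so they can be searched but not read in order. This is only an illustration under my own assumptions. It does not reproduce the actual OTMI XML format, and the function names (`word_vector`, `snippets`, `mine`) and the sample text are made up for the example.

```python
import re
from collections import Counter


def word_vector(text):
    """Count how often each lowercased word occurs in the text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(words)


def snippets(text):
    """Split the text into sentences and return them in alphabetical order.

    Sorting discards the original reading order, so the sentences stay
    searchable but no longer form a readable article.
    (The sentence splitter here is deliberately naive.)
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return sorted(sentences, key=str.lower)


def mine(text):
    """Bundle both views of the text (hypothetical structure, not OTMI XML)."""
    return {"vector": word_vector(text), "snippets": snippets(text)}


# Made-up sample text, purely for illustration.
article = (
    "The BRCA1 gene was sequenced. "
    "Results were confirmed. "
    "A mutation in BRCA1 was found."
)
result = mine(article)
print(result["vector"]["brca1"])
for sentence in result["snippets"]:
    print(sentence)
```

A program can still ask "how often does BRCA1 appear?" or "which sentences mention a mutation?", while the shuffled sentence order keeps a human from reading the piece start to finish.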

Now, it may be that, like all forms of DRM, this will encounter user resistance from folks who believe in open access to everything. But I love the cleverness of this approach, which lets machines make use of the content in ways that human readers can't. You might consider it a "copyright hack."

Timo's group at Nature is one to watch. I'm also a fan of their Connotea project, which aims to bring del.icio.us-style tagging to scientific citations.



Comments: 4

  Deep [04.25.06 01:13 PM]

Fine, unless you want to do more serious analysis of the text instead of simplistic keyword search. What about all the natural language processing folks who want to implement a cross-sentence pronoun co-referencing algorithm, for example? Keyword search is a transient phenomenon en route to much more sophisticated search.

  Andrea [04.25.06 02:54 PM]

At first glance, Connotea looks quite a lot like CiteULike, which has been around since the end of 2004. I don't have the time right now to do a proper comparison, but the features listed in the FAQs look more or less the same.

  Colm [04.26.06 10:45 AM]

For internal systems searching, the Google Search Appliance (www.google.com/gsa) can store usernames and passwords for protected content while only serving up the abstract in the results.

This means it can apply its powerful algorithm to the protected content.

Wouldn't that be easier than trying to get all the publishers to output in that format? Besides, if it's machine readable then I'm sure it won't be long before it gets reverse engineered so a machine can convert it to readable text.

  Timo Hannay [04.27.06 01:45 AM]

Thanks for the comments.

Deep: I agree that you can't do all kinds of text mining with this format. The hypothesis we're testing is that you can do enough to make it worthwhile providing something like OTMI files. For example, in my very limited experience, many scientific text-mining studies seem to use rather simple approaches based on individual words or sub-sentence phrases. If such approaches are becoming so out-of-date as to be worthless, then OTMI as it's currently proposed obviously wouldn't be worth doing. My current impression, having spoken with a few scientists about it, is that that's not the case, but I could be wrong.

Colm: The idea is to let people apply their own algorithms rather than Google's or anyone else's.

Andrea: Yes, CiteULike and Connotea were developed independently at around the same time (CiteULike launched 2-3 weeks before Connotea in around October 2004, I think). They are both inspired by del.icio.us and have similar but by no means identical functionality.
