Tim O'Reilly
Wed, Sep 19, 2007

Followup on Semantic Web approach versus Web 2.0

I wrote the other day about the different approaches the Semantic Web and Web 2.0 take to building intelligent applications.

Meanwhile, the contrast between the Semantic Web and Web 2.0 also cropped up recently in two talks reported on in Ian Mulvany's blog entry about Barcamp Cambridge.

The rough theme of the meetings was tools for science; however, there was a nice diversity in the topics presented. Matt Wood opened with a discussion of the semantic web for science. The gist of his argument is that there are two types of semantic web. There is the Semantic Web with capitals, which comes with all of the specifications in place: full RDF and support for all of the machinery that goes with it. For the short to medium term he identified two significant problems with this. The first is that most researchers don't have the inclination to learn all of the machinery needed to output data in this format; it is miraculous enough to get them to work with well-formatted HTML, but more on that later. The second issue is that getting funding in science to do Semantic Web-related projects is hard. Funding bodies at the moment, outside of computer science, just don't want to go there.

His solution is to use the lowercase semantic web. This means adding minimal amounts of micro-formatting to HTML documents and creating a marketplace of markup. If your system is good, it could gain de facto acceptance in a me-too way. Put it out there because it is easy to put it out there, and if it is good it will be used. (Bioformats.org from Matt is an attempt to do just that with microformats for biology.) Standardisation can come later, or not. In the Q&A, a danger of this approach was pointed out: the domain experts may lose control of the translation of the de facto standard into a standard ontology if, when that process happens, they leave it to the computer science people.
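To make the "lowercase semantic web" idea concrete, here is a minimal sketch of what a microformat-flavoured approach might look like: ordinary HTML with a few extra class attributes, plus a scraper that picks them up. The class names are invented for illustration and are not taken from Bioformats.org or any published microformat.

```python
# A minimal sketch of the "lowercase semantic web" idea: ordinary HTML
# with a few extra class attributes that a scraper can pick up.
# The class names ("bio-gene", "bio-species") are purely illustrative.
from html.parser import HTMLParser

PAGE = """
<p>We knocked out the <span class="bio-gene">BRCA1</span> gene in
<span class="bio-species">Mus musculus</span> and observed ...</p>
"""

class MicroformatScraper(HTMLParser):
    """Collects the text of any tag whose class starts with 'bio-'."""
    def __init__(self):
        super().__init__()
        self._capture = None   # class of the tag currently being captured
        self.found = []        # (class, text) pairs

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class") or ""
        for cls in classes.split():
            if cls.startswith("bio-"):
                self._capture = cls

    def handle_data(self, data):
        if self._capture:
            self.found.append((self._capture, data.strip()))
            self._capture = None

scraper = MicroformatScraper()
scraper.feed(PAGE)
print(scraper.found)   # [('bio-gene', 'BRCA1'), ('bio-species', 'Mus musculus')]
```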

...

Peter Corbett talked about teaching computers to understand text. He described himself as someone who had a desk at the computer lab and at the chemistry lab. Now he works on computational linguistic chemistry, with the aim of auto-detecting language in chemistry papers to try to recognize chemicals and then auto-markup the papers. The idea is to supplement the mark-up from publishers. His system can also draw the chemicals and annotations and overlay them on the paper. Some problems they encounter are that papers can contain new names, compact names, and names with extra hyphens; his program can deal with these kinds of things to a certain extent. You can go from plain text to something like a connection layout using an information-rich markup. The RSC is using this software, along with human clean-up, to create markup of chemistry papers. The hope is that you can then do semantic search over papers.

One of the gems from his talk was a small natural language processing trick he described. Imagine we were interested in opiates: we could just ask Google for "opiates", but if you take into account the structure of language and search for phrases like "opiates such as", you will get a much better result. There are many patterns like this, and I think he said that they are known as Hirst patterns, though I may have misheard this. He did a pass over abstracts on PubMed for these kinds of patterns to make a network of relationships. It turns out that you can do reasoning on structure as well as processes using this analysis.

A few bits of wisdom from his work were that most of the information has come from biochemists rather than chemists; more biologists are into open science and open databases. Chemistry has been mostly captured by commercial interests, and it is hard to get free chemistry data. It is important to define what you are looking for so that you can evaluate how well the software has done, and it's important to remember that in a lot of text there is a difference between what you think the world looks like and how it is described in the literature.
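As a concrete illustration of that "such as" trick, here is a rough, self-contained sketch of the pattern-matching idea. The sentences are invented and this is not the actual RSC/Corbett pipeline; it is just the general shape of mining hyponym relationships from text.

```python
import re

# Toy illustration of the "X such as Y" pattern trick described above.
SENTENCES = [
    "Opiates such as morphine and codeine are widely studied.",
    "Receptors such as GPCRs respond to many ligands.",
    "Doctors prescribe opiates such as fentanyl for severe pain.",
]

# "<category> such as <example>[, <example>]* [and <example>]"
PATTERN = re.compile(
    r"(\w+)\s*,?\s+such as\s+((?:\w+)(?:(?:,|\s+and)\s+\w+)*)",
    re.IGNORECASE,
)

relations = []
for sentence in SENTENCES:
    for category, examples in PATTERN.findall(sentence):
        for example in re.split(r",\s*|\s+and\s+", examples):
            relations.append((example, "is-a", category.lower()))

print(relations)
# [('morphine', 'is-a', 'opiates'), ('codeine', 'is-a', 'opiates'),
#  ('GPCRs', 'is-a', 'receptors'), ('fentanyl', 'is-a', 'opiates')]
```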



Comments: 6

  Search Engines WEB [09.19.07 07:13 AM]

"But I'm confident that in the end, Web 2.0 and the Semantic Web are going to meet in the middle and become best friends."


Is this implying a compromise of the technologies in order to be harmonious?

They do not have to meet in the middle; they are independent technologies that do not inherently hinder one another, so each can develop as far as it is able without affecting the development of the other.

The Semantic Web is how the author perceives the document thematically, with the intent of making it accessible to other resources via an internal, mutually agreed-upon and understood communication.

The Social Web allows the public to evaluate the theme and quality of the document, then externally distribute the information to other information resources.

The only disharmony that could occur is a difference in theme and quality evaluation between the document's author and the public readership.

An additional 'layer' is conceivable in the future, one that would sit on top of the document or house the document and 'inject' the public's perception into the document's semantics.



In essence this is what search engines do now: they acknowledge the Title and Meta tags given by the document's author, as well as the backlink anchor text given by those who link to it, while making a distinction between the quality of the backlinks.


BTW: The search engines Ask and Teoma attempted to carry this one step further earlier in this decade. They came up with 'Subject Specific Rank' or 'Expert Rank', which differentiated between knowledgeable backlinks and common laymen's backlinks.

It was not so focussed on the overall authority of the backlink, but on the ability of the page hosting the backlink to make a well-informed judgment about the specific topic it was backlinking to, regardless of its common authority or its authority on separate issues.

DirectHit sought to combine early social elements in the 1990s by tracking 'Click Popularity'; they were acquired by Ask.

Now search engines are using all of these elements to varying degrees in their algorithms.
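To make the idea behind 'Subject Specific Rank' a little more concrete, here is a toy sketch, not Ask/Teoma's actual algorithm, of weighting a backlink by the linking page's expertise on the target's specific topic rather than by its overall authority. The pages and term lists below are hypothetical.

```python
# Toy illustration: score backlinks by topic-specific expertise,
# not by the linking page's overall authority.
def topic_expertise(linking_page_terms, topic_terms):
    """Crude expertise score: fraction of the topic's vocabulary the linking page uses."""
    overlap = set(linking_page_terms) & set(topic_terms)
    return len(overlap) / len(set(topic_terms))

def subject_specific_rank(backlinks, topic_terms):
    """Sum topic-specific expertise over all pages linking to the target."""
    return sum(topic_expertise(page_terms, topic_terms) for page_terms in backlinks)

# Hypothetical pages linking to an article about opiate pharmacology:
topic = ["opiate", "receptor", "analgesic", "morphine"]
expert_blog = ["morphine", "receptor", "binding", "analgesic", "dose"]
general_news = ["celebrity", "weather", "opiate", "sports"]

# The expert blog contributes far more to the score than the general news site.
print(subject_specific_rank([expert_blog, general_news], topic))
```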

  Deepak [09.19.07 04:42 PM]

Tim,

Great post. It's something we constantly struggle with as an industry as well. On the one hand, there are many people developing fantastic semantic standards and approaches that can be very powerful for the sciences. On the other hand, you have an army of people for whom XML looks like hieroglyphics.

An example of this can be seen with mmCIF, a format for representing protein structures. Structural biologists are used to the very old-school PDB format (columns of data), while mmCIF is a data dictionary for representing the same information. While there has been a move to support mmCIF, in some cases dictated by the RCSB, in general there has been great resistance from the scientific community to making the move.
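To make the contrast concrete, here is a rough sketch of the two styles: fixed-column PDB records versus mmCIF's key/value data-dictionary approach. The records below are abbreviated and hand-written for illustration; real files follow the wwPDB specifications, which define the exact column positions and item names.

```python
# Abbreviated, illustrative records; real files follow the wwPDB specs.
PDB_RECORD = "ATOM      1  N   MET A   1      38.428  13.104   6.364"

MMCIF_RECORD = {
    "_atom_site.group_PDB": "ATOM",
    "_atom_site.id": "1",
    "_atom_site.type_symbol": "N",
    "_atom_site.label_comp_id": "MET",
    "_atom_site.Cartn_x": "38.428",
    "_atom_site.Cartn_y": "13.104",
    "_atom_site.Cartn_z": "6.364",
}

# Old-school parsing: slice fixed columns (positions approximate here).
x = float(PDB_RECORD[30:38])
y = float(PDB_RECORD[38:46])
z = float(PDB_RECORD[46:54])
print("PDB coordinates:  ", (x, y, z))

# Dictionary-style parsing: look values up by name instead of position.
coords = tuple(float(MMCIF_RECORD[f"_atom_site.Cartn_{axis}"]) for axis in "xyz")
print("mmCIF coordinates:", coords)
```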

As we discussed during Andrew Walkingshaw's session at Scifoo, I believe the battles will be won by providing a layer between the semantic web and the data; a layer that allows the non-experts not to have to worry about semantic tags and the like, but to worry about the science, while a bunch of data/ontology/software geeks tackle the semantics. Of course, in time there will be a generation of researchers who will be weaned on data standards and ontologies (at least in biology).
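A minimal sketch of what such a layer might look like, under the assumption that scientists hand over plain key/value records and the ontology folks maintain the mapping to semantic predicates. The URIs below are invented placeholders, not a real published ontology.

```python
# Maintained by the data/ontology/software folks (placeholder URIs):
PREDICATES = {
    "organism": "http://example.org/ontology/fromOrganism",
    "resolution": "http://example.org/ontology/hasResolution",
    "method": "http://example.org/ontology/determinedBy",
}

def to_triples(record_uri, record):
    """Translate a plain dict into (subject, predicate, object) triples."""
    return [
        (record_uri, PREDICATES[key], str(value))
        for key, value in record.items()
        if key in PREDICATES   # silently skip fields nobody has mapped yet
    ]

# What the scientist actually writes: no semantic tags in sight.
structure = {"organism": "Homo sapiens", "resolution": 1.8, "method": "X-ray"}

for triple in to_triples("http://example.org/structures/1ABC", structure):
    print(triple)
```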

In that regard, you are right: there will be a convergence between the user-experience-focussed approach of Web 2.0 and the back-end semantics that WILL drive life science forward.

  JP [09.19.07 09:49 PM]

They are called "Hearst Patterns" after Prof. Marti Hearst at U.C. Berkeley I-school. The relevant reference is:

Hearst, M. (1992). Automatic Acquisition of Hyponyms from Large Text Corpora. Proceedings of the Fourteenth International Conference on Computational Linguistics, Nantes, France.

  Bardo N. Nelgen [09.20.07 10:45 PM]

Being a Semantic Web project leader myself, one major difference between the Semantic Web (in the narrower sense) and Web 2.0 seems to me to be that the latter, for most people (including myself), describes an outcome or at least a resulting type of application, while the former is just one of many available vehicles to achieve this outcome.

Using the Semantic Web as a way to build a Web 2.0 application has obvious disadvantages:

  1. It takes much longer to get your pants on: there are about 1,400 pages of standards and methods between you and your first app, and even more documentation is assumed to be missing for the tools you are about to use in order to get it programmed…
  2. Your learning curve is pretty steep, even just until you and your crew have grasped the most basic concepts. We ended up splitting the tasks of programming (Java recommended) and data-modelling/markup-writing between different people, as it turned out that the two tasks required quite different mindsets.

On the other hand, it turned out that with all of the additional work come some quite unexpected results:

  1. If you do it properly, you really only got to do it once. We actually found ourselves reusing our first creations quite early, as well as deploying those provided by other people with ease. The Semantic Web's consistent standardization approach really allows you to repurpose your stuff quite early, just as so many others have promised before without you ever getting there.
  2. Semantic Web data clearly takes away the pain from sharing data across company or other technical boundaries, because you already have your processing in place for whatever is going to come in from out there, or vice versa.

Our result:

If you are really about to create a single-domain application with only a limited need to exchange data with the outside (such as users uploading files or developers submitting a bunch of parameters), most likely the Semantic Web approach will be a waste of production time and therefore money (at least until better developer tools become available).

Nevertheless, the more different and independent (!) of each other the various parties who are supposed to use the resulting applications are (for instance, when you are building an exchange or trading platform), the more it is probably worth going through the accompanying hassle and getting your feet wet with Semantic Web technologies.

  Grant Barrett [09.21.07 06:02 AM]

Hearst patterns are more complex than necessary for the work I typically do, but in lexicography I use "collocations," which are similar, defined by the New Oxford American Dictionary as "the habitual juxtaposition of a particular word with another word or words with a frequency greater than chance." At the most basic level, I use more than 800 of them in automated Google Alerts in my hunt for new and newish words.
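For readers wondering what "a frequency greater than chance" looks like in practice, here is a toy sketch using pointwise mutual information, one common way to score collocation strength. The corpus is invented, and this is not a description of Grant's actual workflow.

```python
import math
from collections import Counter

# Score adjacent word pairs with pointwise mutual information (PMI):
# how much more often they co-occur than chance predicts.
CORPUS = (
    "strong tea and strong coffee but powerful computers "
    "strong tea again and more strong tea with powerful computers"
).split()

words = Counter(CORPUS)
bigrams = Counter(zip(CORPUS, CORPUS[1:]))
n = len(CORPUS)

def pmi(w1, w2):
    """Log ratio of observed bigram frequency to the chance expectation."""
    p_pair = bigrams[(w1, w2)] / (n - 1)
    p_w1, p_w2 = words[w1] / n, words[w2] / n
    return math.log2(p_pair / (p_w1 * p_w2))

for pair in [("strong", "tea"), ("powerful", "computers"), ("and", "strong")]:
    print(pair, round(pmi(*pair), 2))
```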

What interests me most in reading about the Semantic Web over the last few years is how closely it resembles the large-scale projects that dictionary-makers have undertaken to take their "flat" dictionaries--that is, those that are completely pre-digital, or else are digital but are mostly straight text with styling information--and convert them to complex XML written to custom DTDs. I've been a part of one major project and can agree with Bardo completely when he says "If you do it properly, you really only got to do it once" and "If you are really about to create a single-domain application with only a limited need to exchange data with the outside...most likely the Semantic Web approach will be a waste of production time."

  Bardo N. Nelgen [09.22.07 12:01 AM]

Thanks for the support, Grant. ;-)

Some currently argue that the Semantic Web (which I am admittedly very passionate about…) will never become real, or at least useful, because it would need too many people to translate everything on the web and in the world into semantic expressions.
   But who talked about everything? Prominent non-semantic applications like Wikipedia or even search engines' 'suggest' features have shown us that enough to be useful can be reached with several thousand entries, rather than millions or billions.
   Which is (with regard to the web's global scale) not very much actually… especially as Semantic Web technology has already been adopted for real applications (often just for internal use) by companies such as Adobe, Vodafone, Audi (the carmaker) and a bunch of other well-known names.

This seems to me quite similar to the early XML adoption at the end of the last century: no one really knew whether it was going to be useful or just another IT fad like the many they had already seen.

So let's handle Semantic Web technology just like we handled XML back then: wait and see what purposes people figure out it is usable for… :-) :-)
