I wrote the other day about the different approaches the Semantic Web and Web 2.0 take to building intelligent applications.
Meanwhile, the contrast between the Semantic Web and Web 2.0 also cropped up recently in two talks reported on in Ian Mulvany’s blog entry about Barcamp Cambridge.
The rough theme of the meetings was tools for science; however, there was a nice diversity in the topics presented. Matt Wood opened with a discussion of the semantic web for science. The gist of his argument is that there are two types of semantic web. There is the Semantic Web with capitals, which comes with all of the specifications in place, full RDF, and support for all of the machinery that goes with this. For the short to medium term he identified two significant problems with this. The first is that most researchers don’t have the inclination to learn all of the machinery needed to output data in this format; it is miraculous enough to get them to work with well-formatted HTML, but more on that later. The second issue is that getting funding in science to do Semantic Web-related projects is hard. Funding bodies at the moment, outside of computer science, just don’t want to go there. His solution is to use the lowercase semantic web. This means adding minimal amounts of microformatting to HTML documents, and creating a marketplace of markup. If your system is good, it could gain de facto acceptance in a me-too way. Put it out there because it is easy to put it out there, and if it is good it will be used. (Bioformats.org from Matt is an attempt to do just that with microformats for biology.) Standardisation can come later, or not. In the Q & A a danger in this approach was pointed out: the domain experts may lose control of the translation of the de facto standard into a standard ontology if, when that process happens, they leave it to the computer science people.
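To make the lowercase approach concrete, here is a minimal sketch of how a consumer might pull annotated terms out of lightly marked-up HTML. The class names `gene` and `organism` are hypothetical, invented for this illustration; they are not taken from Bioformats.org or any published microformat. The sketch also assumes that annotated elements are not nested inside one another.

```python
from html.parser import HTMLParser


class ClassTextExtractor(HTMLParser):
    """Collect the text of elements whose class attribute is in a wanted set."""

    def __init__(self, wanted_classes):
        super().__init__()
        self.wanted = set(wanted_classes)
        self.found = []        # list of (class_name, text) pairs
        self._current = None   # class name of the annotated element being read
        self._tag = None       # tag name of that element
        self._depth = 0        # open tags with the same name, to find the close
        self._text = []

    def handle_starttag(self, tag, attrs):
        if self._current is None:
            classes = (dict(attrs).get("class") or "").split()
            for name in classes:
                if name in self.wanted:
                    self._current, self._tag = name, tag
                    self._depth, self._text = 1, []
                    break
        elif tag == self._tag:
            self._depth += 1

    def handle_endtag(self, tag):
        if self._current is not None and tag == self._tag:
            self._depth -= 1
            if self._depth == 0:
                text = "".join(self._text).strip()
                self.found.append((self._current, text))
                self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self._text.append(data)


def extract_annotations(html, wanted_classes):
    """Return (class_name, text) pairs for the annotated elements in html."""
    parser = ClassTextExtractor(wanted_classes)
    parser.feed(html)
    parser.close()
    return parser.found


# Hypothetical markup: ordinary HTML plus two invented class names.
page = (
    '<p>The <span class="gene">BRCA1</span> gene in '
    '<span class="organism">Homo sapiens</span>.</p>'
)
print(extract_annotations(page, {"gene", "organism"}))
```

The point of the sketch is how little machinery is involved: the author adds a class attribute, and anyone with a standard HTML parser can read the annotation back out.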
Peter Corbett talked about teaching computers to understand text. He described himself as someone who had a desk at the computer lab and at the chemistry lab. Now he works on computational linguistic chemistry, with the aim of auto-detecting language in chemistry papers to try to recognize chemicals and then automatically mark up these papers. The idea is to supplement the mark-up from publishers. His system can also draw the chemical and annotations and overlay them on the paper. Some problems that they encounter are that there can be new names in papers, compact names, and names that include extra hyphens; his program can deal with these kinds of things to a certain extent.
You can go from plain text to something like a connection layout using an information-rich markup. The RSC is using this software along with human clean-up to create markup of chemistry papers. The hope is that you can then do semantic search over papers. One of the gems from his talk was a small natural language processing trick he described. Imagine we were interested in opiates: we could just ask Google “opiates”, but if you take into account the structure of language and search for phrases like “opiates such as”, you will get a much better result in your search. There are many patterns like this, and I think he said that they are known as Hirst patterns, though I may have misheard this. He did a pass over abstracts on PubMed for these kinds of patterns to make a network of relationships. It turns out that you can do reasoning on structure as well as processes using this analysis. A few bits of wisdom from his work were that most of the information has come from biochemists rather than chemists, and that more biologists are into open science and open databases. Chemistry has been mostly captured by commercial interests, and it is hard to get free chemistry data. It is important to define what you are looking for so that you can evaluate how well the software has done, and it’s important to remember that in a lot of text there is a difference between what you think the world looks like and how it is described in the literature.
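The “such as” trick is easy to try out. Lexical patterns of this kind are usually called Hearst patterns in the computational linguistics literature. What follows is only a rough sketch of the idea, not Peter Corbett’s system: the regular expression, the function names and the sample sentences are all my own assumptions, and the pattern only handles single-word category names and single-word examples.

```python
import re
from collections import defaultdict

# Assumed pattern: "<category> such as <item>, <item> and <item>".
# Category and items are limited to single words (letters, digits, hyphens).
SUCH_AS = re.compile(
    r"\b(?P<category>[\w-]+)\s+such\s+as\s+"
    r"(?P<items>[\w-]+"
    r"(?:\s*,\s*(?!(?:and|or)\b)[\w-]+)*"      # ", item" repeated
    r"(?:\s*,?\s+(?:and|or)\s+[\w-]+)?)",      # optional final "and/or item"
    re.IGNORECASE,
)

# Separators between items: commas (with optional and/or) or a bare and/or.
ITEM_SEPARATOR = re.compile(
    r"\s*,\s*(?:(?:and|or)\s+)?|\s+(?:and|or)\s+", re.IGNORECASE
)


def extract_such_as(text):
    """Return (example, category) pairs found by the 'such as' pattern."""
    pairs = []
    for match in SUCH_AS.finditer(text):
        category = match.group("category").lower()
        for item in ITEM_SEPARATOR.split(match.group("items")):
            if item:
                pairs.append((item.lower(), category))
    return pairs


def build_network(texts):
    """Group examples under their category across a collection of texts."""
    network = defaultdict(set)
    for text in texts:
        for example, category in extract_such_as(text):
            network[category].add(example)
    return dict(network)


# Made-up sentences standing in for abstracts.
abstracts = [
    "Opiates such as morphine, codeine and heroin act on the same receptors.",
    "Treatment with opiates such as methadone was continued.",
]
print(build_network(abstracts))
```

Run over a large collection of abstracts instead of two invented sentences, the same grouping step would give a crude version of the network of relationships described above. A real system would need multi-word names, more patterns, and some evaluation of how often the pattern misfires.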