
Mon, 03.12.07

Tim O'Reilly

Different Approaches to the Semantic Web

I've been thinking a bit more about why I'm more excited about Metaweb's freebase than I have been about previous Semantic Web projects that I've been exposed to.

I think part of it is the difference in how they capture data about relationships. A good example is Semantic MediaWiki, which Stefano Mazzocchi pointed me to. It captures relations in a very explicit way, in this case using structured wikitext. For example, as the Wikipedia page on Semantic MediaWiki explains, an entry about Berlin might include the wikitext:

 ... the population is [[population:=3,993,933]]

resulting in the output text "the population is 3,993,933" and the hidden semantic tuple "'Berlin' 'has population' '3993933'".
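To make the mechanics concrete, here's a toy sketch in Python of that annotation-to-triple step (the regex and function names are mine, not Semantic MediaWiki's actual implementation):

    import re

    # Matches inline annotations of the form [[property:=value]]
    ANNOTATION = re.compile(r"\[\[(?P<prop>[^:\]]+):=(?P<value>[^\]]+)\]\]")

    def extract_triples(subject, wikitext):
        """Return (display_text, triples) for one wiki page."""
        triples = [(subject, m.group("prop"), m.group("value"))
                   for m in ANNOTATION.finditer(wikitext)]
        # The reader just sees the value; the triple is hidden structure.
        display = ANNOTATION.sub(lambda m: m.group("value"), wikitext)
        return display, triples

    text, triples = extract_triples(
        "Berlin", "... the population is [[population:=3,993,933]]")
    # text    -> '... the population is 3,993,933'
    # triples -> [('Berlin', 'population', '3,993,933')]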

It seems easy enough, but why hasn't this approach taken off? Because there's no immediate benefit to the user. He or she has to be committed to the goal of building hidden structure into the data. It's an extra task, undertaken for the benefit of others. And as I've written before, one of the secrets of success in Web 2.0 is to harness self-interest, not volunteerism, in a natural "architecture of participation."

By contrast, in freebase, an entry about Germany would show an explicit form intended to capture critical statistics about a location. What's so clever is that by articulating the types as a separate structure from the data, and having instances inherit that structure when they are created, users don't think they are providing metadata -- they think they are just providing data.

Because anyone creating a new instance is prompted to fill out the data in a structured way, it doesn't seem like an extra task; rather, the software seems helpful. Any data field can be left blank, but it can also easily be updated by anyone else who cares to do so. And in fact, applications that don't explicitly present themselves as Semantic Web applications, like the Web 2.0 family tree maker, Geni, work exactly the same way. The user is given an opportunity to create a very structured entry that doesn't feel like a chore, but just like the natural way to perform the task.
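(Here's a toy sketch in Python of that idea -- the type and field names are made up, not Metaweb's actual data model:)

    # A "type" is articulated once, as a structure separate from the data.
    LOCATION_TYPE = {
        "name": "Location",
        "fields": ["population", "area_km2", "capital"],  # illustrative fields
    }

    def new_instance(type_def, name):
        """Create an entry that inherits its structure from the type.

        The user just sees a helpful form to fill in; the metadata (the
        schema) comes along invisibly. Any field may be left blank.
        """
        instance = {"type": type_def["name"], "name": name}
        instance.update({field: None for field in type_def["fields"]})
        return instance

    germany = new_instance(LOCATION_TYPE, "Germany")
    germany["population"] = "82,310,000"  # anyone else can fill in the rest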

And these applications are fairly addictive. Go try Geni, and I bet that before long, you've got your whole family at work on it. (I do.) Similarly, once freebase is open to the public, everyone will be busily constructing entries on themselves, their businesses, their products -- all the commercial activity that is explicitly banned from Wikipedia. (Freebase's supersetting of Wikipedia is very clever in this way. If you've got a Wikipedia entry, it's included. But if not, you get to write about yourself.)

That being said, one of the other sites that Stefano pointed me to, dbpedia, shows what is ultimately perhaps an even more powerful tool, namely extracting implicit structure from data already out there on the web where it can be found.

This was the vision of a company called Invisible Worlds, founded by Carl Malamud and Marshall Rose before the dotcom bust. (I was on the board.) Carl and Marshall realized that there were many types of semi-structured data out on the web -- e.g. the Edgar database of the SEC -- and that it was possible to extract that structure and make it more accessible.

This is what dbpedia is doing with Wikipedia, and what Adrian Holovaty is doing with Chicagocrime.org, but it's also what Google is doing with PageRank.
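(To illustrate the difference, here's a toy sketch in Python of this kind of extraction -- not DBpedia's actual code -- mining the key-value structure already sitting in a Wikipedia infobox:)

    import re

    # Infobox templates hold key-value lines like "| population = 3,993,933"
    INFOBOX_FIELD = re.compile(r"^\|\s*(?P<key>\w+)\s*=\s*(?P<value>.+)$",
                               re.MULTILINE)

    def scrape_infobox(page_name, infobox_wikitext):
        """Extract explicit triples from structure that's already out there."""
        return [(page_name, m.group("key"), m.group("value").strip())
                for m in INFOBOX_FIELD.finditer(infobox_wikitext)]

    wikitext = "{{Infobox City\n| population = 3,993,933\n| country = Germany\n}}"
    print(scrape_infobox("Berlin", wikitext))
    # -> [('Berlin', 'population', '3,993,933'), ('Berlin', 'country', 'Germany')]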

This is the true Web 2.0 way: don't ask users to provide structure, unless it's useful to them. But do design your applications in such a way that structure is generated without extra effort on the user's part. And mine structure that already exists, even if it's messy and inefficient.

Consider one of my favorite examples, the still to-be-built Web 2.0 address book. A formal Semantic Web approach might say: "Users should create FOAF files, expressing their social network." But the fact that I have someone in my address book already expresses a relation! And all the data that could be collected (as Google collects and analyzes web data) expresses even more detail. How often do I contact this person? By phone? By email? By IM? By SMS? How quickly do I respond? What topics do I tend to share with which of my correspondents? All of these heuristics, properly collected, would provide a far more powerful "friend of a friend" network than anything built explicitly with FOAF. (The first big communications company -- email, phone, or IM -- that does this right will have a killer app!)
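(A toy sketch in Python of what such a peoplefinder might compute -- the channel weights are invented for illustration; a real system would tune or learn them:)

    from collections import defaultdict

    # Illustrative weights per channel; no user ever fills these in.
    WEIGHTS = {"phone": 3.0, "email": 1.0, "im": 1.5, "sms": 2.0}

    def relationship_strength(events):
        """Rank contacts from communication the user already generates.

        `events` is an iterable of (contact, channel) pairs. No FOAF file
        is written by hand; the social graph falls out of behavior.
        """
        scores = defaultdict(float)
        for contact, channel in events:
            scores[contact] += WEIGHTS.get(channel, 1.0)
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

    log = [("alice", "phone"), ("alice", "email"),
           ("bob", "sms"), ("alice", "im")]
    print(relationship_strength(log))  # [('alice', 5.5), ('bob', 2.0)]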

We might call this the "small s semantic web" by contrast with the formal Semantic Web.

Now it may be that FOAF will be the right mechanism for expressing the data that my imagined Google-style peoplefinder collects, but I tend to doubt it. PageRank and all the other factors that Google uses probably could be expressed using Semantic Web style syntax, but I somehow doubt that they are.

The semantic web is definitely coming. But it's coming through different mechanisms, I think, than the Semantic Web pioneers imagined.




Comments: 37

Simon   [03.12.07 05:26 AM]

"The still to-be-built Web 2.0 address book."? Plaxo works for me...

Guy   [03.12.07 05:34 AM]

I think applications have always captured typed information (hopefully) in a way that is natural to the user. I think the challenge is to make that information accessible and discoverable to people that use a different structure, meant for an unanticipated purpose. That is what the Semantic Web is trying to solve, and I don't think anyone has a good answer for that yet.

Martin Bittner   [03.12.07 06:27 AM]

Hi,

I think the very first link of that post is broken.

Cheers!

Michael R. Bernstein   [03.12.07 09:30 AM]

You are probably right, in terms of large-scale siloed apps, but if any of this stuff is going to be exportable, importable, and interoperable, RDF and several other of the Big-S Semantic Web technologies will come in very handy, to say the least.

Which is far from a foregone conclusion, of course.

Bob Boydston   [03.12.07 09:33 AM]

Another big area is that of semantics. It is one thing to impose explicit structured data elements/attributes (i.e., [population:=3,993,933]). It is quite another to attempt to glean data from general text. Deciphering structured data is always a doable task.

Tim, I agree with your idea of the ultimate address book. The relationships are already there! The strength of those relationships may not be so apparent.

Walter   [03.12.07 11:24 AM]

Have a look at WikiLens: http://www.wikilens.org

As far as I know WikiLens uses SemanticWikiSyntax internally but maps it to a better interface.

peter   [03.12.07 03:09 PM]

I do not think that the Semantic MediaWiki approach has no immediate benefit to the user. I am sure that if SMW were integrated into WP, there would be enough (power) users who would gladly use it for additional semantic annotations. All other users would just ignore the semantic sugar.
Think of all the -- now manually maintained -- lists of movies from the 1970s by an Italian director. The quality and completeness of such content should really increase, and nearly arbitrary queries would become possible. Think big.

Denny Vrandecic   [03.12.07 03:45 PM]

About Semantic MediaWiki, you ask, "why hasn't this approach taken off?" Well, because we're still hacking :) But besides that, there is a growing number of sites that actually use our beta software. Take a look at discourseDB for example. Great work there!

You give the following answer to your question: "Because there's no immediate benefit". Actually, there is benefit inside the wiki: you can query the knowledge that you have made explicit within the wiki. So the idea is that you can make automatic tables like this list of Kings of Judah from the Bible wiki, or this list of upcoming conferences, including a nice timeline visualization. This is immediate benefit for wiki editors: they don't have to repeat themselves over and over, making pages by hand. Less work, higher quality -- that seems beneficial to me. Having the data available in standard formats is just an extra: the actual benefit is inside the wiki itself.

Guy   [03.12.07 04:25 PM]

On second thought, maybe we just need to start wearing our metadata on the outside, like accessories instead of like underwear.

bdeseattle   [03.12.07 07:45 PM]

Tim - If what you say is true -- that we need to build smarter webapps that do a better job of implicitly collecting user-generated metadata -- without putting the burden on the end user to "figure it out" -- don't we need better tools and refined techniques (aka lessons learned from web2.0) to properly build out the next generation of [semantic]web3.0 apps?

Look at some of the more popular 2.0 sites that collect/leverage user-generated metadata -- del.icio.us, flickr, technorati, myweb2.0, bluedot, digg, etc. -- you know the sites that expose fancy keyword tagging interfaces and then harvest the metadata to drive the socially-driven phenom.

In my opinion, these sites are still too geeky, aren't normalized (see my previous rant on this topic), and are generally too hard for the average end user to really figure out. Do I use Blogger labels or Technorati tags? Are keywords separated with spaces or commas? If I'm asking these questions, then certainly the average joe is going to be lost as to the benefits (let alone the existence) of user-contributed metadata.

My point is that today -- in web2.0 land -- there's really no consistency in the way users contribute metadata. If Yahoo can't rationalize their own tagging systems, then how can we expect the entire industry to build better [semantic]web3.0 apps that will reach the masses and bring the benefits and promise of semanticweb techniques to the average joe?

I think what we need are better tools, techniques, and perhaps standards (uM?) that will allow us geek types to build better webapps that take the guesswork out of user-contributed metadata, ultimately taking the burden off of the user to "figure it all out" in the browser.

With today's tools and techniques, it seems like it's still too hard to build a webapp for the masses that captures enough quality user-contributed metadata that will be required to properly feed the next generation of semanticweb apps. Don't get me wrong - we came a long way with web2.0, but I think we can do better. After all isn't most of web2.0 still in beta? I think it's time we got web2.0 out of beta, into production, and begin work on web3.0.

Shouldn't the next generation of [semantic]web3.0 apps start with some of the fundamental lessons learned from web2.0? Things like unobtrusively capturing end user behavior (while taking privacy concerns into account) and implicitly collecting only the set of metadata needed to feed the semanticweb engine.

Then it's just a matter of us geek types kicking things up another notch and harnessing the firehose of metadata that will come spewing out of the semanticweb engine.

If web2.0 can give us sites like Original Signal and Digg Swarm, I can only imagine the doors that will open once we get access to richer user-contributed metadata on the web3.0 dev platform.

Thanks again for the great post Tim!

PS - Given the nature of my comment, I find it funny that I'm forced to type raw HTML in a textarea input to format my comment and include my reference links. More testament to the fact that we have a lot of work to do to bring the true benefits of semanticweb --and even web2.0 techniques to the masses. Should the average user have to know HTML just to respond to a blog post? Food for thought.

Boris Anthony   [03.12.07 10:05 PM]

"Web 3.0" is the love child that will arrive when the navel-gazing teenager that is Web 2.0 grows up a bit, becomes a bit more worldly and cooperative and meets a Semantic Web who has learnt to loosen up and slip into something more sexy.

They've already met and sparks were seen.

Kevin Marks   [03.12.07 10:48 PM]

The other approach you have overlooked here is the Microformats one, where shared semantic meaning is added to the HTML markup that everyone is already publishing. Your use of the term "small s semantic web" made me smile, as this is what Tantek and I called the evening unconference presentation we made at eTech 3 years ago, when we introduced Microformats.

Danny   [03.13.07 12:04 PM]

[[
But the fact that I have someone in my address book already expresses a relation!
]]
Absolutely!

The problem with most current approaches to such data is that it's only useful in that one, domain-specific application. But it is perfectly good data that could be used (and hence reused) with Semantic Web technologies.

[[
All of these heuristics, properly collected, would provide a far more powerful "friend of a friend" network than anything built explicitly with FOAF.
]]
Those 'heuristics' will provide source data, but then what? You still need a way of representing the relationships in a useful, machine-processable fashion.

If you want this stuff to work on the web, then you might want to consider using the identifiers of the web, URIs, to identify the things you are talking about and the relationships between them. RDF is a framework for doing this.

FOAF is an RDF vocabulary that can be used to describe people in the common model. The source of the data is fairly irrelevant.

Handy that Kevin should mention microformats - this is a very good example. Information put online using the XFN microformat can be transparently read as RDF (using GRDDL). In that form it can be freely integrated with data from other sources, queried, filtered, republished, whatever.
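(A toy sketch of that idea in Python -- the URIs are made up, and a real GRDDL pipeline uses an XSLT transform rather than a regex:)

    import re

    # The HTML people already publish, carrying an XFN rel attribute:
    xfn_html = '<a href="http://example.org/alice" rel="friend met">Alice</a>'

    m = re.search(r'href="([^"]+)"[^>]*\brel="([^"]+)"', xfn_html)
    uri, rels = m.group(1), m.group(2).split()

    # ...read as machine-processable triples built on URIs:
    me = "http://example.org/me"
    triples = [(me, "http://gmpg.org/xfn/11#" + rel, uri) for rel in rels]
    # -> [(me, '...#friend', 'http://example.org/alice'), (me, '...#met', ...)]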

Dominic Widdows   [03.13.07 12:31 PM]

Tim's main point is well-made - the funny thing is that, from any traditional information design point of view, it's obvious. If you want someone from any remotely normal sector of the population to give you structured data, you don't try and teach them the syntax of some structured language, you design a form for them to fill in. Shut your eyes for a minute and try to imagine applying for a driver's license, filing taxes, applying for a passport ... with the presumption that your user base is going to learn some kind of XML or Wiki syntax well enough to send you data that your system can correctly parse and enter into its database. Our national bureaucracies would crumble within days.

The question to ask isn't "could Web 2.0 / small s semantic web do a better job of gathering structured data if they give people a form to fill in?". Of course they could. The interesting question to ask is "why on earth did anyone imagine that people were going to learn some structured language to do it in free text?". That's an interesting story, and it goes back to Tim's (absolutely correct) point that people contribute useful data when they see the immediate benefit. Enter HTML. HTML gained its popularity not as a semantic structuring language, but as a visual formatting language. Suddenly a lot of people could learn to type "<table>" and "<b>" and "<a>" and stuff like that -- not because the W3C said it was a virtuous thing to do, but because they could see the results before their eyes in a web browser, they knew their friends would be able to see it, they knew their friends would think it was cool, and it took off. A technology had found its niche.

Should anyone have ever gone on to assume that the same solution could be applied to semantic structuring? Of course not. One technology is aimed at getting a human to appreciate visual layout, which humans are good at doing: the other is aimed at getting a computer to appreciate semantic intent, which computers are (so far at least) lousy at doing. Of course, Semantic Web and Web x.0 designers should look to traditional methods for soliciting semantically structured data. If you doubt this, try using emacs to type out your tax return in XML - but please don't try the experiment for more than a few minutes, it's a really really bad idea!

jeremy liew   [03.13.07 12:37 PM]

Tim,

Extracting metadata from the implications of other actions is a wonderful low-touch approach to gathering metadata for the semantic web. LinkedIn's toolbar does a nice job, for instance, by prompting the addition of people you frequently send and receive email from to your address book. But this requires some set of business rules that probably work better in closed environments than open ones, which might not be compatible with the principles of the semantic web. Making the metadata that is collected in this way publicly available would work, but the specific implementations of the business rules likely need to be controlled in some way.

Anonymous   [03.13.07 03:47 PM]

"By contrast, in freebase, an entry about Germany would show an explicit form intended to capture critical statistics about a location."

I hope it's a bit more refined than just 'population'. Because by itself the term 'population' is only semi-useful. For a start, population when? Usually resident population, or headcount? Estimated or actual? Etc etc etc.

Part of the problem with user supplied metadata is that the metadata is usually not well enough defined.

Evorgleb   [03.13.07 10:08 PM]

Geni.com is such a great idea! I just did a blog post about Geni.com over on Highbrid Nation if you want to check it out. I look forward to building my tree and maybe meeting some new family members along the way. I think the site has a great future.

Gregory Kohs   [03.20.07 11:24 AM]

Tim, this article is an amazing work. Thank you for writing it. We are building a Semantic Mediawiki directory for persons and businesses, which people can explore at Centiare.com.

To add to Denny Vrandecic's comments about why Semantic Mediawiki is actually still pretty cool -- Centiare allows our users to serve up not only Google AdSense ads, but YouTube videos, Amazon Associates storefronts (all organized with RDF), and even document and spreadsheet storage. Not only that, but we're finding that Google is going nuts over ranking Centiare Directory listings in search results.

For example, try searching ((Littleton graphics stationery)) in Google. "Lion Graphics" comes up #1 out of over 35,000 results. Try searching ((stone crab scientific name)). A rinky-dink page on Centiare comes up higher than Wikipedia's own page about the stone crab.

I don't think Geni (as cool as it looks) is ever going to get this kind of immediate search engine benefit.

Again, thank you for such an informative, insightful article!

Fred Howell   [03.24.07 03:45 AM]

Another approach to providing immediate benefit to users for adding semantic info to web pages is being developed by Textensor - it starts with a sticky notes application in the web browser and allows extra links, relations and tags to be added to highlighted phrases in text.

There's a quick screenshot tour or a detailed white paper comparing the approach to semantic wikis (and to Vannevar Bush's ideas on semantic annotation from 1945...)

PTC   [05.07.07 09:29 PM]

Well said!
The Semantic Web is definitely coming, but it is still vague when it's coming. Considering the aspect of immediate benefits to users (as you have mentioned), there are other contenders for the next generation web, like the pervasive web or Second Life. The semantic web is certainly going to take a long time before realization.

A comprehensive take on Semantic Web at http://techbiz.blog.com/1730241/

Dan   [06.24.07 06:42 PM]

Walter,

Thanks for mentioning WikiLens.

I do not know what SemanticWikiSyntax is. We do use a form of wiki syntax to specify fields. See for example http://wikilens.org/wiki.php/WikiLens/StructuredDataTutorial.

I am uncertain of the possible success of the semantic web as a whole. I think its proponents should get behind some very specific applications to make it popular.

Dan

Gombos Atila   [07.31.07 05:20 PM]

The Semantic Web is a nice concept, but it is hard for it to become a reality.

Tim Berners-Lee said: "I have a dream for the Web [in which computers] become capable of analyzing all the data on the Web - the content, links, and transactions between people and computers. A ‘Semantic Web’, which should make this possible, has yet to emerge, but when it does, the day-to-day mechanisms of trade, bureaucracy and our daily lives will be handled by machines talking to machines. The ‘intelligent agents’ people have touted for ages will finally materialize"

So in the future, everything is possible.

ibolya   [08.01.07 02:57 AM]

Thank you for this interesting post, Tim! I also think that our move to the Semantic Web is progressing steadily and that this evolution relies on developers' adoption of bridging technologies like microformats or plugins for popular blog/cms platforms.

William   [08.01.07 11:21 AM]

Semantic web applications, web 2.0, and the web 3.0 some are trying to create are all great, but am I the only one who feels that these tools are being overused?

Semantic web applications and web 2.0 can be and are used in good ways, which Geni is a perfect example of (thanks for the link btw), but simplicity is king and web 2.0 features are often used where they are not needed.

I would like to urge everyone to remember that these are tools and you'll get the best result if you only use these tools when needed.

asin   [08.01.07 11:32 AM]

Sir, this kept me thinking for a day; that's why I am commenting here. I got a lot of information about the "semantic" web, thanks a lot. I read all your articles at O'Reilly. Thanks a lot.

Tupac   [08.01.07 02:47 PM]

"Web 3.0" is the love child that will arrive when the navel-gazing teenager that is Web 2.0 grows up a bit, becomes a bit more worldly and cooperative and meets a Semantic Web who has learnt to loosen up and slip into something more sexy.

They've already met and sparks were seen.

Great reply :)

Andrei D.   [08.02.07 06:04 PM]

In my opinion, to achieve Web 3.0 we also need a cohesive identity management system that’s user centric rather than vendor-centric. Right now, user identity is scattered across all service providers and web 2.0 microcap niche services - I have my ID at Amazon, at PayPal, at Yahoo, at LinkedIn, etc... I have to go to these sites to get services. It’s a mirror image of the traditional brick-n-mortar where I have to go there to get service. I hope in Web 3.0, we turn it upside down and they come to me. And an identity system is crucial to make that happen.

netfreez   [08.08.07 02:31 PM]

Great article, although my thought is also that we are still several steps away from the web 3.0 era. With the development of devices such as the iPhone, the general idea of what to expect with web 3.0 can already be seen in some of today's applications and technological implementations within websites.

Also, good comment by "Guy", twisted but nice approach to explain your point. I believe that in the SEO world, you would see developers creating search-friendly websites by packing a lot of the information within external files or as you like to say, accessories.

skierpage   [08.12.07 06:47 PM]

Note that a lot of MediaWiki info is in MediaWiki templates; in fact they're a main source of info for DBpedia. Someone would modify the template once to include the Semantic MediaWiki syntax, and it's no extra effort for users to enter semantic information. There's even a nifty DiscourseDB extension to the extension that turns completing a MediaWiki template into a form.

Semantic MediaWiki goes beyond template hacking to let users express semantic information anywhere in wiki text, so it can capture more and better information than DBpedia's excellent "template scraping".

Templates in MediaWiki are powerful but cumbersome; it's impressive that Wikipedians are able to do so much with them. Freebase (didn't that burn a naked Richard Pryor?) types sound better and easier, but the underlying metaweb.com tech is NOT open source (yet?).

Luke   [08.21.07 08:27 AM]

LOL@ On second thought, maybe we just need to start wearing our metadata on the outside, like accessories instead of like underwear.

That really is a good one :D

Chris Wong   [08.22.07 09:01 AM]

This is a very good point and I am guessing that we are going to be seeing a lot more of this soon. In fact, I would personally go even further and make the definition slightly more strict than just a requirement for use of a technology in the SW cake like RDF. Perhaps I would venture to say that there must also be an intention for improved integration and better machine interoperability? Eh... that's debatable. I see this a lot with Service Oriented Architecture (SOA). Any time a new technology term creates a wave in the zeitgeist, every marketeer in the industry tries to surf it before they really even understand it. There will be a lot of people and organizations who want to capitalize on the buzz around the Semantic Web (if there is one). But I think we also have to be careful to have consideration for people who may simply not understand, but are trying hard to. I know from my own experience that there are many concepts within the Semantic Web, and not one of them has a simple name or is that easy to understand. Many people will, I predict, use the terminology incorrectly by mistake at first, but not for the sake of gaming.

Daniel   [10.05.07 03:50 PM]

One more thought: The Semantic Web is a framework that rigidly defines a means for creating statements of the form “Subject, Predicate, Object” or “triples,” in a machine-readable format, where the Subject and Predicate are URIs and the Object is either a URI or a literal value.
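(As a made-up illustration, reusing the Berlin example from the post -- these URIs are not a real vocabulary:)

    # One machine-readable statement as a (subject, predicate, object) triple.
    # Subject and predicate are URIs; the object here is a plain literal.
    triple = (
        "http://example.org/resource/Berlin",        # subject
        "http://example.org/property/population",    # predicate
        "3993933",                                   # object (literal)
    )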

J.O. From Urban MVP   [10.09.07 02:25 PM]

Real interesting read.

I think what is missing in the article, though, is the network effects and benefits that web 2.0 design has on websites. Although self-interest is the driving force in economics, it doesn't have to be for a web 2.0 project to be effective. Volunteerism can and does work with web 2.0 projects due to the network effect that one receives. I guess you can attribute this back to self-interest, but what action can't you, if you follow that logic?

A great example, and probably an abused example, is Wikipedia, which was born out of volunteerism. What made this project successful is purely the network effect, i.e., the more people used it, the more useful the website became, and therefore the more dependent the users became. That is the essence of Web 2.0 success. So at the end of the day it really doesn't matter whether the key factor driving a project's growth is self-interest or volunteerism.

Tim O'Reilly   [10.09.07 09:37 PM]

J.O. -

I didn't say that volunteerism wasn't effective, just that architecting a system so that people don't even know they are contributing is even more effective. You don't think you're "contributing" to the web when you make a link to another site, but you are.

Amie Stilo   [10.10.07 03:33 AM]

I'm not really up on the terminology "Semantic", but it was an interesting read anyway. Sorry for this layman understanding, but from what I can see you are making a cross reference between an "open" Internet as opposed to a closed one? In which case I would agree with you that, as Bill Gates proved with his IBM relationship, the open format or open networking will always prevail over closed proprietary thinking. Again, sorry for the mind hack.

Listener   [10.26.07 12:38 PM]

Semantics on the web is a very difficult problem. Google specialists have been working on this problem for many years, but you wrote about it really well. Thank you.

Jason   [12.29.07 05:20 AM]

I think that 2008 in many ways will be the year hyper-local targeting becomes feasible. What this opens up also is the ability to dynamically change content based on both behavioral and geo segmentation.

