Where the semantic web stumbled, linked data will succeed

Linked data allows for deep and serendipitous consumer experiences.

In the same way that the Holy Roman Empire was neither holy nor Roman, Facebook’s OpenGraph Protocol is neither open nor a protocol. It is, however, an extremely straightforward and applicable standard for document metadata. From a strictly semantic viewpoint, OpenGraph is hardly worthy of comment: it is a frankenstandard, a mishmash of microformats and loosely typed entities, lobbed casually into the semantic web world with hardly a backward glance.

But this is not important. While OpenGraph avoids, or outright ignores, many of the problematic issues surrounding semantic annotation (see Alex Iskold’s excellent commentary on OpenGraph here on Radar), criticism focusing only on its technical purity is missing half of the equation. Facebook gets it right where other initiatives have failed. While OpenGraph is incomplete and imperfect, it is immediately usable and sympathetic with extant approaches. Most importantly, OpenGraph is one component in a wider ecosystem. Its deployment benefits are apparent to the consumer and the developer: add the metatags, get the “likes,” know your customers.
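To make “add the metatags” concrete: an OpenGraph annotation is just a handful of og: properties in a page’s head, which any consumer can read back out. Below is a minimal sketch using Python’s standard-library html.parser; the page markup and its property values are invented for illustration, not taken from any real site:

```python
from html.parser import HTMLParser

class OpenGraphParser(HTMLParser):
    """Collects OpenGraph <meta property="og:..."> tags from a page."""
    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        prop = attrs.get("property", "")
        if prop.startswith("og:") and "content" in attrs:
            self.properties[prop] = attrs["content"]

# A hypothetical page head carrying OpenGraph annotations.
page = """
<html><head>
  <meta property="og:title" content="Joe's Diner" />
  <meta property="og:type" content="restaurant" />
  <meta property="og:url" content="http://example.com/joes-diner" />
</head><body>...</body></html>
"""

parser = OpenGraphParser()
parser.feed(page)
print(parser.properties["og:type"])  # restaurant
```

Three metatags are enough to declare a typed entity, which is precisely the low deployment cost the article credits for OpenGraph’s adoption.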

Such consumer causality is critical to the adoption of any semantic mark-up. We’ve seen it before with microformats, whose eventual popularity was driven by their ability to improve how a page is represented in search engine listings, and not by an abstract desire to structure the unstructured. Successful adoption will often entail sacrificing standardization and semantic purity for pragmatic ease-of-use; this is where the semantic web appears to have stumbled, and where linked data will most likely succeed.

Linked data intends to make the Web more interconnected and data-oriented. Beyond this outcome, the term is less rigidly defined. I would argue that linked data is more of an ethos than a standard, focused on providing context, assisting in disambiguation, and increasing serendipity within the user experience. This idea of linked data can be delivered by a number of existing components that work together on the data, platform, and application levels:

  • Entity provision: Defining the who, what, where and when of the Internet, entities encapsulate meaning and provide context by type. In its most basic sense, an entity is one row in a list of things organized by type — such as people, places, or products — each with a unique identifier. Organizations that realize the benefits of linked data are releasing entities like never before, including the publication of 10,000 subject headings by the New York Times, admin regions and postcodes from the UK’s Ordnance Survey, placenames from Yahoo GeoPlanet, and the data infrastructures being created by Factual [disclosure: I’ve just signed on with Factual].
  • Entity annotation: There are numerous formats for annotating entities when they exist in unstructured content, such as a web page or blog post. Facebook’s OpenGraph is a form of entity annotation, as are HTML5 microdata, RDFa, and microformats such as hcard. Microdata is the shiny, new player in the game, but see Evan Prodromou’s great post on RDFa v. microformats for a breakdown of these two more established approaches.
  • Endpoints and Introspection: Entities contribute best to a linked data ecosystem when each is associated with a Uniform Resource Identifier (URI), an Internet-accessible, machine-readable endpoint. These endpoints should provide introspection, the means to obtain the properties of that entity, including its relationship to others. For example, the Ordnance Survey URI for the “City of Southampton” is http://data.ordnancesurvey.co.uk/id/7000000000037256. Its properties can be retrieved in machine-readable formats (RDF/XML, Turtle, and JSON) by appending an “rdf,” “ttl,” or “json” extension to the above. To be properly open, URIs must be accessible outside a formal API and authentication mechanism, exposed to semantically-aware web crawlers and search tools such as Yahoo BOSS. Under this definition, local business URLs, for example, can serve in part as URIs — ‘view source’ to see the semi-structured data in these listings from Yelp (using hcard and OpenGraph) and Foursquare (using microdata and OpenGraph).
  • Entity extraction: Some linked data enthusiasts long for the day when all content is annotated so that it can be understood equally well by machines and humans. Until we get to that happy place, we will continue to rely on entity extraction technologies that parse unstructured content for recognizable entities, and make contextually intelligent identifications of their type and identifier. Named entity recognition (NER) is one approach that employs the above entity lists, which may also be combined with heuristic approaches designed to recognize entities that lie outside of a known entity list. Yahoo, Google and Microsoft are all hugely interested in this area, and we’ll see an increasing number of startups like Semantinet emerge with ever-improving precision and recall. If you want to see how entity extraction works first-hand, check out Reuters-owned Open Calais and experiment with their form-based tool.
  • Entity concordance and crosswalking: The multitude of place namespaces illustrates how a single entity, such as a local business, will reside in multiple lists. Because the “unique” (U) in a URI is unique only to a given namespace, a world driven by linked data requires systems that explicitly match a single entity across namespaces. Examples of crosswalking services include: Placecast’s Match API, which returns the Placecast IDs of any place when supplied with an hcard equivalent; Yahoo’s Concordance, which returns the Where on Earth Identifier (WOEID) of a place using as input the place ID of one of fourteen external resources, including OpenStreetMap and Geonames; and the Guardian Content API, which allows users to search Guardian content using non-Guardian identifiers. These systems are the unsung heroes of the linked data world, facilitating interoperability by establishing links between identical entities across namespaces. Huge, unrealized value exists within these applications, and we need more of them.
  • Relationships: Entities are only part of the story. The real power of the semantic web is realized in knowing how entities of different types relate to each other: actors to movies, employees to companies, politicians to donors, restaurants to neighborhoods, or brands to stores. The power of all graphs — these networks of entities — is not in the entities themselves (the nodes), but in how they relate to one another (the edges). However, I may be alone in believing that we need to nail the problem of multiple instances of the same entity, via concordance and crosswalking, before we can tap properly into the rich vein that entity relationships offer.
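The list-driven side of entity extraction described above can be sketched very simply. The toy matcher below scans text against a known entity list and reports each hit with its type and identifier; the names and identifiers here are invented, and real NER systems layer statistical models and heuristics on top of such lists to catch entities the list does not contain:

```python
# Toy dictionary-based entity extraction: scan text for known entity
# names and report each match with its type and (invented) identifier.
ENTITY_LIST = {
    "southampton": ("place", "geo:7000000000037256"),
    "new york times": ("organization", "org:nyt"),
    "yahoo": ("organization", "org:yahoo"),
}

def extract_entities(text):
    found = []
    lowered = text.lower()
    for name, (etype, eid) in ENTITY_LIST.items():
        pos = lowered.find(name)
        if pos != -1:
            found.append({"name": name, "type": etype,
                          "id": eid, "offset": pos})
    # Return matches in the order they appear in the text.
    return sorted(found, key=lambda e: e["offset"])

doc = "The New York Times wrote about a startup based in Southampton."
for entity in extract_entities(doc):
    print(entity["name"], entity["type"], entity["id"])
```

Even this crude version shows why entity lists matter: the extractor is only as good as the identifiers it can resolve against.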
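Concordance and crosswalking, likewise, reduce to a matching problem. The sketch below naively matches place records across two hypothetical namespaces on a normalized (name, city) key and emits identifier-to-identifier links; real services such as Placecast’s Match API or Yahoo’s Concordance use far richer matching, and every record and identifier here is made up:

```python
# Naive entity concordance: match place records across two hypothetical
# namespaces by normalized (name, city) key, yielding ID-to-ID links.
namespace_a = [
    {"id": "a:101", "name": "Joe's Diner", "city": "Portland"},
    {"id": "a:102", "name": "Blue Cafe",   "city": "Seattle"},
]
namespace_b = [
    {"id": "b:9001", "name": "JOE'S DINER", "city": "portland"},
    {"id": "b:9002", "name": "Red Bakery",  "city": "Denver"},
]

def key(record):
    """Normalize the fields used for matching."""
    return (record["name"].strip().lower(), record["city"].strip().lower())

def concord(left, right):
    """Return {left_id: right_id} for records that match across namespaces."""
    index = {key(r): r["id"] for r in right}
    return {r["id"]: index[key(r)] for r in left if key(r) in index}

links = concord(namespace_a, namespace_b)
print(links)  # {'a:101': 'b:9001'}
```

The hard part in practice is the normalization: misspellings, renames, and near-duplicates are exactly why the article calls these services unsung heroes.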

The approaches outlined above combine to help publishers and application developers provide intelligent, deep and serendipitous consumer experiences. Examples include the semantic handset from Aro Mobile, the BBC’s World Cup experience, and aggregating references on your Facebook news feed.

Linked data will triumph in this space because efforts to date focus less on the how and more on the why. RDF, SPARQL, OWL, and triple stores are onerous. URIs, microformats, RDFa, and JSON, less so. Why invest in difficult technologies if consumer outcomes can be realized with extant tools and knowledge? We have the means to realize linked data now — the pieces of the puzzle are there and we (just) need to put them together.

Linked data is, at last, bringing the discussion around to the user. The consumer “end” trumps the semantic “means.”




  • Tyler, nice job synthesizing some of the key considerations and possibilities of linked data.

    However, the benefits of linked data extend FAR beyond consumer experiences. In fact, I would go so far as to say that the value of linked data in non-consumer applications could well be many multiples of that in the consumer space. I take a broader view that it isn’t just about data, either. The value is amplified when you capture linked data, services, events – all of the elements of an activity, person, thing, machine, transaction, etc.

    We’ve implemented these concepts in our ThingWorx platform. Linked data models need to be much more elastic and adaptive than traditional data models in order to allow you to solve problems and ask questions you didn’t even know you had. This is at the core of what we’ve done with our model-based application development and query/analysis engine, and it represents a re-thinking of how technology can be applied to solve a new class of problems. Incidentally, it’s built upon a graph model database.

    I think this would be a great topic for the Strata conference next February, and we’d be happy to share our real-world experiences in applying these concepts if you’re interested.

    The other comment I want to make is that all of the acronym soup in the area of semantic data and applications is a big distraction (RDF, OWL, etc.). These are implementation details (maybe at some future point a consideration for interoperability, though I’m skeptical) that have little to do with delivering the end results of linked data. We’ve chosen to focus on the functionality at this stage (and coincidentally, we’ve addressed quite a few, if not all, of the bullet points you pointed out).

    Looking forward to seeing more about the topic in the months and years ahead!

  • You raise a number of insightful points with regards to pragmatic aspects of the burgeoning Web of Linked Data, for sure, but I think it’s important to clarify a few things:

    1. Linked Data is neither an ethos nor a “best practice”; it’s about hypermedia-based structured data, just as HTML is about hypermedia-based structured documents.

    2. TimBL’s famous and popular meme about how to publish (actually inject) Linked Data into the World Wide Web could be perceived as ethos or “best practice” since it does prescribe use of W3C standards such as RDF formats and the SPARQL query language.

    OpenGraph, OData, GData, microformats, microdata, RDFa, and the RDF formats (RDF/XML, N3, Turtle, TriX, etc.) all provide different mechanisms for hypermedia-based structured data, but with varying levels of semantic granularity.

    Most important (IMHO) is that we are now entering the Structured Data era (Web 3.0) of general Web evolution. The intrinsic value of Linked Data (hypermedia-based structured data) is becoming clearer by the second.


  • paul morgan

    Linked Data is most certainly an ethos – maybe not an academic ethos, but most certainly a web developer one. I think your points relate more to the implementation of that ethos, and imply that an ethos cannot be one if it doesn’t follow a spec.

    There does appear to be a divide presently between the academic world and the pragmatic web developer world. The former like the clear structure and process of RDFa et al, whereas the latter go for the microformat et al implementation because of its simplicity and speed to deploy.

    If you want greater traction, you need to make it easier, simpler, faster for the average web dev to grok and deploy. Simples!

    A timely and thought provoking post. :)

  • Very good overview of the semantic web and linked data.

    I think we are just now starting to move from the theory of the semantic web to its application and development, which is due primarily to linked data. Linked data seems to have been the missing piece in the Semantic Web (as well as tools to create markup and RDF, so that it is easier to create and share data — WYSIWYG editors/creators…)

    “However, I may be alone in believing that we need to nail the problem of multiple instances of the same entity, via concordance and crosswalking, before we can tap properly into the rich vein that entity relationships offer.”

    I definitely agree this is going to be a huge issue as we continue to link more data together. I currently work with a database which uses student-input metadata; for some professors we already have 25-30 variations of their names. In some cases, those variations are so different that it would be almost impossible to find them. Perhaps, once we get to AI, these sorts of issues will cease to exist.

    Libraries have been dealing with this issue for years, in that they try to determine who a particular author is so that they can provide a single access/search point to all of an author’s works, regardless of the format of the work or the variation of the author’s name (among other access/search points, of course…)

    It is extremely hard to do. Not only do people (and other entities) change their names (informally and formally), there are misspellings, pseudonyms, variations, as well as multiple instances of the same name, in some cases with the same information.

    Libraries have frequently used dates (birth, death, and even estimated dates of life time) to develop a master record (authority record) to sort out who is really who.

    VIAF (the Virtual International Authority File) is a global project to create such records for use on the web.

    Of course, then the question is: does anyone care? Do we care if we don’t get everything we are seeking because the data can’t be linked together? Do we care if we get erroneous hits because the data was linked incorrectly (a false relationship)?

    I also do not see this as a divide between the academic world and the practitioner (@paul morgan); it is just a natural evolution between theory and practice. There are very practical web developers and designers at any university; I can assure you. ;-)

    Even in academic discussions about RDA (the new library cataloging rules) which is readying library data for the semantic web, linked data is being discussed front and center in terms of how it fits with structured data and what it can do with structured data.

    Linked data will enable libraries’ legacy data (existing and inherited structured data) to be crosslinked, broken into data elements, and repackaged, opening up that data to be used in new ways.

    Just my two cents.
    robin @georgiawebgurl

  • This is a wonderful article which outlines important aspects of the linked data vision and related tasks and issues. I agree with most of the comments also, particularly with the importance of the identity-resolution problem pointed out by Robin.

    I have a critical comment, however: I find the statement “RDF, SPARQL, OWL, and triple stores are onerous” a bit too far-fetched. Please consider that the BBC’s World Cup website, which you refer to as a positive example, is based on a semantic repository: all the metadata is represented in RDF with respect to a comprehensive OWL ontology, and SPARQL queries are used to generate the so-called index pages on the site. In “BBC World Cup 2010 dynamic semantic publishing” (http://www.bbc.co.uk/blogs/bbcinternet/2010/07/bbc_world_cup_2010_dynamic_sem.html), Jem Rayfield says, “A RDF triplestore and SPARQL approach was chosen over and above traditional relational database technologies due to the requirements for interpretation of metadata with respect to an ontological domain model.”

    I can easily imagine a good number of rough experiences with early versions of some of these engines which support your statement that triple stores are onerous. Still, this is the case with any new technology – I remember lots of people having rough experiences with HTTP servers in the late ’90s. The fact that linked data serves as a nice and very important starting point does not mean that everything else in the Semantic Web should be neglected.

    Finally, deep serendipitous linked data experiences often require combining data from multiple datasets/sources and a little bit of interpretation. In theory this can be done in many different ways using various sorts of technology. I claim that in practice the only efficient way to do it is to load these datasets into a semantic repository: a triple store capable of doing inference, i.e. of interpreting the semantics of the ontologies, schemata, and the data. I will provide a concrete example: our system FactForge (previously known as LDSR), which provides a “reason-able view” of 8 of the central LOD datasets: DBPedia, Geonames, Freebase, Wordnet, etc. This was the only system which provided a solution to the Modigliani Test for Linked Data (http://www.readwriteweb.com/archives/the_modigliani_test_for_linked_data.php) defined by Richard MacManus from ReadWriteWeb. The query required joining data from DBPedia, FactForge, UMBEL, and OpenCyc, plus some inference. I imagine that others would be willing to run such queries too, and I have a hypothesis as to why they do not: without an efficient semantic repository this is impractical.

    Atanas Kiryakov, Ontotext

    p.s.: really last word – I think we have a good example of serendipitous experience. Can you guess who is the most popular Germany-born entertainer? If anyone is curious about how such a query can be answered with linked data, and what the answer is, check slide #57 here: http://www.slideshare.net/ontotext/bringing-the-semantic-web-closer-to-its-tipping-point.

  • Great article, Tyler, and I agree with your analyses in general, but I have to raise an objection regarding the last point on Relationships. For a more expressive implementation of semantics on the web, we need a language like OWL, and developing entity definitions in isolation from the relationships would cause more problems than it sets out to solve. Note that I said “semantics on the web” and not Semantic Web. In my opinion the “Semantic Web” has not yet been clearly defined, which is why we see many technologies, philosophies, and implementations being placed into this bucket simply because there is no other place. You list the most obvious (RDF*, SPARQL, OWL), and I think all of these play a crucial part in allowing us to express the semantics of data, and Linked Data is a key element. Because of this mash-up, the Semantic Web gets an undeserved bum rap.

    I completely agree that we must get the fundamentals correct first, and a big part of that is Entities. Linked Data can help with identifying individual elements, concepts, etc., BUT defining relationships and assigning semantics to those relationships needs a much more expressive representation than Linked Data or even RDF provides. Linked Data and the associated URIs only create points of reference. They allow you to see what a “Person” is, for example, with some related metadata. But to truly represent semantics on the web we need to define how these references are related to each other, directly or as parts of a greater structure. Introspection may fall short in this regard and may not be expressive enough to cover such global, more complex relationships. Here’s a great set of slides (http://bit.ly/7MeGbc) on how Linked Data and “linked” ontologies can work together to make reasoning on the web that much more of a reality. Taxonomies can convey hierarchical (classification) relations, and more complex ontologies can capture complex relationships (classification, part-hood, order, what role something plays in a particular context, etc). The internet is an open world, and the idea of using closed-world principles, such as singular meanings (read: defaults) defined by a URI, may be problematic.

    Linked Data is a great way to express which element you are talking about, but it lacks the expressiveness to define the real semantics of relationships.

  • Great piece. Wholeheartedly agree. Not only is a graph model for information practical today, it has many advantages over traditional relational data models – in a phrase, content negotiation flexibility.

    We’ve made a unified graph information model the foundation of our information system. We manage all information (data, program code, rules, legacy systems, etc.) as a web of tagged resources. It creates an environment of transformable assets that is perfect for our canonical method, which involves a form of mash-up pattern.

    The result is Context-Aware Enterprise Content Management.

    Semantics are the search for meaning. Linked data is the exploitation of meaning.

  • Great article. You’re obviously not “alone in believing that we need to nail the problem of multiple instances of the same entity”.

    Duplicate data is already a nearly unbearable problem (web spam). When the world-at-large jumps on the linked data bandwagon, there’s going to be an explosion of people trying to attract attention to their data sources by producing seemingly unique data that is only slightly modified from some other source.

    This problem is one of the motivating factors behind the creation of Delineal. One of our goals is to fingerprint websites so that we can reasonably conclude whether a site is unique or not. This kind of information can contribute to crosswalking and concordance efforts.