Big data and the semantic web

At war, indifferent, or intimately connected?

On Quora, Gerald McCollum asked if big data and the semantic web were indifferent to each other, as there was little discussion of the semantic web topic at Strata this February.

My answer in brief is: big data’s going to give the semantic web the massive amounts of metadata it needs to really get traction.

As the chair of the Strata conference, I see a vital link between big data and the semantic web, and I have my own roots in the semantic web world. Earlier this year, however, the interaction was not yet of sufficient utility to make a strong connection in the conference agenda.

Google and the semantic web

A good example of the development of the relationship between big data and the semantic web is Google. Early on, Google search eschewed explicit use of semantics, preferring to infer a variety of signals in order to generate results. They used big data to create signals such as PageRank.
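
PageRank itself makes the point concrete: a useful signal computed from nothing but link data at scale. Below is a minimal power-iteration sketch in Python on a toy graph; the function and data are illustrative only, not Google's implementation, which runs over billions of pages.

    # A toy PageRank: repeatedly share each page's rank across its outlinks.
    def pagerank(links, damping=0.85, iterations=50):
        """links maps each page to the list of pages it links to."""
        pages = set(links) | {t for targets in links.values() for t in targets}
        rank = {p: 1.0 / len(pages) for p in pages}
        for _ in range(iterations):
            new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
            for page, targets in links.items():
                for target in targets:
                    # Each page passes a damped share of its rank downstream.
                    new_rank[target] += damping * rank[page] / len(targets)
            rank = new_rank  # dangling pages simply leak rank in this sketch
        return rank

    print(pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]}))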

Now, as the search algorithms mature, Google's mission is to make their results ever more useful to users. To achieve this, their software must start to understand more about the actual world. Who's an author? What's a recipe? What do my friends find useful? So the connections between entities become more important. For this, Google is using data from initiatives such as schema.org, RDFa and microformats.
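
To make that concrete, here is a minimal sketch using Python's rdflib library to state the kind of facts a publisher might describe with schema.org vocabulary; real pages would carry these statements as RDFa or microdata in their HTML. The recipe URL and values are hypothetical.

    from rdflib import Graph, Literal, Namespace, URIRef
    from rdflib.namespace import RDF

    SCHEMA = Namespace("http://schema.org/")

    g = Graph()
    recipe = URIRef("http://example.com/recipes/pancakes")  # hypothetical page

    # Three facts about the page's subject: it's a recipe, with a name and author.
    g.add((recipe, RDF.type, SCHEMA.Recipe))
    g.add((recipe, SCHEMA.name, Literal("Basic pancakes")))
    g.add((recipe, SCHEMA.author, Literal("Jane Cook")))

    print(g.serialize(format="turtle"))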

Google do not use these semantic web techniques to replace their search, but rather to augment it and make it more useful. To get all fancypants about it: Google are starting to promote the information they gather toward being knowledge. They even renamed their search group as “Knowledge”.

Metadata is hard: big data can help

Conventionally, semantic web systems generate metadata and identified entities explicitly, i.e. by hand or as the output of database values. But as anybody who's tried to get users to do it will tell you, generating metadata is hard. This is part of why the full semantic web dream isn't yet realized. Analytical approaches work the other way around: surfacing and classifying the metadata from analysis of the actual content and data itself. (Freely exposing metadata is also controversial and risky, as open data advocates will attest.)

Once big data techniques have been successfully applied, you have identified entities and the connections between them. If you want to join that information up to the rest of the web, or to concepts outside of your system, you need a language in which to do that. You need to organize, exchange and reason about those entities. It’s this framework that has been steadily built up over the last 15 years with the semantic web project.
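
A minimal sketch of that joining-up step, again assuming Python's rdflib: once your pipeline has resolved an entity, a single owl:sameAs triple connects your internal identifier to a shared public one, so outside data about the entity can be merged with yours. The internal URI here is hypothetical.

    from rdflib import Graph, URIRef
    from rdflib.namespace import OWL

    g = Graph()

    # An entity identified by a (hypothetical) internal big data pipeline.
    internal = URIRef("http://example.com/entities/1234")

    # Declare it identical to a public identifier anyone else can link to.
    g.add((internal, OWL.sameAs, URIRef("http://dbpedia.org/resource/IBM")))

    print(g.serialize(format="nt"))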

To give an already widespread example: many data scientists use Wikipedia to help with entity resolution and disambiguation, using Wikipedia URLs to identify entities. This is a classic use of the most fundamental of semantic web technologies: the URI.
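
At its simplest, that practice is just a lookup from surface forms to Wikipedia URLs, as in the illustrative sketch below; production systems would of course also use surrounding context to disambiguate.

    # Hypothetical alias table: mentions seen in text -> Wikipedia URL.
    ALIASES = {
        "ibm": "http://en.wikipedia.org/wiki/IBM",
        "big blue": "http://en.wikipedia.org/wiki/IBM",
        "apple": "http://en.wikipedia.org/wiki/Apple_Inc.",
    }

    def resolve(mention):
        """Map a textual mention to a Wikipedia URL serving as an entity URI."""
        return ALIASES.get(mention.lower())

    # Two different mentions resolve to the same URI: one entity, disambiguated.
    print(resolve("IBM") == resolve("Big Blue"))  # True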

For Strata, as our New York series of conferences approaches, we will start to include a little more semantic web content, but with a strict emphasis on utility.

Strata itself is not so much about big data as about being data-driven, and the ongoing consequences that has for technology, business and society.

  • http://metacert.com Paul Walsh

    On June 28th we will announce a massive implementation of metadata that will make a huge difference in protecting children from adult content.

  • http://radar.oreilly.com/edd Edd Dumbill

    Skott,

    You miss my main point: that semantic web technologies are needed to describe and exchange the results of analyzing big data.

    That said, there will doubtless be a need for collecting and processing big semantic web data as well, though the problems of distributed graph analysis are harder there, and it may be a while before they're solved.

  • http://textiplication.com Skott Klebe

    What about tools? The last time I looked at RDF triplestores, the largest in the world ranged into the billions of triples. A single document could easily run to dozens of triples, which would mean that a modest hundred-million-row dataset of today would fill the largest triple store in the world.
    If the semantic web needs big data to get traction, where’s the big data going to be stored?

  • http://dannyayers.com Danny

    Edd, great stuff. I agree entirely about semweb tech offering a means of working with what big data has to offer. And Google’s shift to “Knowledge” is notable.

    re. tools – even without big true triplestores, it’s still possible to express data from other kinds of stores as RDF. An extreme example is Linked Data on the Web. But ok, for processing purposes such material does need to be combined. But unlike a lot of big data (e.g. text corpuses for search), semantic web data is naturally split up (grr, there’s a word I’ve forgotten) along useful axes. So given a particular task you can query/filter, merge and re-query as often as you like in a relatively focused fashion. Big storage should rarely be necessary.

  • http://uoccou.wordpress.com uoccou

    Aren’t you just talking about Linked Data?

  • http://mashraqi.com Frank Mashraqi

    “My answer in brief is: big data’s going to give the semantic web the massive amounts of metadata it needs to really get traction.”

    Really???

    No comment on your following statements:

    “Once big data techniques have been successfully applied”

    May I ask what big data technologies you are referring to?

    Sorry, but I expect a lot more from the chair of Strata…

    Frank

  • http://on-meaning.blogspot.com Yuriy Guskov

    Stop bothering users with metadata. I’m serious. An ordinary user is not obliged to know about metadata, taxonomies, or ontologies. The problem with generating metadata is not really the generating, but the fact that the Semantic Web does not allow a human-friendly and iterative approach. Web 1.0 did, and therefore it gradually filled with information. And who will use Semantic Web metadata? Applications? Then what reason do users have to generate it? They want to generate information that they themselves will use. The Semantic Web, its formats and its tools are something fictitious to users. Moreover, did you hear the complaints from developers (not users) that the Semantic Web is too complex and cumbersome for them? And you want to make users use it? How?

    I see only one solution: the Semantic Web should be human-friendly. Please keep it simple. At the same time, if you really want to integrate semantics, you should think more broadly. It should concern not only the Semantic Web itself, but also the ways we interact with computers, the user interface and the file system. I’m talking about a sort of semantic ecosystem. My proposal on that topic can be found at:

    http://on-meaning.blogspot.com/2011/06/great-blunders-of-modern-it-and-their.html

    I’m not saying it is the only possible solution, but at least it can hint at what might be changed around semantics.

    PS: By the way, the identification with Wikipedia described in this article is not really appropriate, because URIs by themselves are not appropriate. To start with, a URI identifies an information resource, whereas real objects are not information resources. Moreover, ideally such identifiers should cover anything: for example, you may need to identify a hotel, and a room in it. Should each hotel have its own specific URI, or ask some provider to issue one? I think identification should be more flexible. First, it should not be URI-based, which would allow cloud-based identification, that is, different providers maintaining the same identifier. Second, identification should be decentralized, because people and organizations will want their own identification systems (do the math: billions of people and organizations, each of which may have thousands of identifiers specific only to them). Third, identifier routing is needed (because the same identifier may have different subjective meanings).

  • http://www.tipradar.com Mike

    What about Wolfram Alpha? I haven’t heard much about it.