Big data and the semantic web

At war, indifferent, or intimately connected?

On Quora, Gerald McCollum asked if big data and the semantic web were indifferent to each other, as there was little discussion of the semantic web topic at Strata this February.

My answer in brief is: big data’s going to give the semantic web the massive amounts of metadata it needs to really get traction.

As the chair of the Strata conference, I see a vital link between big data and the semantic web, and have my own roots in the semantic web world. Earlier this year, however, the interaction between the two was not yet useful enough to justify a strong presence in the conference agenda.

Google and the semantic web

A good example of the development of the relationship between big data and the semantic web is Google. Early on, Google search eschewed explicit use of semantics, preferring to infer a variety of signals in order to generate results. They used big data to create signals such as PageRank.

Now, as the search algorithms mature, Google’s mission is to make their results ever more useful to users. To achieve this, their software must start to understand more about the actual world. Who’s an author? What’s a recipe? What do my friends find useful? So the connections between entities become more important, and Google is using data from initiatives such as schema.org, RDFa and microformats to capture them.

Google do not use these semantic web techniques to replace their search, but rather to augment it and make it more useful. To get all fancypants about it: Google are starting to promote the information they gather toward being knowledge. They even renamed their search group as “Knowledge”.


Metadata is hard: big data can help

Conventionally, semantic web systems generate metadata and identified entities explicitly, i.e., by hand or as the output of database values. But as anybody who’s tried to get users to do it will tell you, generating metadata is hard. This is part of why the full semantic web dream isn’t yet realized. Analytical techniques take a different approach: surfacing and classifying the metadata from analysis of the actual content and data itself. (Freely exposing metadata is also controversial and risky, as open data advocates will attest.)

Once big data techniques have been successfully applied, you have identified entities and the connections between them. If you want to join that information up to the rest of the web, or to concepts outside of your system, you need a language in which to do that. You need to organize, exchange and reason about those entities. It’s this framework that has been steadily built up over the last 15 years with the semantic web project.
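To make the idea of a shared language concrete, here is a minimal sketch of writing an extracted connection as a subject-predicate-object statement in N-Triples, one of the plain-text RDF serializations. The example.org URIs and the to_ntriples helper are made up for illustration; only the schema.org author property is a real vocabulary term.

```python
# Each extracted connection is a (subject, predicate, object) triple of URIs.
# The example.org identifiers below are hypothetical placeholders.
TRIPLES = [
    (
        "http://example.org/article/1",
        "https://schema.org/author",
        "http://example.org/person/alice",
    ),
]


def to_ntriples(triples):
    """Serialize URI-only triples as N-Triples, one statement per line."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)


print(to_ntriples(TRIPLES))
```

This sketch handles only URI-valued objects. Literal values such as names or dates need quoting and escaping, which a real RDF library would take care of.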

To give an already widespread example: many data scientists use Wikipedia to help with entity resolution and disambiguation, using Wikipedia URLs to identify entities. This is a classic use of the most fundamental of semantic web technologies: the URI.
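As a rough illustration of that practice, the sketch below maps different surface mentions of an entity to a single Wikipedia URL, so that the URL acts as the shared identifier. The alias table and the function names are assumptions made up for this example; in practice the mapping would come out of the analysis itself rather than a hand-written dictionary.

```python
from typing import Optional
from urllib.parse import quote

WIKIPEDIA_BASE = "https://en.wikipedia.org/wiki/"

# Hypothetical alias table: lowercased mention -> Wikipedia article title.
ALIASES = {
    "google": "Google",
    "google inc.": "Google",
    "pagerank": "PageRank",
    "page rank": "PageRank",
}


def wikipedia_uri(title: str) -> str:
    """Build a Wikipedia URL from an article title (spaces become underscores)."""
    return WIKIPEDIA_BASE + quote(title.replace(" ", "_"))


def resolve(mention: str) -> Optional[str]:
    """Return the identifying URL for a mention, or None if it is unknown."""
    title = ALIASES.get(mention.strip().lower())
    return wikipedia_uri(title) if title is not None else None


# Two different spellings resolve to the same identifier.
print(resolve("Google Inc."))
print(resolve("google"))
```

Once every mention carries a URL like this, records from separate datasets can be joined on the URL rather than on inconsistent free-text names.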

For Strata, as our New York series of conferences approaches, we will be starting to include a little more semantic web, but with a strict emphasis on utility.

Strata itself is not so much beholden to big data as it is about being data-driven, and the ongoing consequences that has for technology, business and society.
