Sat

Apr 14
2007

Tim O'Reilly


The Spock Entity Resolution Challenge

I wrote in my previous entry about Spock that entity resolution is one of the key aspects of people search. Spock already does a pretty good job of this, but they want to get even better. As a result, they've offered the $50,000 Spock Entity Resolution Challenge.

From the challenge site:

To improve our technology and to create a better user experience, we decided to share the fun! We have selected one of our most interesting problems, namely Entity Resolution, to share with the community, allowing other leading computer scientists and engineers to compete in an open contest. The winners of this global competition will reap a handsome reward, and perhaps even employment at Spock.

You can work individually and in teams. The competition will last 4 months and the winning team will win a Grand Prize of $50,000! Most importantly you’ll be working on a very important and widely applicable problem. We will also be issuing prizes for 2nd and 3rd place.

A common problem that we face is that there are many people with the same name. Given that, how do we distinguish a document about Michael Jackson the singer from Michael Jackson the football player?

With billions of documents and people on the web, we need to identify and cluster web documents accurately to the people they are related to. Mapping these named entities from documents to the correct person is the essence of the Spock Challenge.

In order to constrain the problem so that it can be successfully solved by an individual or a small team, we provide you with real world data with ground truth. This data contains 100,000 documents about people, and the challenge is to determine all the distinct people described in the data set. This data can be your training set. Once you’ve got your basic algorithm working against the training set, we let you further tune your code by running it against a second test data set.

We give you instant accuracy feedback in the form of a percentage rank score. The score depends on how many correct unique people you can identify in the data. This way you can continue to refine your work and see how well you are holding up against your competitors.
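To make the clustering task concrete, here is a minimal sketch, assuming each document has already been reduced to a name plus a set of context words. It is not Spock's method or the contest's data format: the `docs` structure, the Jaccard similarity measure, the `0.25` threshold, and the sample documents are all hypothetical choices for illustration.

```python
from itertools import combinations


def jaccard(a, b):
    """Overlap between two sets of context words (0.0 to 1.0)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster_documents(docs, threshold=0.25):
    """Group documents that appear to describe the same person.

    docs maps a document id to (name, set of context words).
    Two documents are linked when their names match and their context
    overlap reaches the threshold; links are merged with union-find.
    """
    parent = {doc_id: doc_id for doc_id in docs}

    def find(x):
        # Follow parent pointers to the cluster root, shortening the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(docs, 2):
        name_a, words_a = docs[a]
        name_b, words_b = docs[b]
        if name_a == name_b and jaccard(words_a, words_b) >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for doc_id in docs:
        clusters.setdefault(find(doc_id), set()).add(doc_id)
    return sorted(clusters.values(), key=lambda c: sorted(c))


# Hypothetical toy data: four documents that all mention "Michael Jackson".
docs = {
    "d1": ("Michael Jackson", {"singer", "album", "thriller", "pop"}),
    "d2": ("Michael Jackson", {"pop", "singer", "tour", "album"}),
    "d3": ("Michael Jackson", {"football", "linebacker", "nfl"}),
    "d4": ("Michael Jackson", {"nfl", "football", "draft"}),
}
people = cluster_documents(docs)
print(people)
```

On this toy input the singer documents and the football documents land in two separate clusters. Real documents share far less clean vocabulary, which is presumably where the difficulty of the challenge lies.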

(Jeff Jonas: you don't need the money, but I'll bet that Jaideep Singh, Jay Bhatti, and the other folks at Spock would love to get to know you! For those of you who don't know what I'm talking about, here's a collection of links to stories about Jeff Jonas' work on entity resolution.)


tags: web 2.0 | comments: 12

1 TrackBacks

TrackBack URL for this entry: http://blogs.oreilly.com/cgi-bin/mt/mt-t.cgi/5427

» Spock Entity Extraction Competition from Most Casual Observer

The Spock people search engine is running a competition similar to the Netflix Prize. The Spock Challenge started at 9 AM this morning (April 16th), runs for four months, and has a grand prize of $50,000. Unfortunately, the criteria for... Read More

Comments: 12

  Robert Dewey [04.14.07 09:33 AM]

Spock looks interesting... It's definitely a step in the right direction, but I don't think it's the ultimate solution.

Spock will probably be really good at finding individuals that are at least somewhat popular. I'd be more interested in finding individuals that are closely related to me on the web, while having the ability to search for data through those individuals.

  dave mcclure [04.14.07 11:54 AM]

@Robert: although the examples Tim highlighted above are for well-known folks, actually Spock *DOES* have average joes in the system as well.

depending on the search criteria you enter, you should be able to find non-famous people quite easily, based on specific attributes or interests they have.

there's also a fun way to triangulate on people who share similar interests by doing multi-tag searches and finding out who shares your hobbies or habits.

[full disclosure: i'm an advisor for Spock... and a fan :) ]

  Robert Dewey [04.14.07 11:58 AM]

Thanks for the response, Dave. It should be pretty interesting - can't wait to see it come out of private beta.

  Steve Jones [04.14.07 12:14 PM]

As someone with the name Steve Jones I am excited to hear this!

I was pretty excited when I got in the New York Times last year but it is still almost impossible to find me on the web. On any given day you can search Google news and people with the name Steve Jones have almost surely won a sporting event, gone to jail and released a new record. At one point I almost changed my name!

  rick gregory [04.14.07 12:41 PM]

Interesting problem and a good move by the Spock team. But... if I want Michael Jackson the football player I can search for "michael jackson football". So, while the technological problem is tough and a lot of interesting stuff will likely fall out of trying to solve it, is this really a problem for the people searching, or only for a company like Spock?

I think the far more interesting uses will come from what Tim said in his previous post: "In a lot of ways, my business is based on the ability to find the right person, the person who knows the most about a given topic and can write about it, or present about it at a conference, or point to other interesting people."

Ah.. now you're looking for a person with certain qualities... who has specific expertise or is associated with a domain of knowledge. That's much harder to find using Google... if Spock can do that it will be very useful.

  Tim O'Reilly [04.14.07 02:36 PM]

rick --

the point about Michael Jackson the football player vs. Michael Jackson the singer is that you might never see the former on a normal search engine because of the latter. This can make a big difference when searching for someone whose name you know, but not much else.

Imagine for a minute that you knew someone in grade school named Eric Schmidt. You don't know much of what happened in the 30 years since. If you're lucky, there's a web page that mentions the school, but if not, you're out of luck, unless you page a LONG way through Google's results. He might now be "Eric Schmidt punk rocker" or "Eric Schmidt corporate attorney" but you don't know enough to do either search.

  Jeff Dalton [04.15.07 09:26 PM]

Search engines can, in many cases, identify "clusters of meaning" for ambiguous queries and select a number of results from each cluster in order to reflect the real-world diversity of meaning (similar to clustering results by website). The Michael Jackson situation can possibly be handled with such clustering, utilizing searcher context (I just searched for NFL stats), personalization (I am a football fan), or in many cases simple query refinement. What's much more compelling about Spock is that it goes far beyond simple search; it allows you to see relationships between people and other entities (companies, topics, etc.) expressed as tags.

Spock reminds me of the DBLife project. DBLife is a more focused example, a database of database researchers and related entities mined from the web. For an example, see the entry on Jeff Jonas. DBLife has a rich set of relationships to other people and things (in my opinion better than the "tags" in Spock). DBLife is a prototype of CIMPLE, a collaborative information extraction platform, from the University of Wisconsin and Yahoo! Research.


  pwb [04.16.07 03:15 PM]

Whatever happened to the Netflix challenge?

  Gok Mop [04.16.07 06:33 PM]

Smart of Spock to take this approach. Entity resolution is *not* an easy problem, and it's a great idea to get anyone who has a novel approach to the problem to come out of the woodwork.

The problem with this stuff that I've seen in the past is context. Assume for a moment that the software is as smart as a human (which is a *huge* assumption, but play with me here for a minute...). There are many situations where the human doesn't have enough context in the data in order to tell.

People also probably shouldn't do entity resolution without extra weightings or criteria that indicate a bias against or towards false positives or false negatives. E.g. which is worse: incorrectly assuming that two "John Smith"s are actually the same person, or incorrectly assuming that they are different people? Should you program your entity resolution to be "greedy" (err on the side of connecting people) or "miserly" (err on the side of keeping them apart)? Depends on how the data will be used.
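The greedy-versus-miserly trade-off can be sketched with pairwise precision and recall at two merge thresholds. This is only an illustration under stated assumptions: the similarity scores, the person labels, and the threshold values below are invented, and it judges each pair of documents independently rather than building full clusters.

```python
def pairwise_precision_recall(scores, truth, threshold):
    """Evaluate merge decisions for one threshold.

    scores maps a pair of document ids to a similarity in [0, 1].
    truth maps each document id to the person it really describes.
    A pair is merged when its similarity reaches the threshold.
    """
    tp = fp = fn = 0
    for (a, b), similarity in scores.items():
        same_person = truth[a] == truth[b]
        merged = similarity >= threshold
        if merged and same_person:
            tp += 1  # correctly connected
        elif merged:
            fp += 1  # two different people wrongly connected
        elif same_person:
            fn += 1  # one person wrongly split in two
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall


# Hypothetical data: three documents about person A, one about person B.
truth = {"d1": "A", "d2": "A", "d3": "A", "d4": "B"}
scores = {
    ("d1", "d2"): 0.9,
    ("d1", "d3"): 0.4,
    ("d2", "d3"): 0.5,
    ("d1", "d4"): 0.45,
    ("d2", "d4"): 0.1,
    ("d3", "d4"): 0.2,
}

greedy = pairwise_precision_recall(scores, truth, threshold=0.3)
miserly = pairwise_precision_recall(scores, truth, threshold=0.8)
print("greedy  (precision, recall):", greedy)
print("miserly (precision, recall):", miserly)
```

With these made-up numbers the greedy threshold connects every true pair but also merges one pair of different people, while the miserly threshold makes no false merges and misses two of the three true pairs. Which error is cheaper depends, as the comment says, on how the data will be used.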

  Walter Underwood [04.16.07 08:02 PM]

The Netflix Prize is alive and well at http://www.netflixprize.com/ -- check out the leaderboard.

The Spock Challenge seems greedy to me. The Netflix Prize requires a non-exclusive license, while the Spock Challenge requires all rights to both code and algorithms for 1/20th the reward. You could easily spend the entire $50K prize just clearing patent rights.

I wrote a more thorough post about this on my blog.

Hey Tim, there is a trackback in your approval queue.

  mxtbcca [04.17.07 10:38 AM]

There's also Michael Jackson the beer critic, author of World Guide to Beer (published 1977) and host of the public television series The Beer Hunter. More on Mike at http://www.beerhunter.com/

  People Finder [12.12.07 09:57 PM]

Spock.com may change this concept eventually, but "people search" for the foreseeable future will always be "people searches". So far, not one "people search" tool has proven to be the be-all end-all killer app for finding people, whether it be a free site like Spock.com or a paid database like Accurint and others. Sure, some people can be found using one search or another, like free white pages or a free people search engine like Spock.com or Wink.com, but many hard-to-find people will still be found the old-fashioned way, through databases (both free and fee-based) and the creativity and ingenuity of skip tracing, investigating and public records research.
