IST 511
Project #3: People Name Disambiguation
(DUE: Dec. 18, 11AM)


Team Assignment
Overview
Web search based on people’s names has always been one of the most popular query types. A recent study reports that around 30% of search engine queries involve some form of people’s names. However, people’s names are inherently ambiguous (e.g., about 90,000 distinct names are shared by 100 million people in the US). For instance, when one uses Google to look for recent articles by a mathematician “John Doe” from Penn State, the returned web pages may include: (1) several web pages of that very mathematician (with slightly different name spellings such as “John Doe”, “J. Doe”, or “Dr. John Arthur Doe”), (2) web pages of several different “John Doe”s mixed together because they share the same name spelling, or (3) a mix of both cases. With such messy results, one has to sift through the pages to find the right home page of the Penn State mathematician “John Doe”. Ideally, search engines should present pages in groups such that each group contains only pages of a single unique person, regardless of name variations. Therefore, given web pages in which people’s names appear, being able to determine whether two names actually refer to the same person is key to realizing this ideal search solution.

Such a problem is often known as the Web People Name Disambiguation (WPND) problem: i.e., given N web pages returned for a name query X, group the N pages into K clusters such that pages within each cluster refer to the same real person while pages in different clusters refer to different real people. Note that one of the challenges of the WPND problem is that the number of clusters (i.e., K) is NOT known a priori. Since the WPND problem is of both theoretical interest and practical importance, many interesting results have started to appear in recent years. Furthermore, recent contests focusing on the WPND problem have attracted a lot of attention, too. For instance, see the recent SPOCK challenge with its award of $50,000.
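To make the problem statement concrete, here is a minimal sketch of one way to group N pages when K is not known in advance: single-link agglomerative clustering with a similarity threshold, so that the number of clusters falls out of the data instead of being supplied. This is only an illustration, not the PSNUS baseline; the function names, the bag-of-words features, and the threshold value of 0.3 are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a page and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_pages(pages, threshold=0.3):
    """Single-link clustering: two pages end up in the same cluster when
    they are connected by a chain of pairs with similarity >= threshold.
    K is not an input; it is simply the number of clusters returned.
    (Hypothetical helper; the threshold is an assumed tuning parameter.)"""
    vectors = [Counter(tokenize(p)) for p in pages]
    parent = list(range(len(pages)))  # union-find forest over page indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(pages)):
        for j in range(i + 1, len(pages)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                parent[find(i)] = find(j)  # merge the two clusters

    groups = {}
    for i in range(len(pages)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Four made-up pages returned for the query "John Doe".
pages = [
    "John Doe professor of mathematics at Penn State algebra topology",
    "J. Doe mathematics department Penn State algebra seminar",
    "John Doe guitar player rock band tour dates",
    "John Doe band releases new rock album tour",
]
print(cluster_pages(pages))  # [[0, 1], [2, 3]]
```

Raising the threshold yields more, smaller clusters and lowering it yields fewer, larger ones, so in this sketch choosing the threshold plays the role that choosing K plays in methods that need K up front.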

In project #3, your task is to solve the WPND problem using a small test data set from a recent academic competition on the WPND problem, called the Web People Search Track (WePS). Please read the detailed task description of WePS. To solve the WPND problem, students may: (1) further improve the given baseline solution developed at PSU for the 2007 competition, "PSNUS: Web People Name Disambiguation by Simple Clustering with Rich Features", (2) apply any other data mining techniques for better clustering (e.g., K-means), (3) clean/process the test data further for improved results, or (4) use any mix of these. As long as your solution clusters the mixed web pages of the WePS data set better, you are allowed to use ANY techniques. The better your team's results (compared to the PSNUS results), the higher your score (and you might be invited to join PSU's research team for the next WePS competition in 2008).
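Since results are compared against the PSNUS results, it helps to be able to score a clustering against a gold standard. The sketch below assumes purity, inverse purity, and their harmonic-mean F-measure, which are commonly used for this kind of clustering task; the metric that actually counts is the one defined in the WePS task description. The function names and the toy data are hypothetical, and the code assumes non-empty, non-overlapping clusters.

```python
def purity(system, gold):
    """For each system cluster, count the pages it shares with its
    best-matching gold cluster; return the matched fraction of all pages."""
    total = sum(len(cluster) for cluster in system)
    matched = sum(
        max(len(set(cluster) & set(g)) for g in gold) for cluster in system
    )
    return matched / total


def f_measure(system, gold, alpha=0.5):
    """Weighted harmonic mean of purity and inverse purity.
    Inverse purity is purity with the two clusterings swapped."""
    p = purity(system, gold)
    inverse_p = purity(gold, system)
    return 1.0 / (alpha / p + (1.0 - alpha) / inverse_p)


# Toy example: page indices 0..3, two real people in the gold standard.
gold = [[0, 1], [2, 3]]
system = [[0, 1, 2], [3]]  # one page placed with the wrong person

print(purity(system, gold))     # 0.75
print(purity(gold, system))     # 0.75 (inverse purity)
```

Purity alone rewards putting every page in its own cluster, and inverse purity alone rewards putting all pages in one cluster, which is why the two are usually combined.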

Unlike projects #1 and #2, project #3 is research-intensive and involves some coding. The task is substantially MORE DIFFICULT than projects #1 and #2, so start EARLY!!


TA
A CSE graduate student, Ergin Elmacioglu (elmaciog@cse.psu.edu), who led the PSNUS effort for the WePS competition, will be helping students with project #3. Please direct your questions to Ergin via email first.

Set-Up
Task
Turn-In @ ANGEL Drop Box

The PIKE Group