Project #3: People Name
- Team A:
Dachapratumvan, Audrey Lim, and Michael K. Hills
- Team B: Wendy Xie, Suwan
Juntiwasarakij, Bob Stark, Bo Yu, and Louise M.
- Team C:
Du, and Bi Chen
- Team D: Dragan
Zhang, and Jonah
- Team E: Kang
Hafernik, Haibin Liu, and Kun Chen
Search based on people’s names has always been
one of the most popular query types. A
recent study reports that around 30% of search engine queries
involve some form
of people names. However, people names are inherently ambiguous (e.g.,
distinct names are shared by 100 million people in the US). For instance,
when one is looking for recent articles of a mathematician “John
Doe” from Penn State
using Google, the returned web pages may include: (1) several web pages of
the mathematician (with slightly different name spellings such as
Doe”, or “Dr. John Arthur Doe”), (2) web
pages of several different “John Doe”s
mixed due to the same name spelling, or (3) both cases mixed. With
such messy results, one has to sift through
pages to find the right home page of Penn State mathematician
“John Doe”. Ideally,
search engines should present groups of pages such that each group
contains pages of a single unique person, regardless of name variations.
Therefore, given web pages in which people names appear, being able to determine
whether names actually refer to the same person or not is key to realizing this
goal. Such a problem is often known as the Web People Name
Disambiguation (WPND) problem: i.e., given
N web pages returned for
a name query X, group the N pages into K clusters such that pages within
a cluster refer to the same real person while pages across clusters refer
to different real people. Note that one of the challenges of the WPND problem is that the number of clusters (i.e., K) is NOT known a priori. Since
the WPND problem is of both
theoretical interest and practical importance, many interesting
results have started to appear in recent years. Furthermore, recent contests
focusing on the WPND problem have attracted a lot of attention, too.
For instance, see the recent SPOCK
challenge with its award of $50,000.
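Because K is not known a priori, one simple option is to let a similarity threshold, rather than a fixed cluster count, decide how many groups come out. The sketch below is only an illustration under stated assumptions and is not the PSNUS method: it assumes each page is already available as plain text, weights words with a smoothed TF-IDF, and links any two pages whose cosine similarity reaches a hypothetical `threshold` (single-link clustering). All function names and the default threshold of 0.15 are made up for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a page and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(pages):
    """Turn each page into a sparse TF-IDF vector (dict: term -> weight).

    Terms that occur in every page (such as the queried name itself)
    receive weight 0 under this smoothed IDF.
    """
    counts = [Counter(tokenize(p)) for p in pages]
    doc_freq = Counter(term for c in counts for term in c)
    n = len(pages)
    return [
        {t: tf * math.log((1 + n) / (1 + doc_freq[t])) for t, tf in c.items()}
        for c in counts
    ]


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cluster_pages(pages, threshold=0.15):
    """Single-link clustering: pages joined by a chain of similar pairs
    end up in one cluster, so K falls out of the threshold.

    The threshold is an assumed value and would need tuning on real data.
    """
    vectors = tfidf_vectors(pages)
    parent = list(range(len(pages)))  # union-find forest over page indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(pages)):
        for j in range(i + 1, len(pages)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(pages)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())
```

Calling `cluster_pages(list_of_page_texts)` returns a list of clusters, each a list of page indices. Single-link clustering tends to chain unrelated pages together on noisy data, so treat this as a starting point for experiments rather than a competitive solution.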
In project #3, your task is to solve the WPND problem using a small test data set
from a recent academic competition on the WPND problem, called
the Web People
Search Track (WePS). Please read the detailed task
description of WePS. To solve the WPND problem, students may:
(1) further improve the given baseline solution developed at PSU for the 2007
competition -- "PSNUS:
Web People Name Disambiguation by Simple Clustering with Rich Features",
(2) apply any other data mining techniques for better
clustering (e.g., K-means), (3) clean/process the test data
further for improved results, or (4) any mix of these, etc. As long as
your solution can cluster the mixed web pages of the WePS data set better, you
are allowed to use ANY techniques. The better the results your team
gets (compared to the PSNUS results), the higher the score you will get
(and you might be invited to join PSU's research team for
the next WePS competition in 2008).
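To compare your clustering against a baseline while experimenting, it helps to score it against the gold-standard grouping. The sketch below assumes the quality measures are purity, inverse purity, and their weighted harmonic mean, and it assumes every page belongs to exactly one cluster. These are assumptions made for illustration; the evaluation script installed on the server defines the actual scoring, and the function names here are hypothetical.

```python
def purity(system, gold):
    """Share of pages covered when each system cluster is matched to the
    gold cluster it overlaps most. Clusters are lists of page ids."""
    total = sum(len(cluster) for cluster in system)
    matched = sum(
        max(len(set(cluster) & set(reference)) for reference in gold)
        for cluster in system
    )
    return matched / total


def f_measure(system, gold, alpha=0.5):
    """Weighted harmonic mean of purity and inverse purity.

    Inverse purity is purity computed with the two clusterings swapped.
    alpha=0.5 weights both equally (an assumed default).
    """
    p = purity(system, gold)
    inverse_p = purity(gold, system)
    return 1.0 / (alpha / p + (1 - alpha) / inverse_p)
```

Purity penalizes clusters that mix different people, while inverse purity penalizes splitting one person across clusters, so putting every page in its own cluster scores perfect purity but poor inverse purity.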
Unlike projects #1 and #2, project #3 is a research-intensive project
that involves some coding. The task is substantially MORE DIFFICULT
than projects #1 and #2, so start EARLY!
A graduate student, Ergin Elmacioglu (email@example.com), who led
the PSNUS effort for the WePS competition, will be helping students with
project #3. Please direct your questions to Ergin via email first.
A server machine (UNIX machine) will be used for the project: ist511.ist.psu.edu
The server can be accessed only via a secure channel using the SSH protocol
Download an SSH client from https://downloads.its.psu.edu/
=> "File Transfer"
WePS-related code and data are already installed on the server:
file: contains detailed information about each sub-directory and its
directory: Test data are at "/home/wnam/weps/weps-data". However, do NOT copy the data to
your directory since its size is huge -- instead, since all data
are readable, use them directly from the
"/home/wnam/weps/weps-data" path. Read the INFO.txt file in it for more details of
the data set.
directory: This directory includes the evaluation script for your
directory: This directory includes the code and processed WePS data of
the PSNUS team's clustering scheme.
directory: CLUTO is a data mining software package, developed at the
University of Minnesota, for clustering low- and high-dimensional
datasets; it provides various clustering and analysis algorithms. Refer
to the manual.pdf for the description of the tool and input/output file
formats. CLUTO is installed for your convenience (you may use any other
data mining software).
The server has most typical UNIX software, including:
to download things using URL address
a small editor which Windows users may find useful/familiar
full-fledged UNIX editor
Each team shares a single UNIX account (ID/PWD to be given in class) to work
on the project.
Each account can use up to 3GB disk space.
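If you decide to use CLUTO, your page features must first be written in one of its input formats. The sketch below shows one way to serialize sparse feature vectors, assuming CLUTO's sparse matrix layout: a header line with the number of rows, columns, and non-zero entries, then one line per row of "column value" pairs with columns numbered from 1. Verify the layout against manual.pdf before relying on it; the function name and the dict-based vector representation are hypothetical.

```python
def to_cluto_sparse(vectors):
    """Serialize sparse vectors (dict: term -> weight) as sparse matrix text.

    Assumed layout: "rows cols nonzeros" header, then per row a list of
    "column value" pairs with 1-based column numbers.
    """
    # Only terms with a non-zero weight somewhere get a column.
    vocabulary = sorted({t for vec in vectors for t, w in vec.items() if w != 0})
    column = {term: i + 1 for i, term in enumerate(vocabulary)}
    nonzero = sum(1 for vec in vectors for w in vec.values() if w != 0)

    lines = [f"{len(vectors)} {len(vocabulary)} {nonzero}"]
    for vec in vectors:
        entries = sorted((column[t], w) for t, w in vec.items() if w != 0)
        lines.append(" ".join(f"{c} {w:g}" for c, w in entries))
    return "\n".join(lines) + "\n"
```

Write the returned string to a file in your own account directory; one row per web page keeps the row order aligned with your page list, which you will need when mapping cluster assignments back to pages.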
Turn-In @ ANGEL Drop Box
Using any software (open source data mining software or your own
code), cluster the given WePS data set into the correct sub-groups such
that each sub-group consists only of web pages for a single person.
- Turn in both your final
report and your code (zipped, if you wrote any) at ANGEL by
due date/time (HARD DEADLINE)
The report should describe how you carried out the specified task, and
at the end describe how the work was split among team members (i.e.,
who did what)
- Unless there is a problem between team
members, all members will share the same project score.