Project #2: Vertical Search Engine
(DUE: Oct. 30, Nov. 6, 11AM)
Team Assignment (4 or 5)
- Xiaoyan (Wendy) Xie, Wendy Yao, Alice Shapiro, Dragan Bogunovic, and Kang Zhao
- Nunmanus Dachapratumvan, Shuguang Suo, Kun Chen, and Suwan
- Carolyn Hafernik, Michael K. Hills, Louise M. Campbell, Mohammed Jarrahi, and Edward Su
- Haibin Liu, Shaoke Zhang, Bo Yu, Honglu Du, and Bi Chen
- Audrey Lim, Puck Treeratpituk, Bob Stark, and Jonah
A vertical search engine is a niche search engine that focuses on specific
domains and/or businesses. Unlike MSN, Google, or Yahoo, which aim to cover
the entire Web as completely as possible, a vertical search engine drills
down into a focused area for deeper coverage (thus "vertical"). In Computer
Science jargon, your search engine will employ depth-first search, so to
speak, instead of breadth-first search. In Project #2, your task is to build
a small (less than 3 GB) vertical search engine using the Apache Nutch
toolkit. In a nutshell, your team needs to pick a domain to cover
(e.g., Penn State Sports, PA Used Cars, PA State Parks), crawl only
relevant web pages, build an index and DB, and provide keyword search
capability through a web interface.
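To make the "build index, then keyword search" idea concrete, here is a minimal sketch of a toy inverted index in Python. It is not how Nutch builds or queries its indexes; the page URLs and texts are hypothetical, and the AND-only query semantics are an assumption chosen for brevity.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each token to the set of page URLs that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for token in tokenize(text):
            index[token].add(url)
    return index


def search(index, query):
    """Return the sorted URLs that contain every token of the query."""
    tokens = tokenize(query)
    if not tokens:
        return []
    hits = set.intersection(*(index.get(t, set()) for t in tokens))
    return sorted(hits)


# Hypothetical crawled pages, for illustration only.
pages = {
    "http://example.edu/parks": "Hiking trails in PA state parks",
    "http://example.edu/cars": "Used cars for sale in PA",
}
index = build_index(pages)
```

With these two sample pages, `search(index, "pa parks")` matches only the parks page, while a query for `"PA"` alone matches both.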
- A server (UNIX machine) is set up with the address of ist511.ist.psu.edu (184.108.40.206).
- The server can be accessed only over a secure channel using the SSH protocol.
- Download an SSH client from https://downloads.its.psu.edu/ => "File Transfer"
- A web server (Tomcat V 6.0.14) is already installed. Nutch V 0.9 has been downloaded to the server, but not yet copied/installed to each team's account. A more detailed Nutch tutorial will be given on Oct. 2.
- The server has most typical UNIX software, including:
- wget: to download things using URL address
- nano: a small editor that Windows users may find useful/familiar
- Emacs: full-fledged UNIX editor
- Each team shares a single UNIX account (ID/PWD to be given in class) to work on the project.
- Each account can use up to 3 GB of disk space.
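Since each account has a fixed disk budget, a team may want to check its usage while crawling. The following is a minimal sketch, assuming that "3 GB" means 3 GiB and that stopping at 90% of the quota is a sensible safety margin; both are assumptions, and the server's own accounting (for example, block-level usage) may differ from a simple sum of file sizes.

```python
import os

QUOTA_BYTES = 3 * 1024 ** 3  # assumption: "3 GB" is read as 3 GiB


def dir_size_bytes(root):
    """Sum the sizes of all regular files under root (symlinks are skipped)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


def within_quota(used_bytes, quota_bytes=QUOTA_BYTES, safety_margin=0.9):
    """Return True while usage stays below a fraction of the quota."""
    return used_bytes < quota_bytes * safety_margin
```

A crawl script could call `within_quota(dir_size_bytes(crawl_dir))` between crawl rounds, where `crawl_dir` is whatever directory the team chooses for its crawl output.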
- Step 1: Install/configure Apache Nutch. Play with it.
- Step 2: Devise a logic (or algorithm) that makes the Nutch crawler focus on specific domains only (e.g., crawl only domains matching "*.psu.edu", or crawl pages only if they contain a token like "used car"). Make your logic as sophisticated/effective as possible to avoid unnecessary crawling. Implement your logic and connect it to Nutch.
- Step 3: Crawl pages according to the devised logic (watch out for the 3 GB space limit and the robots exclusion protocol).
- Step 4: Build indexes using Nutch after parsing, cleaning, and analyzing the crawled pages
- Step 5: Devise your own query interfaces on the Web (both input and output)
- Step 6: Hook up your query interfaces to indexes
- Step 7: Bells and Whistles -- add at least ONE interesting and novel feature to your search engine
- For instance, one can index more than HTML pages (e.g., PDF, Word, web services, image, video) like Google does
- Step 8: Final report explaining how your team carried out Steps 1-7
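The focusing logic in Step 2 and the robots check in Step 3 can be sketched roughly as follows. This is a standalone Python illustration, not Nutch code and not the required design: the function names are hypothetical, the domain suffix and token are taken from the examples in Step 2, and the robots.txt content is passed in as a string so that the sketch needs no network access. How such logic is connected to Nutch is left to each team.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

ALLOWED_SUFFIX = "psu.edu"   # from the "*.psu.edu" example in Step 2
REQUIRED_TOKEN = "used car"  # from the token example in Step 2


def in_allowed_domain(url, suffix=ALLOWED_SUFFIX):
    """True if the URL's host is the suffix itself or a subdomain of it."""
    host = (urlparse(url).hostname or "").lower()
    return host == suffix or host.endswith("." + suffix)


def mentions_token(page_text, token=REQUIRED_TOKEN):
    """True if the page text contains the token, ignoring case."""
    return token.lower() in page_text.lower()


def allowed_by_robots(robots_txt, url, user_agent="*"):
    """Check a URL against robots.txt content that was already downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def should_fetch(url, robots_txt):
    """Decide before downloading: right domain and not excluded by robots."""
    return in_allowed_domain(url) and allowed_by_robots(robots_txt, url)


def should_keep(page_text):
    """Decide after downloading: keep only pages that mention the token."""
    return mentions_token(page_text)
```

Splitting the decision into a cheap pre-download check (`should_fetch`) and a post-download check (`should_keep`) is one possible design; it avoids fetching pages that can be ruled out from the URL alone, which helps with the disk limit.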
Turn-In @ ANGEL Drop Box
- Step 9: Project #2 Team demo on Nov. 6
- 30 minutes per team
- All members must participate in the presentation
- Turn in both your final report and presentation slides at ANGEL by the due date/time (HARD DEADLINE)
- Your final report should describe in detail how you carried out each step above, and at the end describe how the work was split among team members (i.e., who did what)
- Also, please include one component that shows how you used the given TabletPC among project members (e.g., a pen-based drawing of your overall Project #2 architecture, handwritten memos from your project meetings, etc.)
- Your presentation slides should take no more than 30 minutes
- Unless there is a problem between team members, all members will share the same project score.