IST 511: Information and Technology

                           Fall 2008

                     Proj #3 (DUE: Dec. 3)

          Last Updated: Mon Oct  29 18:28:52 EDT 2008

______________________________________________________________________

Overview
--------

A vertical search engine is a niche search engine that focuses on a
specific domain and/or business. Unlike general-purpose search
engines such as MSN, Google, or Yahoo, which aim to cover the entire
Web as completely as possible, a vertical search engine drills down
into a focused area for deeper coverage (thus "vertical"). Loosely
speaking, in Computer Science jargon, your search engine favors depth
over breadth, much as depth-first search does compared with
breadth-first search. In Proj #3, your task is to build a small
vertical search engine using the Apache Nutch toolkit. In a nutshell,
your team needs to pick a domain to cover (e.g., College Football, US
Used Cars, CA State Parks), crawl only relevant web pages, build an
index and DB, and provide keyword search capability through a web
interface.

Set-Up
------

- A server (UNIX machine) is set up at the address
  ist511lws.ist.psu.edu (130.203.135.199). **CHANGED** from
  ist511.ist.psu.edu.

- The server can be accessed only over a secure channel using the SSH
  protocol. Download an SSH client from https://downloads.its.psu.edu/
  => "File Transfer"

- Tomcat and Nutch are already installed on the server (under each team's
  home directory). To start or stop the Tomcat server, all you need to
  do is type:

    start-tomcat
    stop-tomcat

  To run Nutch, at the command line, just type:

    nutch 

  or you can provide various parameters like:

    nutch [parameters]
  
  Nutch is installed under each team's home directory (e.g.,
  /home/team-ID/nutch-0.9) so that each team CAN change the
  configuration freely. Modify the files under "nutch-0.9/conf" to
  change the behavior of Nutch as you wish.

- The server has most of the typical UNIX software installed, including:
  o wget: to download things by URL
  o nano: a small editor that Windows users may find useful/familiar
  o Emacs: a full-fledged, powerful UNIX editor

- Each team shares a single UNIX account (ID/PWD to be given in class)
  to work on the project.  

- Each account can use roughly 2GB of disk space.
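
Since the disk limit is easy to hit while crawling, a small script can
report how much space a directory tree is using. The following is only
a sketch in standalone Python, not part of Nutch or of the server
set-up. The names dir_size_bytes, quota_fraction, and QUOTA_BYTES are
made up for illustration, and reading "2GB" as 2 * 1024^3 bytes is an
assumption; the actual limit enforced on the server may differ.

```python
import os

# Assumption: the quota is read as 2 GiB. The handout gives only an
# approximate figure, so treat this as a rough guide.
QUOTA_BYTES = 2 * 1024 ** 3


def dir_size_bytes(root):
    """Sum the sizes of all regular files under root (symlinks skipped)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue
            try:
                total += os.path.getsize(path)
            except OSError:
                # The file disappeared or is unreadable; skip it.
                pass
    return total


def quota_fraction(root, quota=QUOTA_BYTES):
    """Return the fraction of the (assumed) quota used by files under root."""
    return dir_size_bytes(root) / quota

# Example call (the path is hypothetical):
#   quota_fraction(os.path.expanduser("~/nutch-0.9"))
```

Checking the fraction between crawl rounds, and stopping well before it
reaches 1.0, leaves room for the index that is built afterwards.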

Task
----

- Step 1: Try running Tomcat and Nutch. Read relevant tutorials on the
Web, play with the tools, and get familiar with their usage.

- Step 2: Devise logic (or an algorithm) that makes the Nutch crawler
focus on specific domains only (e.g., crawl only hosts matching
"*.psu.edu", or crawl pages only if they contain a token like "used
car"). Make your logic as sophisticated/effective as possible to avoid
unnecessary crawling. Implement your logic and connect it to Nutch.

- Step 3: Crawl pages according to the devised logic (watch out for
the space limit and respect the robots exclusion protocol).

- Step 4: Build indexes using Nutch after parsing, cleaning, and
analyzing the crawled pages.

- Step 5: Devise your own query interface on the Web (both input and
output).

- Step 6: Hook up your query interface to the indexes.

- Step 7: Bells and Whistles -- add at least ONE interesting and novel
feature to your search engine. For instance, you can index more than
just HTML pages (e.g., PDF, Word, web services, images, video), as
Google does.

- Step 8: Final report explaining how your team carried out Steps 1-7

- Step 9: Project #3 Team demo & presentation on Dec. 3 (**CHANGED** from Dec. 10)

  o 30 minutes per team
  o All members must participate in the presentation
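
To make the kind of logic asked for in Steps 2 and 3 concrete, here is
a minimal sketch in standalone Python. It is NOT Nutch code, and how to
connect such logic to Nutch is left to your team. It combines the two
example rules from Step 2 (hosts matching "*.psu.edu", pages containing
the token "used car") with a robots.txt check. All function names and
the agent string "ist511-team-bot" are hypothetical, chosen only for
illustration.

```python
import re
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

ALLOWED_SUFFIX = "psu.edu"      # taken from the "*.psu.edu" example
TOPIC_TOKENS = ("used car",)    # taken from the "used car" example


def host_allowed(url, suffix=ALLOWED_SUFFIX):
    """True if the URL's host is the suffix itself or a subdomain of it."""
    host = (urlparse(url).hostname or "").lower()
    return host == suffix or host.endswith("." + suffix)


def text_on_topic(text, tokens=TOPIC_TOKENS):
    """True if any topic token occurs in the text (case-insensitive)."""
    # Collapse runs of whitespace so a token split across lines still matches.
    normalized = re.sub(r"\s+", " ", text).lower()
    return any(token in normalized for token in tokens)


def robots_from_text(robots_txt):
    """Build a robots.txt parser from the file's text (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def should_keep(url, page_text, robots, agent="ist511-team-bot"):
    """Keep a page only if host, robots rules, and topic all allow it."""
    return (host_allowed(url)
            and robots.can_fetch(agent, url)
            and text_on_topic(page_text))
```

The cheap checks (host and robots rules) need only the URL, so a
crawler can apply them before downloading a page; the token check needs
the page text and can only be applied afterwards. A plain substring
match is deliberately crude. A more effective design might weight
tokens, look at anchor text, or score pages before following their
links.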

______________________________________________________________________

Part 2: Turn-In of Proj #3 Report and Presentation Slides

Your report should describe in detail how you carried out each step
above, and at the end describe how the work was split among team
members (i.e., who did what).

Do NOT make your report unnecessarily lengthy -- be succinct and
get to the point. The length of your report has nothing to do
with the score that you receive.

If your project team has problems (e.g., some members don't
come to project meetings), you need to inform me early so that I
can intervene. Otherwise, all team members receive identical
scores for the project.

TURN-IN: Only *ONE* person from each team should submit the report and
slides to ANGEL, once, AND give me a hard copy in class.