IST 511: Information and Technology
Fall 2008
Proj #3 (DUE: Dec. 3)
Last Updated: Mon Oct 29 18:28:52 EDT 2008
______________________________________________________________________

Overview
--------

A vertical search engine is a niche search engine that focuses on
specific domains and/or businesses. Unlike general-purpose search
engines such as MSN, Google, or Yahoo, which aim to cover the entire Web
as completely as possible, a vertical search engine drills down into a
focused area for deeper coverage (thus "vertical"). Loosely, in Computer
Science jargon, your search engine goes depth-first instead of
breadth-first.

In Proj #3, your task is to build a small vertical search engine using
the Apache Nutch toolkit. In a nutshell, your team needs to pick a
domain to cover (e.g., College Football, US Used Cars, CA State Parks),
crawl only relevant web pages, build an index and DB, and provide
keyword search through a web interface.

Set-Up
------

- A server (UNIX machine) is set up with the address of
  ist511lws.ist.psu.edu (130.203.135.199)
  (**CHANGED** from ist511.ist.psu.edu).

- The server can be accessed only via a secure channel using the SSH
  protocol. Download an SSH client from
  https://downloads.its.psu.edu/ => "File Transfer"

- Tomcat and Nutch are already installed on the server (under each
  team's home directory). To start or stop the Tomcat server, all you
  need to do is type:

      start-tomcat
      stop-tomcat

  To run Nutch, at the command line, just type:

      nutch

  or you can provide various parameters like:

      nutch [parameters]

  Nutch is installed under each team's home directory (e.g.,
  /home/team-ID/nutch-0.9) so that each team CAN change the
  configuration freely. Modify things under "nutch-0.9/conf" to change
  the behavior of Nutch as you wish.

- The server has most of the typical UNIX software installed,
  including:
    o wget: to download things using a URL address
    o nano: a small editor that Windows users may find useful/familiar
    o Emacs: a full-fledged, powerful UNIX editor

- Each team shares a single UNIX account (ID/PWD to be given in class)
  to work on the project.

- Each account can use roughly 2GB of disk space.

Task
----

- Step 1: Try running Tomcat and Nutch. Read relevant tutorials on the
  Web, play with them, and get familiar with their usage.

- Step 2: Devise a logic (or algorithm) to make the Nutch crawler focus
  on specific domains only (e.g., crawl only domains matching
  "*.psu.edu", crawl pages only if they contain a token like
  "used car"). Make your logic as sophisticated/effective as possible to
  avoid unnecessary crawling. Implement your logic and connect it to
  Nutch.

- Step 3: Crawl pages according to the devised logic (watch out for the
  disk space limit and the robot exclusion protocol).

- Step 4: Build indexes using Nutch after parsing, cleaning, and
  analyzing the crawled pages.

- Step 5: Devise your own query interfaces on the Web (both input and
  output).

- Step 6: Hook up your query interfaces to the indexes.

- Step 7: Bells and Whistles -- add at least ONE interesting and novel
  feature to your search engine. For instance, one can index more than
  HTML pages (e.g., PDF, Word, web services, image, video) like Google
  does.

- Step 8: Final report explaining what your team did for Steps 1-7.

- Step 9: Project #3 Team demo & presentation on Dec. 3
  (**CHANGED** from Dec. 10)
    o 30 minutes per team
    o All members must participate in the presentation

______________________________________________________________________

Part 2: Turn-In of Proj #3 Report and Presentation slide

Your report should describe in detail what you did for each step above,
and at the end describe how the work was split among team members (i.e.,
who did what). Do NOT make your report unnecessarily lengthy -- be
succinct and get to the point.
The length of your report has nothing to do with the score you get.

If your project team has problems (e.g., some members don't come to
project meetings), inform me early so that I can intervene. Otherwise,
all team members share identical scores for the project.

TURN-IN: Only *ONE* person from each team submits the report and slides
to ANGEL, once, AND gives me a hard copy in class.
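
______________________________________________________________________

Hint for Steps 2 and 3

The focusing logic described in Steps 2 and 3 can be prototyped outside
Nutch before it is connected to the crawler. The sketch below is only an
illustration in Python, not Nutch code: the function names, the
"team-crawler" agent string, and the choice of "psu.edu" and "used car"
(taken from the examples in Step 2) are all assumptions. It combines a
domain check, a robot exclusion check against an already-downloaded
robots.txt, and a token check on the page text. How such logic is wired
into Nutch (e.g., through the files under "nutch-0.9/conf") is left to
each team.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

# Hypothetical settings, taken from the examples in Step 2.
ALLOWED_SUFFIX = "psu.edu"
REQUIRED_TOKEN = "used car"


def in_allowed_domain(url, suffix=ALLOWED_SUFFIX):
    """True if the URL's host is the suffix itself or a subdomain of it."""
    host = (urlparse(url).hostname or "").lower()
    return host == suffix or host.endswith("." + suffix)


def mentions_token(page_text, token=REQUIRED_TOKEN):
    """True if the text contains the token, ignoring case and extra whitespace."""
    normalized = " ".join(page_text.lower().split())
    return token.lower() in normalized


def allowed_by_robots(url, robots_txt, agent="team-crawler"):
    """True if the given robots.txt text permits `agent` to fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


def should_keep(url, page_text, robots_txt):
    """Keep a page only if it passes the domain, robots, and token checks."""
    return (in_allowed_domain(url)
            and allowed_by_robots(url, robots_txt)
            and mentions_token(page_text))
```

The checks are ordered from cheapest to most expensive: the domain test
needs only the URL, while the token test needs the downloaded page. A
more sophisticated logic could, for example, score pages by several
tokens instead of requiring a single one.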