This document contains instructions for downloading and installing Nutch and Lucene. Please be aware that you must be logged into the csci571 computer, and not aludra or nunki, to run Apache Tomcat.
This will install Nutch into a directory called “nutch” under the directory where you ran this command. So, if you ran this command from /home/bogus, you would have a directory called /home/bogus/nutch.
We’ll call the directory where you unpacked Nutch your $NUTCH_HOME.
Change into the Nutch directory and compile Nutch:
# cd nutch
# ant
If all is well and the build ran successfully, you should see a message such as the following:
compile:
job:
[jar] Building jar: /Users/mattmann/tmp/nutch/build/nutch-0.8.1.job
BUILD SUCCESSFUL
Total time: 27 seconds
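Beyond reading the Ant output, you can also check for the .job file that the build reports. The following is a minimal sketch, not part of Nutch itself: the function name check_nutch_build is hypothetical, and it assumes the job file lands in the build directory under your Nutch directory, as in the sample output above.

```shell
# Hypothetical helper (not shipped with Nutch): report whether a build
# left a nutch-*.job file under <nutch_home>/build.
check_nutch_build() {
    nutch_home="$1"
    for job in "$nutch_home"/build/nutch-*.job; do
        # If the glob matched nothing, $job is the literal pattern
        # and the -f test fails.
        if [ -f "$job" ]; then
            echo "build ok: $job"
            return 0
        fi
    done
    echo "no .job file found under $nutch_home/build" >&2
    return 1
}
```

For example, running check_nutch_build with the path to your nutch directory should print the job file's path if the build succeeded.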
Okay, now that Nutch is built, you can fetch some content. The Nutch wiki has a detailed, step-by-step set of instructions for fetching content: http://wiki.apache.org/nutch/NutchTutorial
Once you’ve fetched some content, you’ll probably want to browse it. To get Nutch set up on Tomcat, first build the Nutch webapp (run the command below from your nutch directory):
# ant war
The above command will construct a nutch-0.8.1.war file within $NUTCH_HOME/build. It will also construct a nutch.xml file within $NUTCH_HOME/build. The nutch.xml file is a Tomcat context.xml file that you can use to configure a WAR file for deployment within Tomcat.
First, make a directory for your Nutch WAR file and your Nutch context.xml file to live in. /usr/local/nutch is a good place.
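As a rough sketch of this step, the helper below creates the directory and copies both build products into it. The function name stage_nutch_webapp is hypothetical, and copying (rather than moving) the files is an assumption; the instructions only say that both files should live in the new directory.

```shell
# Hypothetical helper: copy the WAR file and the nutch.xml context file
# from the Nutch build directory into a deployment directory.
stage_nutch_webapp() {
    build_dir="$1"     # e.g. $NUTCH_HOME/build
    deploy_dir="$2"    # e.g. /usr/local/nutch
    mkdir -p "$deploy_dir" || return 1
    # cp fails (and we return 1) if either file is missing.
    cp "$build_dir"/nutch-*.war "$build_dir/nutch.xml" "$deploy_dir"/ || return 1
    echo "staged into $deploy_dir"
}
```

With the paths used in this document, the call would be: stage_nutch_webapp "$NUTCH_HOME/build" /usr/local/nutch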
Inside your /usr/local/nutch/nutch.xml file, set the property searcher.dir to the path of the Nutch index that you created separately (when you fetched content above). If that directory is /home/bogus/nutch/my.crawl, then you would set searcher.dir to /home/bogus/nutch/my.crawl.
Edit your /usr/local/nutch/nutch.xml file again
Set the docBase attribute on the Context tag to the FULL path to your Nutch 0.8.1 WAR file, e.g., /usr/local/nutch/nutch-0.8.1.war
Now, assuming that you have installed Tomcat according to the pre-requisites, and assuming that you have set the $TOMCAT_HOME environment variable (which points to your Tomcat installation directory), first shut down Tomcat:
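Once both edits are made, a quick sanity check can catch a mistyped path. The sketch below makes no assumption about the exact layout of nutch.xml: it only checks that the index path and the WAR path both appear somewhere in the file, and that the WAR path is absolute. The function name check_context_file is hypothetical.

```shell
# Hypothetical helper: check that a context file mentions the index
# directory and the WAR path, and that the WAR path is a full path.
check_context_file() {
    context_file="$1"   # e.g. /usr/local/nutch/nutch.xml
    index_dir="$2"      # e.g. /home/bogus/nutch/my.crawl
    war_path="$3"       # e.g. /usr/local/nutch/nutch-0.8.1.war
    status=0
    # -F treats the path as a fixed string, not a regular expression.
    if ! grep -qF -- "$index_dir" "$context_file"; then
        echo "index path not found in $context_file" >&2
        status=1
    fi
    if ! grep -qF -- "$war_path" "$context_file"; then
        echo "WAR path not found in $context_file" >&2
        status=1
    fi
    case "$war_path" in
        /*) ;;
        *) echo "docBase should be a full path: $war_path" >&2; status=1 ;;
    esac
    if [ "$status" -eq 0 ]; then
        echo "context file looks consistent"
    fi
    return "$status"
}
```

This only confirms that the paths are present; it does not verify that they sit in the right attribute or property.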
# cd $TOMCAT_HOME/bin
# ./shutdown.sh
Now, symlink your context.xml file for Nutch (/usr/local/nutch/nutch.xml) into the Tomcat conf directory.
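The symlink step might look like the sketch below. The exact target directory is an assumption: Tomcat conventionally reads per-application context files from conf/Catalina/localhost under its installation directory, so the sketch links there. The function name link_nutch_context is hypothetical.

```shell
# Hypothetical helper: symlink a Nutch context file into Tomcat's
# per-host context directory (assumed to be conf/Catalina/localhost).
link_nutch_context() {
    context_file="$1"   # e.g. /usr/local/nutch/nutch.xml
    tomcat_home="$2"    # e.g. $TOMCAT_HOME
    target_dir="$tomcat_home/conf/Catalina/localhost"
    mkdir -p "$target_dir" || return 1
    # -s makes a symbolic link; -f replaces a link left by an earlier run.
    ln -sf "$context_file" "$target_dir/nutch.xml" || return 1
    echo "$target_dir/nutch.xml"
}
```

With the paths used in this document, the call would be: link_nutch_context /usr/local/nutch/nutch.xml "$TOMCAT_HOME". Because Tomcat was shut down earlier, it presumably has to be started again before the webapp is reachable; startup.sh sits next to shutdown.sh in $TOMCAT_HOME/bin.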
If everything went right in the steps above, open up your browser, point it at your Tomcat installation (e.g., http://localhost:8080), and append the path “/nutch” to the end of it. So, if you installed Tomcat to run on port 8080, you would visit: http://localhost:8080/nutch/
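If your Tomcat runs on a different host or port, the address changes accordingly. As a small illustration, the hypothetical helper nutch_url below builds the address from a host and port, defaulting to the localhost and 8080 values used above.

```shell
# Hypothetical helper: build the Nutch webapp URL for a Tomcat host and port.
nutch_url() {
    host="${1:-localhost}"   # defaults to localhost
    port="${2:-8080}"        # defaults to 8080, as in the example above
    echo "http://$host:$port/nutch/"
}
```

If curl is available, running curl -I "$(nutch_url)" is one way to see whether Tomcat answers at that address without opening a browser.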
That’s it!
If you have any further questions, please feel free to contact me at the email address provided above.