Nutch & Lucene Installation Instructions


This document contains instructions for downloading and installing Nutch and Lucene. Please be aware that you must be logged into the csci571 computer, not aludra or nunki, to run Apache Tomcat.

  1. Downloading and Installing Nutch
  2. Downloading and Installing Lucene

Downloading and Installing Nutch
Chris A. Mattmann
mattmann@apache.org

Pre-requisites

  1. Installation of Java 1.4 or above. You can download Java from http://java.sun.com
  2. Installation of Apache Ant 1.6 or above. You can download Ant from http://ant.apache.org
  3. Installation of Apache Tomcat 5.5.19 or above. You can download Tomcat from http://tomcat.apache.org
  4. If you are using Windows, please install Cygwin. You can find Cygwin at http://www.cygwin.com/
  5. Installation of the Subversion client. You can find Subversion at http://subversion.tigris.org
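
Before moving on, it can help to confirm that the command-line tools listed above are on your PATH. The following is a minimal sketch and not part of the original instructions. The helper name check_tool is hypothetical, and it only checks that each command can be found, not that its version meets the minimums listed above. Note that inside this block, lines starting with # are shell comments rather than prompts.

```shell
# check_tool is a hypothetical helper: it reports whether a command is on PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
        return 0
    else
        echo "missing: $1"
        return 1
    fi
}

# Tools named in the pre-requisites list. Tomcat is started from its own
# bin directory rather than from PATH, so it is not checked here.
for tool in java ant svn; do
    check_tool "$tool" || true
done
```

To see which versions are installed, you can run java -version, ant -version, and svn --version.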

 

Installation Instructions

  1. Download Nutch from SVN, using the Subversion command line client:

    # svn co http://svn.apache.org/repos/asf/lucene/nutch/tags/release-0.8.1/ ./nutch

    1. This checks out Nutch into a directory called “nutch” under the directory where you ran the command. So, if you ran the command from /home/bogus, you will have a directory called /home/bogus/nutch
    2. We’ll refer to the directory where you checked out Nutch as your $NUTCH_HOME
  2. Change into the Nutch directory and compile Nutch:

    # cd nutch
    # ant
    1. You should see a message such as the following if all is well and the build ran successfully:

      compile:

      job:
      [jar] Building jar: /Users/mattmann/tmp/nutch/build/nutch-0.8.1.job

      BUILD SUCCESSFUL

      Total time: 27 seconds
  3. Now that Nutch is built, you can fetch some content. The wiki has detailed, step-by-step instructions for fetching content: http://wiki.apache.org/nutch/NutchTutorial

  4. Once you’ve fetched some content, you’ll probably want to browse it. To get Nutch set up on Tomcat, first build the Nutch webapp (run the command below from your Nutch directory):

    # ant war
  5. The above command constructs a nutch-0.8.1.war file in $NUTCH_HOME/build. It also constructs a nutch.xml file in $NUTCH_HOME/build. The nutch.xml file is a Tomcat context.xml file that you can use to configure a WAR file for deployment within Tomcat.
  6. Make a directory to hold your Nutch WAR file and your Nutch context.xml file. /usr/local/nutch is a good place.

    # mkdir /usr/local/nutch
    # cp -R $NUTCH_HOME/build/nutch-0.8.1.war /usr/local/nutch
    # cp -R $NUTCH_HOME/build/nutch.xml /usr/local/nutch
  7. Next, edit your /usr/local/nutch/nutch.xml file.

    Inside the file, set the property searcher.dir to the path of the Nutch index that you created separately (in step 3 above). If that directory is /home/bogus/nutch/my.crawl, then you would set searcher.dir to /home/bogus/nutch/my.crawl.
  8. Edit your /usr/local/nutch/nutch.xml file again.

    Set the docBase attribute on the Context tag to the FULL path to your Nutch 0.8.1 WAR file, e.g., /usr/local/nutch/nutch-0.8.1.war
  9. Now, assuming that you have installed Tomcat according to the pre-requisites and have set the $TOMCAT_HOME environment variable (which points to your Tomcat installation directory), first shut down Tomcat:

    # cd $TOMCAT_HOME/bin
    # ./shutdown.sh

    Now, symlink your context.xml file for Nutch into the Tomcat conf directory:

    # ln -s /usr/local/nutch/nutch.xml $TOMCAT_HOME/conf/Catalina/localhost/nutch.xml

    Now, restart your Tomcat server:

    # cd $TOMCAT_HOME/bin
    # ./startup.sh
  10. If everything went right in step 9, open your browser, point it at your Tomcat installation (e.g., http://localhost:8080), and append the path “/nutch” to the end of the URL. So, if you installed Tomcat to run on port 8080, you would visit: http://localhost:8080/nutch/
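
If the page does not come up, one thing worth checking is whether the files from the mkdir, cp, and ln -s commands above are where Tomcat expects them. The sketch below is an illustration only and is not part of the original instructions. The function name check_nutch_deploy and its two arguments are hypothetical; the file names are the ones used in the steps above. Inside this block, lines starting with # are shell comments rather than prompts.

```shell
# Hypothetical helper: verify the files put in place by the steps above.
#   $1 = directory holding the WAR and nutch.xml (e.g. /usr/local/nutch)
#   $2 = Tomcat installation directory (e.g. the value of $TOMCAT_HOME)
check_nutch_deploy() {
    deploy_dir=$1
    tomcat_home=$2
    status=0

    if [ ! -f "$deploy_dir/nutch-0.8.1.war" ]; then
        echo "missing WAR file"
        status=1
    fi
    if [ ! -f "$deploy_dir/nutch.xml" ]; then
        echo "missing nutch.xml"
        status=1
    fi
    # -L tests for a symbolic link, such as the one created with ln -s
    if [ ! -L "$tomcat_home/conf/Catalina/localhost/nutch.xml" ]; then
        echo "missing context symlink"
        status=1
    fi

    if [ "$status" -eq 0 ]; then
        echo "deployment files look OK"
    fi
    return "$status"
}
```

With the paths used in this document, you would call it as check_nutch_deploy /usr/local/nutch "$TOMCAT_HOME". This only confirms that the files exist; it does not check the contents of nutch.xml or whether Tomcat is running.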

That’s it!

If you have any further questions, please feel free to contact me at the email address provided above.