This HOW-TO consists of the following:
Prerequisites

I'll assume that you have an Ubuntu 10.04 server installed and that you are working as root:

    sudo su -

Installing Solr

Luckily, Solr 1.4 is available through APT:

    apt-get install solr-common solr-tomcat tomcat6

Next, set up the Tomcat manager, which will be useful later:

    sudo apt-get install tomcat6-admin

Edit /var/lib/tomcat6/conf/tomcat-users.xml and change this:

    <tomcat-users>
    <!--
      <role rolename="tomcat"/>
      <role rolename="role1"/>
      <user username="tomcat" password="tomcat" roles="tomcat"/>
      <user username="both" password="tomcat" roles="tomcat,role1"/>
      <user username="role1" password="tomcat" roles="role1"/>
    -->
    </tomcat-users>

To this (the comment markers are removed, because roles and users left inside a comment are ignored and the login below would fail):

    <tomcat-users>
      <role rolename="tomcat"/>
      <role rolename="role1"/>
      <role rolename="manager"/>
      <user username="tomcat" password="tomcat" roles="tomcat,manager"/>
      <user username="both" password="tomcat" roles="tomcat,role1"/>
      <user username="role1" password="tomcat" roles="role1"/>
    </tomcat-users>

Now, restart Tomcat:

    sudo service tomcat6 restart

You can access the Tomcat manager at http://localhost:8080/manager/html

    Username: tomcat
    Password: tomcat

Installing Nutch

Download Nutch to a working directory, then unpack it under /usr/share:

    cd /tmp
    wget http://mirrorservice.nomedia.no/apache.org/nutch/apache-nutch-1.1-bin.tar.gz
    cd /usr/share
    tar zxf /tmp/apache-nutch-1.1-bin.tar.gz
    ln -s apache-nutch-1.1-bin nutch

Configuring Solr

For the sake of simplicity we are going to use the example configuration of Solr as a base.
Back up the original file:

    mv /etc/solr/conf/schema.xml /etc/solr/conf/schema.xml.orig

And replace the Solr schema with the one provided by Nutch:

    cp /usr/share/nutch/conf/schema.xml /etc/solr/conf/schema.xml

Now we need to configure Solr to create snippets for search results. Snippets are built from stored field values, so the content field must be stored. Edit /etc/solr/conf/schema.xml and change the following line:

    <field name="content" type="text" stored="false" indexed="true"/>

To this:

    <field name="content" type="text" stored="true" indexed="true"/>

Create a new dismax request handler to enable relevancy tweaks. Back up the original file:

    cp /etc/solr/conf/solrconfig.xml /etc/solr/conf/solrconfig.xml.orig

Add the following fragment to _/etc/solr/conf/solrconfig.xml_ (the "<" characters in the mm value must be written as &lt; to keep the XML well-formed):

    <requestHandler name="/nutch" class="solr.SearchHandler" >
      <lst name="defaults">
        <str name="defType">dismax</str>
        <str name="echoParams">explicit</str>
        <str name="tie">0.01</str>
        <str name="qf">
          content^0.5 anchor^1.0 title^1.2
        </str>
        <str name="pf">
          content^0.5 anchor^1.5 title^1.2 site^1.5
        </str>
        <str name="fl">
          url
        </str>
        <str name="mm">
          2&lt;-1 5&lt;-2 6&lt;90%
        </str>
        <str name="ps">100</str>
        <str name="hl">true</str>
        <str name="q.alt">*:*</str>
        <str name="hl.fl">title url content</str>
        <str name="f.title.hl.fragsize">0</str>
        <str name="f.title.hl.alternateField">title</str>
        <str name="f.url.hl.fragsize">0</str>
        <str name="f.url.hl.alternateField">url</str>
        <str name="f.content.hl.fragmenter">regex</str>
      </lst>
    </requestHandler>

Now, restart Tomcat:

    service tomcat6 restart

Configuring Nutch

Go into the Nutch directory and do all the work from there:

    cd /usr/share/nutch

Edit conf/nutch-site.xml and add the following between the <configuration> and </configuration> tags:

    <property>
      <name>http.robots.agents</name>
      <value>nutch-solr-integration-test,*</value>
      <description></description>
    </property>
    <property>
      <name>http.agent.name</name>
      <value>nutch-solr-integration-test</value>
      <description>Viterbi Bot</description>
    </property>
    <property>
      <name>http.agent.description</name>
      <value>Viterbi Web Crawler using Nutch 1.0</value>
      <description></description>
    </property>
    <property>
      <name>http.agent.url</name>
      <value>http://viterbi.usc.edu/</value>
      <description></description>
    </property>
    <property>
      <name>http.agent.email</name>
      <value>YOUR EMAIL ADDRESS HERE</value>
      <description></description>
    </property>
    <property>
      <name>http.agent.version</name>
      <value></value>
      <description></description>
    </property>
    <property>
      <name>generate.max.per.host</name>
      <value>100</value>
    </property>

You need to ensure that the crawler does not leave your domain; otherwise you would end up crawling the entire Internet. Insert your domain into _conf/regex-urlfilter.txt_:

    # allow urls in viterbi.usc.edu domain
    +^http://([a-z0-9\-A-Z]*\.)*viterbi.usc.edu/([a-z0-9\-A-Z]*\/)*
    # deny anything else
    -.

**Important:** In the same file, make sure that you change this:

    # accept anything else
    +.

To this:

    # accept anything else
    #+.

Now we need to tell the crawler where to start crawling, so create a seed list:

    mkdir urls
    echo "http://viterbi.usc.edu/" > urls/seed.txt

**Important:** You can add multiple seed URLs here, one per line. If you do, make the corresponding changes in regex-urlfilter.txt as discussed above.

Crawling your site

Let's start crawling! Start by injecting the seed URL(s) into the Nutch crawldb:

    bin/nutch inject crawl/crawldb urls

Next, generate the fetch list:

    bin/nutch generate crawl/crawldb crawl/segments

The above command generated a new segment directory under /usr/share/nutch/crawl/segments that contains the URLs to be fetched. All following commands require the latest segment directory as their main parameter, so we’ll store it in an environment variable:

    export SEGMENT=crawl/segments/`ls -tr crawl/segments|tail -1`

Launch the crawler!
    bin/nutch fetch $SEGMENT -noParsing

And parse the fetched content:

    bin/nutch parse $SEGMENT

Now we need to update the crawl database to ensure that for all future crawls, Nutch only checks the already crawled pages, and only fetches new and changed pages:

    bin/nutch updatedb crawl/crawldb $SEGMENT -filter -normalize

Create a link database:

    bin/nutch invertlinks crawl/linkdb -dir crawl/segments

**Important:** The more times you repeat the crawling steps above (from generating the fetch list through updating the crawl database), the deeper the crawl goes.

Indexing our crawl DB with Solr

    bin/nutch solrindex http://127.0.0.1:8080/solr/ crawl/crawldb crawl/linkdb crawl/segments/*

Search the crawled content in Solr

The indexed content is now available through Solr. You can run searches from the Solr admin UI at http://127.0.0.1:8080/solr/admin or directly with a URL like:

    http://127.0.0.1:8080/solr/select/?q=usc&version=2.2&start=0&rows=10&indent=on&wt=json
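
The note under "Crawling your site" about repeating the crawl steps can be sketched as a small shell helper. This is only a sketch under stated assumptions: the function names (latest_segment, crawl_round, crawl_rounds), the NUTCH variable, and the default of 3 rounds are hypothetical and not part of Nutch. It also picks the newest segment by name rather than by modification time as the `ls -tr` command above does, which assumes that Nutch names segment directories by creation timestamp.

```shell
#!/bin/sh
# Sketch: repeat the generate / fetch / parse / updatedb cycle.
# All names here are illustrative assumptions, not part of Nutch.

# Path to the nutch launcher; set NUTCH=echo for a dry run that
# prints the commands instead of executing them.
NUTCH="${NUTCH:-bin/nutch}"

# Print the newest segment directory under $1.
# Assumes timestamp-style segment names, so the last name in
# sorted order is the most recent segment.
latest_segment() (
  segments_dir="$1"
  newest=$(ls "$segments_dir" | sort | tail -n 1)
  [ -n "$newest" ] || return 1
  printf '%s/%s\n' "$segments_dir" "$newest"
)

# Run one crawl round against the crawl directory $1 (e.g. "crawl").
# Stops at the first step that fails.
crawl_round() (
  crawl_dir="$1"
  "$NUTCH" generate "$crawl_dir/crawldb" "$crawl_dir/segments" || return 1
  segment=$(latest_segment "$crawl_dir/segments") || return 1
  "$NUTCH" fetch "$segment" -noParsing || return 1
  "$NUTCH" parse "$segment" || return 1
  "$NUTCH" updatedb "$crawl_dir/crawldb" "$segment" -filter -normalize
)

# Run $2 rounds (default 3) against the crawl directory $1.
crawl_rounds() (
  crawl_dir="$1"
  rounds="${2:-3}"
  i=1
  while [ "$i" -le "$rounds" ]; do
    echo "== crawl round $i of $rounds =="
    crawl_round "$crawl_dir" || return 1
    i=$((i + 1))
  done
)
```

After injecting the seed URLs, the functions could be loaded into a shell started in /usr/share/nutch and called as `crawl_rounds crawl 3`. The invertlinks and solrindex steps are left out on purpose, since they only need to run once after the last round.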