Nutch is a Java-based web search engine. While it can run on clusters of hundreds of machines, it can also run on a single host and serve search results via a few JSP pages that ship with Nutch.
Crawling is done with something like <code>./bin/nutch crawl starturls.txt -dir crawl -depth 2 -topN 30000</code>, and the HTML interface is deployed by dropping <code>nutch-1.0.war</code> into your favorite servlet container (I use Jetty).
Your task is to build a single JSP page that displays statistics about the current search index. For that you need to use the Lucene API. Studying the source code of the tool "Luke" will probably show you exactly how to query the index (see http://www.getopt.org/luke/#).
The page should display:
* number of documents
* number of terms
* index last-modified date, in RFC 3339 format (http://www.faqs.org/rfcs/rfc3339.html)
* any statistics you can get on the crawldb; http://is.gd/4Q7Jp, http://issues.apache.org/jira/browse/NUTCH-558 and http://is.gd/4Q7Ny might provide pointers
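For the last-modified requirement, here is a minimal sketch of the RFC 3339 formatting step. It assumes the timestamp is already available as milliseconds since the epoch; how it is read from the index is left out. The class name `Rfc3339` and the method `format` are hypothetical helpers, not part of Nutch or Lucene, and always emitting UTC with a `Z` suffix is an assumption rather than a stated requirement.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical helper (not part of Nutch or Lucene): formats a timestamp
// given in milliseconds since the epoch as an RFC 3339 date-time in UTC.
final class Rfc3339 {
    private Rfc3339() {}

    static String format(long epochMillis) {
        // SimpleDateFormat is not thread-safe, so a new instance is created
        // per call instead of sharing one across servlet request threads.
        SimpleDateFormat fmt =
            new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'", Locale.US);
        // The literal 'Z' in the pattern is only correct if the time zone is UTC.
        fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
        return fmt.format(new Date(epochMillis));
    }
}
```

In the JSP, the result of `Rfc3339.format(...)` could be printed directly next to the document and term counts. It uses only `java.text` and `java.util`, so it should work on the older JVMs a Nutch 1.0 deployment is likely to run on.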
We will use this page to monitor whether the Nutch instance is "healthy", i.e. still adding pages and so on. Nutch runs on an intranet, spidering about two dozen hosts.