<h1>Introduction</h1>
<p class="MsoNormal">The Bidder shall develop a web crawling and reporting tool using Nutch (http://lucene.apache.org/nutch/) distributed over a series of Linux workstations. The tool shall index and search the World Wide Web to identify sites that contain certain keywords and URLs (matched via regular expressions). These sites shall be cross-referenced with information from a database and a public web service to join in additional site information such as traffic rank. The database shall be provided by the Requester. The tool shall generate a report displaying a categorized list of all sites containing the specified keywords/URLs, sorted by traffic rank or other attributes assigned to the identified sites. The keywords and URLs to be displayed in the report shall be entered into a configuration file using regular expressions.</p>
<p class="MsoNormal">The tool shall run continuously, producing a new report daily based on the information most recently generated by the tool.</p>
<p class="MsoNormal">The report shall be of the following example structure:</p>
<pre>
--------------------------------------------------------------------
Site                Rank    Other Attribute #1    Other Attribute #2
--------------------------------------------------------------------
“auto|car manufacturers”
www.site1.com        500                   103                    +7
www.site2.com        400                   214                    +9
www.site3.com        300                   120                   +77
www.site4.com        200                   121                   +13
www.site5.com        100                   210                    -8
“fast cars”
www.site4.com        700                    20                    +4
www.site7.com        600                    53                   -10
www.site8.com        500                    11                    -8
www.site9.com        400                     4                    +5
www.site10.com       300                    52                    +1
--------------------------------------------------------------------
Total                998                   908                    90
--------------------------------------------------------------------
</pre>
<p class="MsoFootnoteText"><span class="MsoSubtleEmphasis"><span style="font-family:Calibri;">Source:</span></span> Fictitious data, for illustration purposes only</p>
<p class="MsoNormal" style="text-align:center;">Figure 1 – Example Report</p>
<p class="MsoNormal">The Bidder shall work at least four hours during U.S. Eastern Standard Time to facilitate communication with the Requester.</p>
<p class="MsoNormal">The system shall be delivered in four phases, described below:</p>
<h1>Phase 1 – Proof of Concept</h1>
<p class="MsoNormal">The Phase 1 system shall operate on a single Linux workstation and shall be limited to crawling and searching for specified keyword sets. The report generated shall resemble that shown in Figure 1, except that sites shall be ordered by a rank based on link counts rather than by attributes selected from the database. Phase 1 shall be designed to be scalable to multiple servers, per Phase 3 below.</p>
<h1>Phase 2 – Incorporate Site Attributes from Database and Web Service</h1>
<p class="MsoNormal">In Phase 2, the attributes from the provided MySQL database shall be incorporated into the report, along with attributes from Amazon’s Alexa web service. The account to access the Alexa web service shall be provided by the Requester. Attributes shall include Alexa rank and Nielsen volume estimate.</p>
<h1>Phase 3 – Distributed Search and Algorithmic Tests</h1>
<p class="MsoNormal">In Phase 3, the system shall be distributed across multiple servers to support more extensive search and reporting capabilities. Phase 3 also includes certain tests on pages that match the search criteria, for example, testing whether certain URLs and scripts embedded in the page match
algorithmic criteria. These tests shall be specified as configuration parameters, for example, by using a scripting language such as Groovy, Jython, or similar (to be specified by the Requester). The test results (e.g., pass/fail) shall be displayed in the report.</p>
<h1>Phase 4 – Site Characterization</h1>
<p class="MsoNormal">In Phase 4, Site Characterization Information (SCI) shall be included for each site. Initially, SCI shall be the top twenty index words (based on word frequency in the site), not including words specified in an index exclusion configuration file. The index exclusion configuration file shall be a line-oriented text file, where each line is a regular expression.</p>
<h1>Feature Summary (End of Phase 4)</h1>
<p class="MsoListParagraphCxSpFirst" style="text-indent:-.25in;"><span style="font-family:Symbol;"><span>·<span style="font-family:'Times New Roman';font-style:normal;font-variant:normal;font-weight:normal;font-size:7pt;line-height:normal;"> </span></span></span>Crawls the WWW and captures sites containing keywords/URLs specified using regular expressions in a text-based configuration file.</p>
<p class="MsoListParagraphCxSpMiddle" style="text-indent:-.25in;"><span style="font-family:Symbol;"><span>·<span style="font-family:'Times New Roman';font-style:normal;font-variant:normal;font-weight:normal;font-size:7pt;line-height:normal;"> </span></span></span>Adds Site Characterization Information, i.e., metadata, for each site matching the keyword criteria. Initially, the SCI shall be the top twenty index words (by frequency) found in the site, excluding words or phrases that match regular expressions in a configurable exclusion file.</p>
<p class="MsoListParagraphCxSpMiddle" style="text-indent:-.25in;"><span style="font-family:Symbol;"><span>·<span style="font-family:'Times New Roman';font-style:normal;font-variant:normal;font-weight:normal;font-size:7pt;line-height:normal;">
</span></span></span>Runs continuously, generating a daily report based on the most recent information captured by the crawler. The system and its reporting shall run continuously and reliably, without requiring any manual maintenance activities on the part of the Requester.</p>
<p class="MsoListParagraphCxSpMiddle" style="text-indent:-.25in;"><span style="font-family:Symbol;"><span>·<span style="font-family:'Times New Roman';font-style:normal;font-variant:normal;font-weight:normal;font-size:7pt;line-height:normal;"> </span></span></span>Adds to the site report information captured via:</p>
<p class="MsoListParagraphCxSpMiddle" style="margin-left:1in;text-indent:-.25in;"><span style="font-family:'Courier New';"><span>o<span style="font-family:'Times New Roman';font-style:normal;font-variant:normal;font-weight:normal;font-size:7pt;line-height:normal;"> </span></span></span>MySQL database provided by the Requester</p>
<p class="MsoListParagraphCxSpMiddle" style="margin-left:1in;text-indent:-.25in;"><span style="font-family:'Courier New';"><span>o<span style="font-family:'Times New Roman';font-style:normal;font-variant:normal;font-weight:normal;font-size:7pt;line-height:normal;"> </span></span></span>Amazon Alexa web service (the Bidder shall implement the interface)</p>
<p class="MsoListParagraphCxSpMiddle" style="margin-left:1in;text-indent:-.25in;"><span style="font-family:'Courier New';"><span>o<span style="font-family:'Times New Roman';font-style:normal;font-variant:normal;font-weight:normal;font-size:7pt;line-height:normal;"> </span></span></span>Algorithmic tests running against pages in the sites that match the search criteria. The algorithmic tests shall be specified using a scripting language in a configuration file.</p>
<p class="MsoListParagraphCxSpMiddle" style="text-indent:-.25in;"><span style="font-family:Symbol;"><span>·<span style="font-family:'Times New
Roman';font-style:normal;font-variant:normal;font-weight:normal;font-size:7pt;line-height:normal;"> </span></span></span>Generates reports similar in format to that shown in Figure 1. The report shall be in ASCII text format.</p>
<p class="MsoListParagraphCxSpLast" style="text-indent:-.25in;"><span style="font-family:Symbol;"><span>·<span style="font-family:'Times New Roman';font-style:normal;font-variant:normal;font-weight:normal;font-size:7pt;line-height:normal;"> </span></span></span>Supplementary tools to aid in maintenance and monitoring of the system shall be provided by the Bidder. However, the system shall operate reliably 24x7 without requiring manual maintenance activities.</p>
<h1>Deliverables</h1>
<p class="MsoListParagraphCxSpFirst" style="text-indent:-.25in;"><span><span>1.<span style="font-family:'Times New Roman';font-style:normal;font-variant:normal;font-weight:normal;font-size:7pt;line-height:normal;"> </span></span></span>Source code, libraries, and scripts needed to run and administer the application.</p>
<p class="MsoListParagraphCxSpMiddle" style="text-indent:-.25in;"><span><span>2.<span style="font-family:'Times New Roman';font-style:normal;font-variant:normal;font-weight:normal;font-size:7pt;line-height:normal;"> </span></span></span>Demonstration of the system running continuously, generating reports as specified.</p>
<p class="MsoListParagraphCxSpLast" style="text-indent:-.25in;"><span><span>3.<span style="font-family:'Times New Roman';font-style:normal;font-variant:normal;font-weight:normal;font-size:7pt;line-height:normal;"> </span></span></span>Documentation describing the use and maintenance of the system, including examples of the reports generated.</p>
<h1>Development Environment</h1>
<p class="MsoNormal">The development environment is:</p>
<p class="MsoListParagraphCxSpFirst" style="text-indent:-.25in;"><span style="font-family:Symbol;"><span>·<span style="font-family:'Times New
Roman';font-style:normal;font-variant:normal;font-weight:normal;font-size:7pt;line-height:normal;"> </span></span></span>Linux OS, multiple workstations, accessed via SSH</p>
<p class="MsoListParagraphCxSpMiddle" style="text-indent:-.25in;"><span style="font-family:Symbol;"><span>·<span style="font-family:'Times New Roman';font-style:normal;font-variant:normal;font-weight:normal;font-size:7pt;line-height:normal;"> </span></span></span>Java language (Nutch is written in Java)</p>
<p class="MsoListParagraphCxSpMiddle" style="text-indent:-.25in;"><span style="font-family:Symbol;"><span>·<span style="font-family:'Times New Roman';font-style:normal;font-variant:normal;font-weight:normal;font-size:7pt;line-height:normal;"> </span></span></span>Scripting via Bourne shell and Perl</p>
<p class="MsoListParagraphCxSpMiddle" style="text-indent:-.25in;"><span style="font-family:Symbol;"><span>·<span style="font-family:'Times New Roman';font-style:normal;font-variant:normal;font-weight:normal;font-size:7pt;line-height:normal;"> </span></span></span>MySQL database, loaded via CSV files</p>
<p class="MsoListParagraphCxSpLast" style="text-indent:-.25in;"><span style="font-family:Symbol;"><span>·<span style="font-family:'Times New Roman';font-style:normal;font-variant:normal;font-weight:normal;font-size:7pt;line-height:normal;"> </span></span></span>Scripting language (Groovy, Jython, or similar)</p>
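<p class="MsoNormal">As an illustration of the Phase 4 Site Characterization Information requirement (top index words by frequency, minus words matched by a line-oriented file of regular expressions), the following is a minimal Python sketch, not a definitive implementation. The function names load_exclusion_patterns and top_index_words are hypothetical, and several details are assumptions not stated in this specification: words are lowercase runs of letters, digits, and apostrophes; each exclusion expression must match a whole word, case-insensitively; blank lines in the exclusion file are skipped; and ties in frequency are broken alphabetically so that daily reports are reproducible.</p>

```python
import re
from collections import Counter


def load_exclusion_patterns(lines):
    """Compile one regular expression per non-blank line.

    Assumption: blank lines are ignored and matching is case-insensitive.
    """
    return [re.compile(line.strip(), re.IGNORECASE)
            for line in lines if line.strip()]


def top_index_words(text, patterns, limit=20):
    """Return up to `limit` (word, count) pairs, most frequent first.

    Assumption: a word is excluded when any pattern matches the whole word.
    """
    # Assumed tokenization: lowercase runs of letters, digits, apostrophes.
    words = re.findall(r"[a-z0-9']+", text.lower())
    kept = [w for w in words
            if not any(p.fullmatch(w) for p in patterns)]
    counts = Counter(kept)
    # Sort by descending count, then alphabetically for a stable order.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:limit]


if __name__ == "__main__":
    # Hypothetical exclusion file contents, one expression per line.
    exclusions = load_exclusion_patterns(["the|a|an|and", r"\d+", ""])
    page_text = "The fast car and the fast truck: 2 cars, 1 truck, and a van"
    for word, count in top_index_words(page_text, exclusions, limit=3):
        print(word, count)
```

<p class="MsoNormal">In a real deployment the exclusion lines would be read from the index exclusion configuration file and the text would come from the crawled pages of a site. Whether an expression should match a whole word or any part of a word is a design choice for the Requester to confirm.</p>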