As of January 2012, this site is no longer being updated, due to work and health issues.

SearchTools.com

About Search Indexing Robots and Spiders


Many search engines use programs called robots to locate web pages for indexing. These programs are not limited to a pre-defined list of web pages; instead, they follow links on the pages they find, which makes them a form of intelligent agent. The process of following links is called spidering, wandering, or gathering. Once the robot has retrieved a page or document, the parsing and indexing of the page begins.

Controlling Robot Indexing
Robot spiders cannot index unlinked files, so they will ignore all the miscellaneous files you may have in your web server directory. Web publishers can control which directories the robots should index by editing the robots.txt file, and web page creators can control robot indexing behavior using the Robots META tag.
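
For example, a minimal robots.txt (a sketch only; the directory names here are placeholders) blocks all robots from two directories, while a Robots META tag in an individual page asks robots not to index it or follow its links:

    # robots.txt -- must live at the root of the web server
    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /private/

    <!-- in the <head> of an individual HTML page -->
    <meta name="robots" content="noindex, nofollow">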

In June 2008, Yahoo, Google, and Microsoft Live Search (MSN) agreed on a common set of Robots Exclusion Protocol features. For more details, see our REP report.
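
As an illustrative sketch, the 2008 extensions include an Allow directive, * and $ pattern matching, and a Sitemap directive; the paths below are examples, not recommendations:

    User-agent: *
    Allow: /articles/
    Disallow: /*?          # block any URL containing a query string
    Disallow: /*.pdf$      # block URLs ending in .pdf
    Sitemap: http://www.example.com/sitemap.xml
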
Following Links
Local search robot spider indexers locate files to index by following links, just like webwide search engine spiders. You can specify the starting page, and these indexers will request it from the server and receive it just like a browser. The indexer will store every word on the page, then follow each link on that page, indexing the linked pages and following each link from those pages.
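
A minimal link-following indexer can be sketched in Python using only the standard library (the starting URL is a placeholder, and index_page is a stub; a production robot would also honor robots.txt and pause between requests):

    import urllib.request
    from html.parser import HTMLParser
    from urllib.parse import urljoin, urldefrag

    class LinkParser(HTMLParser):
        """Collects the href of every <a> tag on a page."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def index_page(url, html):
        # stub: a real indexer would parse the page and store every word
        print("indexed", url, len(html), "bytes")

    def crawl(start_url, max_pages=100):
        seen, queue = set(), [start_url]
        while queue and len(seen) < max_pages:
            url = queue.pop(0)
            if url in seen:
                continue
            seen.add(url)
            try:
                with urllib.request.urlopen(url) as response:
                    html = response.read().decode("utf-8", errors="replace")
            except OSError:
                continue                    # unreachable page: skip it
            index_page(url, html)
            parser = LinkParser()
            parser.feed(html)
            for link in parser.links:
                absolute, _fragment = urldefrag(urljoin(url, link))
                if absolute.startswith("http"):
                    queue.append(absolute)  # follow each link in turn

    crawl("http://www.example.com/")        # placeholder starting page
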
Link Problems
Spiders will miss pages that have been accidentally unlinked from any of your starting points, and they will have problems with JavaScript links, just like webwide search engine robots.
Dynamic Elements
Robot spider indexers receive each page exactly as a browser would, with all dynamic data from CGIs, SSI (server-side includes), ASP (Active Server Pages), and so on. This is vital for some sites, but others may find that these dynamic elements trigger the re-indexing process even though none of the actual text of the page has changed.
Most site search and webwide search engines can handle dynamic URLs (including question marks and other punctuation). However, some will not follow these links; for help building plain URLs, see our page on Generating Simple URLs.
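
One common approach is to rewrite plain paths onto the underlying dynamic URL in the web server configuration. This sketch uses Apache's mod_rewrite; the path and script names are assumptions for illustration, not taken from the page referenced above:

    # in a .htaccess file: serve /products/42 from products.cgi?id=42
    RewriteEngine On
    RewriteRule ^products/([0-9]+)$ /cgi-bin/products.cgi?id=$1 [L]
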
Server Load
Because they use HTTP, robot spider indexers can be slower than local file indexers, and they can put more pressure on your web server, since they must request each page individually. Some older web servers may crash during this process, either from the sheer number of requests or because the crawl uncovers file corruption.
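
Many robots reduce this pressure by pausing between requests. A minimal Python sketch (the one-second delay is an arbitrary example):

    import time
    import urllib.request

    def polite_fetch(urls, delay=1.0):
        """Fetch each URL in turn, pausing between requests so the
        web server never sees a burst of traffic from the robot."""
        pages = {}
        for url in urls:
            try:
                with urllib.request.urlopen(url) as response:
                    pages[url] = response.read()
            except OSError:
                pages[url] = None    # record the failure, keep crawling
            time.sleep(delay)        # politeness pause between requests
        return pages
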
Updating Indexes
To update the index, some robot spiders query the web server about the status of each linked page by requesting just the HTTP header with a "HEAD" request (the usual request for an HTML page is a "GET"). For HEAD requests, the server may be able to send the header information from an internal cache, without opening and reading the entire file, so the interaction can be much more efficient. The indexer then compares the modified date in the header with its own record of when the index was last updated. If the page has not changed, the index does not need updating; if it has changed, or if it is new and has not yet been indexed, the robot spider sends a GET request for the entire page and stores every word.

An alternate solution is for robot spiders to send an "If-Modified-Since" request carrying the file date they previously stored. This HTTP/1.1 header option allows the web server to send back just a status code if the page has not changed, and the entire page if it has.
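
Both update strategies can be sketched in Python with the standard library (the URL and the stored date are placeholders; a real robot would parse and compare the dates rather than the raw strings):

    import urllib.error
    import urllib.request

    url = "http://www.example.com/page.html"        # placeholder URL
    last_indexed = "Fri, 13 Jun 2008 00:00:00 GMT"  # date stored by the indexer

    # Strategy 1: HEAD request, then compare the Last-Modified header ourselves
    head = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(head) as response:
        modified = response.headers.get("Last-Modified")
    if modified != last_indexed:
        with urllib.request.urlopen(url) as response:  # full GET only if changed
            html = response.read()

    # Strategy 2: conditional GET -- the server answers 304 Not Modified
    # (no body) if the page is unchanged, or 200 with the page if it changed
    conditional = urllib.request.Request(
        url, headers={"If-Modified-Since": last_indexed})
    try:
        with urllib.request.urlopen(conditional) as response:
            html = response.read()      # 200: page changed, reindex it
    except urllib.error.HTTPError as err:
        if err.code != 304:
            raise                       # 304 simply means unchanged
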
Duplicate Files
Robots should contain special code to check for duplicate pages, which arise from server mirroring, alternate default page names, mistakes in relative file naming (./ instead of ../, for example), and so on. Some search indexers have powerful algorithms to identify these duplicates and store and search only one copy.
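
A simple way to catch exact duplicates is to hash each page's content and remember the first URL stored under each hash. A minimal Python sketch (exact-match hashing only; the more powerful algorithms mentioned above can also catch near-duplicates):

    import hashlib

    seen_hashes = {}   # content hash -> first URL stored with that content

    def is_duplicate(url, html):
        """Return True if an identical page has already been indexed."""
        digest = hashlib.sha1(html.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            return True          # same bytes already stored under another URL
        seen_hashes[digest] = url
        return False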

For more information about robots on the SearchTools Site:

Robots Information
Introduction to web crawling robots for search indexing and other purposes.
Robot Exclusion Protocol (REP)
Information on the original protocol and the June 2008 search engine extensions.
Elements of Robots.txt
Detailed description of the robots.txt directives and path options.
Robots.txt Details
Practical notes on implementing robots.txt.
META Robots Tag Page
Describes the META Robots tag contents and implications for search indexing robots.
Indexing Robot Checklist
A list of important items for those creating robots for search indexing.
List of Robot Source Code
Links to free and commercial source code for robot indexing spiders.
List of Robot Development Consultants
Consultants who can provide services in this area.
Articles and Books on Robots and Spiders
Overview articles and technical discussions of robot crawlers, mainly for search engines.
SearchTools Robots Testing
Test cases for common robot problems.

Robots Mailing List - for writers of web robots

To subscribe, send a message to listserv@mccmedia.com with the words subscribe robots (your name) in the message body.
For mailing list help, see the Listserv help message.
To view earlier messages, see the Archive of discussions, 1995-1997
Last Modified: 2008-06-13