As of January, 2012, this site is no longer being updated, due to work and health issues

SearchTools.com

Search Indexing Robots: Books and Articles


InfoSpiders: Adaptive Retrieval Agents Choosing Heuristic Neighborhoods for Information Discovery (ARACHNID) October, 2001
University of Iowa work on intelligent agents and adaptive spiders. Examples are provided as Java applets.
 
White Paper: The robots.txt file and the robots meta tag SearchMechanics / eBrandManagement.com, September 2000.
Practical descriptions for the webmaster on how the robots instructions are treated by search engine robots and other crawlers.
Programming Bots, Spiders and Intelligent Agents in MS Visual C++ David Pallmann, Microsoft Press, 1999
Provides context for proper use of robots on the Web, with C++ and MFC examples for various kinds of agents, including site indexing. Advanced topics include multithreading, adaptation, logging, and notification. Knowledge of network programming and Internet protocols is not required; the book relies heavily on WinInet and MSIE. Get the book from Amazon and give this site the affiliate fee.
 
Mercator: A Scalable, Extensible Web Crawler World Wide Web, volume 2 (1999), number 4 (December) by Allan Heydon and Marc Najork
Describes the design and architecture of a scalable, modular multi-server robot crawler, including filtering by type, extracting links, queuing, testing for duplicates, domain name resolution and alias host names, testing for multiple links to the same page, threading and synchronous I/O, session IDs, and more.
 
 
Robots and Spiders and Crawlers Ultraseek White Paper, September 1999
Detailed discussion of how search engine indexing robots follow links and read Web pages to store the information in search indexes. Includes coverage of problem areas such as image maps, frames, JavaScript and dynamic data. Notes describe how the Ultraseek Spider handles these problems.
 
Controlling Search Engines ZDnet devhead / Interactive Designer, January 25, 1999
Nice article about using META tags to control how search engines index a site.
 
Brace Your Site for the Onslaught of Bots ZDnet devhead, November 1, 1997 by David S. Linthicum
Information for Web site managers about site-spidering robots, including IE 4's subscription bot and robots.txt.

The official guidelines were written up in 1996 or so:


For more information about robots on the SearchTools Site:

Robots Information
Introduction to web crawling robots for search indexing and other purposes
Robot Exclusion Protocol (REP)
Information on the original protocol and the June 2008 search engine extensions
Elements of Robots.txt
Detailed description of the robots.txt directives and path options
Robots.txt Details
Practical notes on implementing robots.txt
META Robots Tag Page
Describes the META Robots tag contents and implications for search indexing robots.
Indexing Robot Checklist
A list of important items for those creating robots for search indexing.
List of Robot Source Code
Links to free and commercial source code for robot indexing spiders
List of Robot Development Consultants
Consultants who can provide services in this area.
Articles and Books on Robots and Spiders
Overview articles and technical discussions of robot crawlers, mainly for search engines.
SearchTools Robots Testing
Test cases for common robot problems

Page Updated 2008-06-13