As of January 2012, this site is no longer being updated, due to work and health issues.

SearchTools.com

The Robots Exclusion Protocol (REP)


The robots protocol is an agreement between the people who write robots (mainly for search engines) and the people who publish web sites, giving site owners a way to communicate with and control those robots.

The protocol is voluntary, and it works because most sites want to be indexed and searched. However, some robots ignore the directives, so any genuinely private content should be protected by authentication and access control (a user name and password).


The Robots Exclusion Protocol, developed 1994-1997

This describes the robots.txt file, with Disallow directives indicating which directories do not welcome robots. It also describes the robots META tag, which indicates whether a page should be indexed and/or whether the links on the page should be followed.
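
For example, a minimal robots.txt file and a robots META tag might look like this (the directory names are just placeholders):

    # robots.txt -- placed at the root of the site, e.g. http://www.example.com/robots.txt
    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /private/

    <!-- in the <head> of an HTML page: do not index this page, do not follow its links -->
    <meta name="robots" content="noindex, nofollow">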


Robots Exclusion Protocol, June 2008 Agreement

In June 2008, Yahoo, Google, and MSN Live Search all announced that they would support common extensions to the REP, based on their individual experiences with deviations from the protocol.

(Why these are all blog announcements rather than actual pages is beyond me. -- Editor)

Robot IP authentication

This new feature is helpful when another robot is masquerading as a search indexing robot while sending multiple requests per second, ignoring the directives in robots.txt, or otherwise causing site problems. If the robot's IP address is not within a search engine's corporate IP blocks, it may be a rogue, and it can be blocked without fear of losing the site's place in the search engine index. Yahoo has a nice step-by-step explanation of reverse DNS authentication.
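
As a rough sketch, the reverse DNS check works like this (Python; the hostname suffixes below are examples of what the engines were using at the time, and should be verified against each engine's own documentation):

    import socket

    # Example crawler hostname suffixes circa 2008 -- verify against each
    # engine's own documentation before relying on them.
    CRAWLER_SUFFIXES = (".googlebot.com", ".crawl.yahoo.net", ".search.msn.com")

    def is_authentic_crawler(ip):
        """Return True if the requesting IP passes the reverse/forward DNS check."""
        try:
            # Step 1: reverse DNS lookup on the requesting IP address.
            hostname = socket.gethostbyaddr(ip)[0]
        except (socket.herror, socket.gaierror):
            return False
        # Step 2: the hostname must be inside the search engine's own domain.
        if not hostname.endswith(CRAWLER_SUFFIXES):
            return False
        try:
            # Step 3: a forward lookup on that hostname must lead back to the same IP.
            return ip in socket.gethostbyname_ex(hostname)[2]
        except (socket.herror, socket.gaierror):
            return False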

Instructions in robots.txt - all directives to the robot crawler
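
A sketch of a robots.txt file using the commonly supported extended directives might look like this (example.com and the paths are placeholders; check each engine's documentation for exactly which extensions it honors):

    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /*.pdf$        # * and $ wildcard pattern matching
    Allow: /cgi-bin/public/  # Allow overrides a broader Disallow
    Sitemap: http://www.example.com/sitemap.xml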

X-Robots-Tag for non-HTML documents

The problem with META Robots tags has been that they can only appear in HTML documents, not plain text, PDF, office documents, audio, or video files.

With this new system, a web server can send the X-Robots-Tag in the HTTP header, which is a standard but invisible part of any response to a browser or robot request. This is not something anyone types into a page by hand; it is part of the automated server response. But that also means that webmasters can program this tag into the HTTP header sent for any page, image, video, or other document.
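
For example, the headers of a server response for a PDF file might include the tag like this (the date and file type here are arbitrary):

    HTTP/1.1 200 OK
    Date: Thu, 03 Jul 2008 12:00:00 GMT
    Content-Type: application/pdf
    X-Robots-Tag: noindex, nofollow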

Apache, PHP, and other web servers and scripting environments can easily be configured to send such an HTTP header. The syntax is almost exactly the same as the META robots tag: NOINDEX, NOFOLLOW, etc. Apparently, the consensus is that the header tag should have priority over the META tags, although I'm not sure this is a good idea. We will be testing how search engine robots implement these priorities in the real world.
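
For instance, on Apache with mod_headers enabled, a few lines like the following in the server configuration or an .htaccess file should add the header to every PDF served (the file pattern is just an example); in PHP, calling header('X-Robots-Tag: noindex, nofollow') before any output does the same for a generated page:

    <FilesMatch "\.pdf$">
        Header set X-Robots-Tag "noindex, nofollow"
    </FilesMatch>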

META Robots tags - crawler and indexer directive

New Robots-nocontent Attribute (Yahoo Only)

This is a class attribute value (a CSS class) that marks text which shouldn't be indexed because it's full of repetitive navigation text, syndicated stories, or other content not really appropriate for search.
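
In Yahoo's implementation, the marker is a class value on any HTML element, something like this (the div and its contents are just an example):

    <div class="robots-nocontent">
      Repeated navigation links, ads, or syndicated boilerplate
      that should not affect how this page is indexed.
    </div>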

This is similar to the site-search convention of using the pseudo-tag <!-- noindex -->junk text<!-- /noindex --> and Ultraseek's <!--stopindex--> and <!--startindex--> comments.


Other proposed extensions of the Robots.txt standard:


For more information about robots on the SearchTools Site:

Robots Information Page
Summary of the most important things about web crawling robots.
Robots.txt Page
Specific information on entries in the robots.txt file, old and new.
META Robots Tag Page
Describes the META Robots tag contents and implications for search indexing robots.
Indexing Robot Checklist
A list of important items for those creating robots for search indexing.
List of Robot Source Code
Links to free and commercial source code for robot indexing spiders.
List of Robot Development Consultants
Consultants who can provide services in this area.
Articles and Books on Robots and Spiders
Overview articles and technical discussions of robot crawlers, mainly for search engines.
SearchTools Robots Testing
Test cases for common robot problems.
Page Updated 2008-07-03