As of January 2012, this site is no longer being updated, due to work and health issues.
The robots protocol is an agreement between people writing robots (mainly for search engines) and people publishing web sites, giving the site owner a way to communicate with and control the robots.
Although voluntary, the protocol is a success because most sites want to be indexed and searched. However, some robots ignore the directives, so any private content should be protected by authentication and access control (user name and password).
The Robots Exclusion Protocol, developed 1994-1997
This describes the robots.txt file, with a Disallow directive to indicate which directories do not welcome robots. It also describes robots META tags, which indicate whether a page should not be indexed and/or whether the links on the page should be followed.
- About /robots.txt at robotstxt.org
- A Standard for Robot Exclusion 1994 consensus of robot authors on the robots mailing list
- 1996 Internet Draft RFC
- W3C HTML 4 Recommendations for robots.txt and robots META tags.
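As a quick illustration of how a well-behaved robot consults these rules, here is a minimal sketch using Python's standard urllib.robotparser; the robots.txt content and URLs are invented for the example, and a real crawler would fetch the file from the site rather than use a literal string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for example.com
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A well-behaved robot checks every URL before requesting it.
print(parser.can_fetch("ExampleBot", "http://example.com/private/report.html"))  # False
print(parser.can_fetch("ExampleBot", "http://example.com/public/index.html"))    # True
```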
Robots Exclusion Protocol, June 2008 Agreement
In June 2008, Yahoo, Google, and MSN Live Search all announced that they would support common extensions to the REP, based on their individual experiences with deviations from the protocol.
- Google Search Blog Announcement
- Yahoo Search Blog Announcement
- Microsoft (MSN) Live Search Blog Announcement
(Why these are all blog announcements rather than actual pages is beyond me. -- Editor)
Robot IP authentication
This new feature is helpful when another robot is masquerading as a search indexing robot but sending multiple requests per second, not following the directives in robots.txt, or otherwise causing site problems. If the robot's IP address is not within a search engine's corporate IP blocks, it may be a rogue and can be blocked without fear of losing the site's place in the search engine index. Yahoo has a nice step-by-step explanation of reverse DNS authentication.
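The reverse DNS check can be sketched in Python. This is illustrative only: the function name and the injectable resolver parameters are my own, added so the logic can be demonstrated offline without live DNS queries. The steps are: reverse-resolve the IP to a hostname, confirm the hostname belongs to the engine's known domain, then forward-resolve that hostname and confirm the original IP appears (so a forged PTR record is not enough to pass).

```python
import socket

def is_authentic_crawler(ip, allowed_suffixes,
                         reverse_lookup=lambda ip: socket.gethostbyaddr(ip)[0],
                         forward_lookup=lambda host: socket.gethostbyname_ex(host)[2]):
    """Reverse/forward DNS authentication of a claimed crawler IP."""
    try:
        host = reverse_lookup(ip)          # step 1: IP -> hostname
    except OSError:
        return False
    # step 2: hostname must end in one of the engine's known domains
    if not any(host.endswith(suffix) for suffix in allowed_suffixes):
        return False
    try:
        return ip in forward_lookup(host)  # step 3: hostname -> IPs must include ip
    except OSError:
        return False

# Offline demonstration with stand-in resolvers (no real DNS traffic):
fake_reverse = lambda ip: "crawl-66-249-66-1.googlebot.com"
fake_forward = lambda host: ["66.249.66.1"]
print(is_authentic_crawler("66.249.66.1", [".googlebot.com"],
                           fake_reverse, fake_forward))  # True
```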
Instructions in robots.txt - all directives to the robot crawler
- User-agent specification
- Disallow: directive
- Allow: directive
- Wildcards in URL paths
- Sitemap location
- Crawl-delay (Yahoo and MSN only)
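A hypothetical robots.txt combining the directives above might look like this; all paths and the sitemap URL are invented for illustration.

```
User-agent: *
# Keep robots out of this directory
Disallow: /private/
# Exception within the disallowed area
Allow: /private/public-note.html
# Wildcard extension: block any URL containing a session id parameter
Disallow: /*?sessionid=
# Yahoo and MSN only: wait 10 seconds between requests
Crawl-delay: 10

# Sitemap location, given as an absolute URL
Sitemap: http://www.example.com/sitemap.xml
```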
X-Robots-Tag for non-HTML documents
The problem with META Robots tags has been that they can only appear in HTML documents, not in plain text, PDF, office documents, audio, or video files.
With this new system, a web server can send the X-Robots-Tag in the HTTP header, which is a standard but invisible part of any response to a browser or robot request. This is not something anyone can type in by hand; it's an automated system. But that also means that webmasters can program this tag into the HTTP header sent for any page, image, video, or other document.
Apache, PHP, and other web servers can easily be configured to send an HTTP header. The syntax is almost exactly the same as the META robots tag: NOINDEX, NOFOLLOW, etc. Apparently, the consensus is that the header tag should have priority over the META tags, although I'm not sure this is a good idea. We will be testing how search engine robots implement priorities in the real world.
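As one sketch of how a webmaster might do this in Apache (with mod_headers enabled), the following sends the tag for every PDF file; the file pattern and directive values are illustrative.

```apache
# Mark all PDF responses with an X-Robots-Tag HTTP header
<FilesMatch "\.pdf$">
    Header set X-Robots-Tag "noindex, noarchive"
</FilesMatch>
```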
META Robots tags - crawler and indexer directive
- NOINDEX still works
- NOFOLLOW still works (crawler directive)
- NOSNIPPET attribute
- NOARCHIVE attribute
- NOODP attribute
- NOYDIR (Yahoo only)
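For reference, these directives appear in a page's head section; a hypothetical page combining several of them might look like this.

```html
<head>
  <title>Example page</title>
  <!-- Do not index this page, do not show a cached copy,
       and ignore the Open Directory Project description -->
  <meta name="robots" content="noindex, noarchive, noodp">
</head>
```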
Robots-nocontent Attribute (Yahoo Only)
A class attribute value marking text which shouldn't be indexed because it's full of repetitive navigation text, syndicated stories, or other content not really appropriate for search.
This is similar to a site-search convention using the pseudo-tags <!-- noindex -->junk text<!-- /noindex --> and Ultraseek's <!--stopindex--> and <!--startindex-->.
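A hypothetical page fragment using Yahoo's syntax, which is an ordinary class value rather than a new tag:

```html
<!-- Yahoo's crawler skips the contents of any element
     carrying the robots-nocontent class value -->
<div class="robots-nocontent">
  Repeated navigation links, boilerplate, syndicated teaser text...
</div>
<p>Main article text that should be indexed normally.</p>
```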
Other proposed extensions of the Robots.txt standard:
- Sean Connor's proposal for An Extended Standard for Robot Exclusion (2002)
- Charles Koller's Robot Exclusion Standard Revisited (1996) (now offline)
For more information about robots on the SearchTools Site:
- Robots Information Page: summary of the most important things about web crawling robots
- Robots.txt Page: specific information on entries in the robots.txt file, old and new
- META Robots Tag Page: describes the META Robots tag contents and implications for search indexing robots
- Indexing Robot Checklist: a list of important items for those creating robots for search indexing
- List of Robot Source Code: links to free and commercial source code for robot indexing spiders
- List of Robot Development Consultants: consultants who can provide services in this area
- Articles and Books on Robots and Spiders: overview articles and technical discussions of robot crawlers, mainly for search engines
- SearchTools Robots Testing: test cases for common robot problems
Page Updated 2008-07-03