As of January 2012, this site is no longer being updated, due to work and health issues.

SearchTools.com

The Robots Exclusion Protocol (REP)


The robots protocol is an agreement between the people who write robots (mainly for search engines) and the people who publish web sites, giving site owners a way to communicate with and control those robots.

The protocol is voluntary, and it works because most sites want to be indexed and searched. However, some robots ignore the directives, so any genuinely private content should be protected by authentication and access control (a user name and password).


The Robots Exclusion Protocol, developed 1994-1997

This describes the robots.txt file, with Disallow directives indicating which directories do not welcome robots. It also describes the robots META tag, which indicates whether a page should be indexed and/or whether the links on the page should be followed.
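
For example, a minimal robots.txt file and a robots META tag might look like this (the directory names are just placeholders):

    # robots.txt -- placed at the root of the site, e.g. http://www.example.com/robots.txt
    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /private/

    <!-- in the <head> of an HTML page: do not index this page, do not follow its links -->
    <meta name="robots" content="noindex, nofollow">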


Robots Exclusion Protocol, June 2008 Agreement

In June 2008, Yahoo, Google, and MSN Live Search all announced that they would support common extensions to the REP, based on their individual experiences with deviations from the protocol.

(Why these are all blog announcements rather than actual pages is beyond me. -- Editor)

Robot IP authentication

This new feature is helpful when another robot is masquerading as a search indexing robot while sending multiple requests per second, ignoring the directives in robots.txt, or otherwise causing site problems. If the robot's IP address is not within a search engine's corporate IP blocks, it may be a rogue, and it can be blocked without fear of losing the site's place in the search engine index. Yahoo has a nice step-by-step explanation of reverse DNS authentication.
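
As a rough sketch, the reverse DNS check works like this (Python; the hostname suffixes below are examples of what the engines were using at the time, and should be verified against each engine's own documentation):

    import socket

    # Example crawler hostname suffixes circa 2008 -- verify against each
    # engine's own documentation before relying on them.
    CRAWLER_SUFFIXES = (".googlebot.com", ".crawl.yahoo.net", ".search.msn.com")

    def is_authentic_crawler(ip):
        """Return True if the requesting IP passes the reverse/forward DNS check."""
        try:
            # Step 1: reverse DNS lookup on the requesting IP address.
            hostname = socket.gethostbyaddr(ip)[0]
        except (socket.herror, socket.gaierror):
            return False
        # Step 2: the hostname must be inside the search engine's own domain.
        if not hostname.endswith(CRAWLER_SUFFIXES):
            return False
        try:
            # Step 3: a forward lookup on that hostname must lead back to the same IP.
            return ip in socket.gethostbyname_ex(hostname)[2]
        except (socket.herror, socket.gaierror):
            return False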

Instructions in robots.txt - all directives to the robot crawler
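
A sketch of a robots.txt file using the commonly supported extended directives might look like this (example.com and the paths are placeholders; check each engine's documentation for exactly which extensions it honors):

    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /*.pdf$        # * and $ wildcard pattern matching
    Allow: /cgi-bin/public/  # Allow overrides a broader Disallow
    Sitemap: http://www.example.com/sitemap.xml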

X-Robots-Tag for non-HTML documents

The problem with META Robots tags has been that they can only appear in HTML documents, not plain text, PDF, office documents, audio, or video files.

With this new system, a web server can send the X-Robots-Tag in the HTTP header, which is a standard but invisible part of any response to a browser or robot request. This is not something anyone types into a page by hand; it is part of the automated server response. But that also means that webmasters can program this tag into the HTTP header sent for any page, image, video, or other document.
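
For example, the headers of a server response for a PDF file might include the tag like this (the date and file type here are arbitrary):

    HTTP/1.1 200 OK
    Date: Thu, 03 Jul 2008 12:00:00 GMT
    Content-Type: application/pdf
    X-Robots-Tag: noindex, nofollow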

Apache, PHP, and other web servers and scripting environments can easily be configured to send such an HTTP header. The syntax is almost exactly the same as the META robots tag: NOINDEX, NOFOLLOW, etc. Apparently, the consensus is that the header tag should have priority over the META tags, although I'm not sure this is a good idea. We will be testing how search engine robots implement these priorities in the real world.
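
For instance, on Apache with mod_headers enabled, a few lines like the following in the server configuration or an .htaccess file should add the header to every PDF served (the file pattern is just an example); in PHP, calling header('X-Robots-Tag: noindex, nofollow') before any output does the same for a generated page:

    <FilesMatch "\.pdf$">
        Header set X-Robots-Tag "noindex, nofollow"
    </FilesMatch>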

META Robots tags - crawler and indexer directive

New Robots-nocontent Attribute (Yahoo Only)

This is a class attribute value (a CSS class) that marks text which shouldn't be indexed because it's full of repetitive navigation text, syndicated stories, or other content not really appropriate for search.
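
In Yahoo's implementation, the marker is a class value on any HTML element, something like this (the div and its contents are just an example):

    <div class="robots-nocontent">
      Repeated navigation links, ads, or syndicated boilerplate
      that should not affect how this page is indexed.
    </div>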

This is similar to the site-search convention of using the pseudo-tag <!-- noindex -->junk text<!-- /noindex --> and Ultraseek's <!--stopindex--> and <!--startindex--> comments.


Other proposed extensions of the Robots.txt standard:


For more information about robots on the SearchTools Site:

Robots Information Page
Summary of the most important things about web crawling robots.
Robots.txt Page
Specific information on entries in the robots.txt file, old and new.
META Robots Tag Page
Describes the META Robots tag contents and implications for search indexing robots.
Indexing Robot Checklist
A list of important items for those creating robots for search indexing.
List of Robot Source Code
Links to free and commercial source code for robot indexing spiders.
List of Robot Development Consultants
Consultants who can provide services in this area.
Articles and Books on Robots and Spiders
Overview articles and technical discussions of robot crawlers, mainly for search engines.
SearchTools Robots Testing
Test cases for common robot problems.
Page Updated 2008-07-03