As of January 2012, this site is no longer being updated, due to work and health issues.

SearchTools.com

About Robots.txt and Search Indexing Robots


Robots, including search indexing tools and intelligent agents, should check a special file in the root of each server called robots.txt, which is a plain text file (not HTML). Robots.txt implements the REP (Robots Exclusion Protocol), which allows the web site administrator to define which parts of the site are off-limits to robots with specific user agent names. Web administrators can Allow access to their web content and Disallow access to cgi, private, and temporary directories, for example, if they do not want pages in those areas indexed.

In June 2008, the webwide search engine companies Yahoo, Google, and Microsoft agreed to extend the Robots Exclusion Protocol. They added an Allow directive, wildcards in URL paths, and a Sitemap link to robots.txt, along with IP authentication to identify search engine indexing robots, the X-Robots-Tag header field for non-HTML documents, and some additional META robots tag attributes.
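
As a sketch of how those extensions look in practice (the paths and sitemap URL below are made up for illustration), a robots.txt file using the 2008 features might read:

User-agent: *
Allow: /archive/summary.html
Disallow: /archive/
Disallow: /*?print=1
Sitemap: http://www.example.com/sitemap.xml

The X-Robots-Tag field is sent in the HTTP response headers of a non-HTML document (a PDF, for instance) rather than in robots.txt, for example:

X-Robots-Tag: noindex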

About the Robots.txt file

The robots.txt file is divided into sections by the robot crawler's User Agent name. Each section includes the name of the user agent (robot) and the paths it may not follow. You should remember that robots may access any directory path in a URL which is not explicitly disallowed in this file: every path not forbidden is allowed.

Note that disallowing robots is not the same as creating a secure area in your site, as only honorable robots will obey the directives and there are plenty of dishonorable ones. Anything you do not want to show to the entire World Wide Web, you should protect with at least a password.

You can usually read a robots.txt file by just requesting it from the server in a browser (for example, www.searchtools.com/robots.txt). If you open that file, you'll see that it's a text file with many entries; I generated them by looking at my server's error reports, because I wanted to keep robots from requesting those paths even occasionally.

The older version is documented in the original REP (Robots Exclusion Protocol), and all robots should recognize and honor its rules in the robots.txt file. The new 2008 REP adds features that may not be recognized by all robot crawlers.
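
As an illustration of how a well-behaved robot applies these rules, here is a minimal sketch in Python using the standard library's robotparser module (the site URL and robot name are placeholders; note that this parser implements the older REP and may not handle the 2008 wildcard extensions):

from urllib.robotparser import RobotFileParser

# Placeholder site and robot name, for illustration only.
parser = RobotFileParser()
parser.set_url("http://www.example.com/robots.txt")
parser.read()  # fetch and parse the robots.txt file

# Before requesting a page, a polite crawler asks whether it is allowed.
if parser.can_fetch("mybot", "http://www.example.com/private/notes.html"):
    print("mybot may fetch this page")
else:
    print("mybot is disallowed from this page")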

Robots.txt Notes


Examples of Robots.txt Entries


User-agent: *
Disallow:

Because nothing is disallowed, everything is allowed for every robot

User-agent: mybot
Disallow: /

Specifically, the mybot robot may not index anything, because the root path (/) is disallowed.

User-agent: *
Allow: /

For all user agents, allow everything (2008 REP update)

User-agent: BadBot
Allow: /About/robot-policy.html
Disallow: /

The BadBot robot can see the robot policy document, but nothing else. All other user-agents are by default allowed to see everything.

This only protects a site if "BadBot" follows the directives in robots.txt

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /private

In this example, all robots can visit the whole site except the /cgi-bin/ and /tmp/ directories and any root-level path that starts with /private, including items in /privatedir/mystuff and the file /privateer.html.

User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /*/private/*

The blank line indicates a new "record": directives for a different user agent.

BadBot should just go away. All other robots can see everything except any subdirectory named "private" (using the wildcard characters).

User-agent: WeirdBot
Disallow: /links/listing.html
Disallow: /tmp/
Disallow: /private/

User-agent: *
Allow: /
Disallow: /temp*
Allow: *temperature*
Disallow: /private/

This keeps the WeirdBot from visiting the listing page in the links directory, the tmp directory and the private directory.

All other robots can see everything except the temp directories or files, but should crawl files and directories named "temperature", and should not crawl private directories. Note that the robots will use the longest matching string, so temps and temporary will match the Disallow, while temperatures will match the Allow.

If you think this is inefficient, you're right.
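
For those who want to see the precedence rule in action, here is a simplified sketch in Python of the longest-match precedence described above; it is not any search engine's actual implementation, and the tie-breaking choice (Allow wins) is an assumption that may vary between crawlers:

import re

def rule_matches(pattern, path):
    # Convert a robots.txt path pattern to a regular expression:
    # '*' matches any run of characters, and the pattern only needs
    # to match a prefix of the URL path.
    regex = "".join(".*" if ch == "*" else re.escape(ch) for ch in pattern)
    return re.match(regex, path) is not None

def is_allowed(path, rules):
    # rules is a list of (directive, pattern) pairs, e.g. ("Disallow", "/temp*").
    matching = [(len(pattern), directive) for directive, pattern in rules
                if rule_matches(pattern, path)]
    if not matching:
        return True  # nothing matched, so the path is allowed by default
    # Longest pattern wins; on a tie, assume Allow wins.
    matching.sort(key=lambda m: (m[0], m[1] == "Allow"), reverse=True)
    return matching[0][1] == "Allow"

rules = [("Allow", "/"), ("Disallow", "/temp*"),
         ("Allow", "*temperature*"), ("Disallow", "/private/")]
print(is_allowed("/temporary/report.html", rules))   # False: /temp* is the longest match
print(is_allowed("/temperatures/june.html", rules))  # True: *temperature* is longer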

Bad Examples - Common Wrong Entries

Use one of the robots.txt checkers below to see whether your file is malformed.

User-agent: googlebot
Disallow /

NO. This entry is missing the colon after Disallow.

User-agent: sidewiner
Disallow: /tmp/

NO. Robots will not recognize misspelled User Agent names (it should be "sidewinder"). Check your server logs for the exact User Agent name, and see the listings of User Agent names below.

User-agent: MSNbot
Disallow: /PRIVATE

WARNING! Many robots and web servers are case-sensitive, so this path will not match root-level folders named private or Private.

User-agent: *
Disallow: /tmp/

User-agent: Weirdbot
Disallow: /links/listing.html
Disallow: /tmp/


WARNING! Robots generally read from top to bottom and stop when they reach a record that applies to them, so Weirdbot would probably stop at the first record, *, and never see its special entry.

If a robot finds a record for its specific User Agent, it does not also check the * (all user agents) record, so any general directives should be repeated in the agent-specific records.
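
One safer arrangement of the same rules, as a sketch, puts the agent-specific record first and repeats the general directive in it:

User-agent: Weirdbot
Disallow: /links/listing.html
Disallow: /tmp/

User-agent: *
Disallow: /tmp/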

Thanks to B. at Ultraseek support for suggesting a "bad examples" section, Enrico for the discussion of precedence, and Melinda, Jim, and AZ for pointing out mistakes in the table of examples, since corrected.

For more information, see the Robots Exclusion Protocol page

Robots.txt syntax checkers

These utilities will read a robots.txt file from a web site and report on any problems or issues:

Google Robots.txt Analyzer (must log into Google Webmaster Tools, Dashboard > Tools)
Recognizes Allow and wildcards, and provides an interactive test to locate errors in the robots.txt syntax without having to wait for the googlebot to read the robots.txt file again.

UKOLN WebWatch /robots.txt checker
As of June 25, 2008, recognizes Allow and wildcard characters, but does not report directives with no user agent, or a semicolon instead of a colon after a directive. University consortium, no advertising.

SearchEnginePromotion's robots.txt Checker
An SEO site, but the ads are static, not blinking. This one recognizes Allow directives but not wildcards, which are flagged as errors. (As of July 2, 2008)

Simon Wilkinson's Robots.txt syntax checker
Formerly of the Tardis project and Botwatch, no ads; flags Allow directives and wildcards as errors, but he may fix that. (As of July 2, 2008)

Motoricerca Robots.txt Syntax Checker
A low-key SEO site, no ads; does not recognize Allow or wildcards. (As of July 2, 2008)

Robotcop

This free Apache server module watches for spiders that read pages disallowed in robots.txt and blocks all further requests from that IP address. It is particularly useful for blocking email address harvesters while still allowing legitimate search engine spiders. Be sure to double-check your robots.txt file (use one or more of the checkers above) before implementing it, and watch your server logs carefully. The August 2002 version (0.6) works with Apache 1.3 on FreeBSD and Linux.

 

Listings of Robot User-agent Names

Note that your robots.txt file does not have to include complete names or version numbers; the protocol says "A case insensitive substring match of the name without version information is recommended." That means you'd do better specifying webcrawler than WebCrawler/3.0 Robot libwww/5.0a.
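
As a sketch of that recommendation (the function and names here are illustrative, not any particular crawler's code), the comparison is a case-insensitive substring test, shown here in Python:

# Illustrative only: case-insensitive substring match of a robots.txt
# User-agent token against a robot's full user-agent string.
def agent_matches(token, full_user_agent):
    return token.lower() in full_user_agent.lower()

print(agent_matches("webcrawler", "WebCrawler/3.0 Robot libwww/5.0a"))  # True
print(agent_matches("WebCrawler/3.0 Robot libwww/5.0a", "WebCrawler/2.0"))  # False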

List of User-Agents (Spiders, Robots, Browser) at user-agents.org
This one is so current, it even has the iPhone Core Media player user-agent. The database is searchable. (2007-07-02)

List of Known Robot User-Agent Fields
Helpful list of user agents with notes about whether the robots are email collectors (spammers). May no longer be updated.

Web Robots Database
List of user-agent names, but it's not at all current.

SearchEngineWatch SpiderSpotting Chart
Definitely antique; it lists Google as "experimental".

Search Engine Robots on jafsoft
Nice, but last updated January 2006.

For more information about robots on the SearchTools Site:

Robots Information Page
Summary of the most important things about web crawling robots
Robots.txt Page (this page)
Specific information on entries in the robots.txt file, old and new rules
META Robots Tag Page
Describes the META Robots tag contents and implications for search indexing robots.
Robots Exclusion Protocol (REP) Page
Links to definitive sources on the Robots Exclusion Protocol, old and new versions
Indexing Robot Checklist
A list of important items for those creating robots for search indexing.
List of Robot Source Code
Links to free and commercial source code for robot indexing spiders
List of Robot Development Consultants
Consultants who can provide services in this area.
Articles and Books on Robots and Spiders
Overview articles and technical discussions of robot crawlers, mainly for search engines.
SearchTools Robots Testing
Test cases for common robot problems.
Page Updated 2008-09-19