Home Guide Tools Listing News Background Search About Us

SearchTools.com

Search Indexing Robots and Robots.txt


Search engine robots check a special file in the root of each server called robots.txt, which is, as you may guess, a plain text file (not HTML). Robots.txt implements the Robots Exclusion Protocol, which allows the web site administrator to define which parts of the site are off-limits to specific robot user agent names. Web administrators can disallow access to CGI, private and temporary directories, for example, because they do not want pages in those areas indexed.

The syntax of this file is obscure to most of us: it tells robots not to look at pages which have certain paths in their URLs. Each section includes the name of the user agent (robot) and the paths it may not follow. There is no way to allow a specific directory, or to specify a kind of file. You should remember that robots may access any directory path in a URL which is not explicitly disallowed in this file: everything not forbidden is OK.
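
The "everything not forbidden is OK" rule can be illustrated with a minimal sketch. It assumes each Disallow value is treated as a plain path prefix; the function name `is_allowed` and the sample paths are made up for illustration and are not part of the standard.

```python
def is_allowed(path, disallowed_prefixes):
    """Return True unless the path starts with a non-empty disallowed prefix."""
    for prefix in disallowed_prefixes:
        # An empty Disallow value forbids nothing, so it is skipped.
        if prefix and path.startswith(prefix):
            return False
    # Anything not explicitly disallowed is OK.
    return True


rules = ["/cgi-bin/", "/tmp/", "/private/"]
print(is_allowed("/index.html", rules))          # not under any listed path
print(is_allowed("/private/notes.html", rules))  # under /private/
print(is_allowed("/anything.html", [""]))        # empty Disallow: nothing blocked
print(is_allowed("/anything.html", ["/"]))       # "/" is a prefix of every path
```

Because the match is a simple prefix test, there is no way in this scheme to re-allow a subdirectory or to single out a file type.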

You can usually read this file by requesting it from the server in a browser (for example, www.searchtools.com/robots.txt). It displays as a simple text page and is easy to read.

This is all documented in the Standard for Robot Exclusion, and all robots should recognize and honor the rules in the robots.txt file.

Entry Meaning
User-agent: *
Disallow:

The asterisk (*) in the User-agent field is shorthand for "all robots". Because nothing is disallowed, everything is allowed.

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /private/

In this example, all robots can visit every directory except the three mentioned.

User-agent: BadBot
Disallow: /

In this case, the BadBot robot is not allowed to see anything. The slash is shorthand for "all directories".

The User-agent value can be any unique substring of the robot's name, and robots are supposed to ignore capitalization.

User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /private/

The blank line indicates a new "record": a new user agent command.

BadBot should just go away. All other robots can see everything except the "private" folder.

User-agent: WeirdBot
Disallow: /tmp/
Disallow: /private/
Disallow: /links/listing.html

User-agent: *
Disallow: /tmp/
Disallow: /private/

This keeps the WeirdBot from visiting the listing page in the links directory, the tmp directory and the private directory.

All other robots can see everything except the tmp and private directories.

If you think this is inefficient, you're right!
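
One way to confirm that a file like the WeirdBot example behaves as intended is to run it through a parser. The sketch below uses Python's standard `urllib.robotparser` module, which is only one implementation and may not match every robot's behavior. The host `www.example.com`, the robot name `OtherBot` and the helper `allowed` are placeholders chosen for illustration.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: WeirdBot
Disallow: /tmp/
Disallow: /private/
Disallow: /links/listing.html

User-agent: *
Disallow: /tmp/
Disallow: /private/
"""

parser = RobotFileParser()
# parse() takes the file's lines directly, so no network request is made.
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent, path):
    """Ask the parser whether the given agent may fetch the given path."""
    # www.example.com is a placeholder host.
    return parser.can_fetch(agent, "http://www.example.com" + path)


print(allowed("WeirdBot", "/links/listing.html"))  # blocked for WeirdBot only
print(allowed("OtherBot", "/links/listing.html"))  # other robots may visit it
print(allowed("OtherBot", "/tmp/scratch.html"))    # blocked for everyone
```

The shared /tmp/ and /private/ lines have to be repeated in the WeirdBot record because a robot that finds a record naming it uses that record instead of the `*` record.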

Bad Examples - Common Wrong Entries
Use one of the robots.txt checkers to see if your file is malformed.
User-agent: *
Disallow /

NO! This entry is missing the colon after "Disallow".

User-agent: *
Disallow: *

NO! If you want to disallow everything, use a slash (indicating the root directory).

User-agent: sidewiner
Disallow: /tmp/

NO! Robots will ignore misspelled User Agent names. Check your server logs and the listings of User Agent names.

User-agent: *
Disallow: /tmp/

User-agent: Weirdbot
Disallow: /links/listing.html
Disallow: /tmp/

NO! Robots generally read from top to bottom and stop when they reach something that applies to them. So Weirdbot would stop at the first record, *, instead of seeing its special entry.
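
The ordering problem can be reproduced with a deliberately naive reader that stops at the first record that applies. This is a simplified sketch of the top-to-bottom behavior described above, not the code of any real robot; `parse_records` and `first_match_allowed` are hypothetical helper names.

```python
def parse_records(text):
    """Split robots.txt text into (agents, disallowed_paths) records."""
    records, agents, paths = [], [], []
    # The extra "" guarantees the final record is closed.
    for raw in text.splitlines() + [""]:
        line = raw.strip()
        if not line:
            # A blank line closes the current record.
            if agents:
                records.append((agents, paths))
            agents, paths = [], []
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            agents.append(value.lower())
        elif field == "disallow" and value:
            paths.append(value)
    return records


def first_match_allowed(text, robot_name, path):
    """Apply only the first record, top to bottom, that matches the robot."""
    name = robot_name.lower()
    for agents, paths in parse_records(text):
        if any(agent == "*" or agent in name for agent in agents):
            return not any(path.startswith(p) for p in paths)
    return True


WRONG_ORDER = """\
User-agent: *
Disallow: /tmp/

User-agent: Weirdbot
Disallow: /links/listing.html
Disallow: /tmp/
"""

RIGHT_ORDER = """\
User-agent: Weirdbot
Disallow: /links/listing.html
Disallow: /tmp/

User-agent: *
Disallow: /tmp/
"""

# With "*" first, Weirdbot never reaches its own record.
print(first_match_allowed(WRONG_ORDER, "Weirdbot", "/links/listing.html"))
# With the specific record first, the listing page is blocked as intended.
print(first_match_allowed(RIGHT_ORDER, "Weirdbot", "/links/listing.html"))
```

Putting the specific records before the `*` record gives the intended result for this kind of reader.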

Thanks to Enrico Altavilla for pointing out this problem in my own robots.txt file!

Thanks to B. at Ultraseek support for suggesting a "bad examples" section.

The official guidelines were written up in 1996 or so:

Robots.txt Checkers

Robotcop
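
To give a rough idea of what such a checker looks for, here is a minimal sketch that flags two of the mistakes shown in the bad examples: a missing colon and `Disallow: *`. It assumes only the User-agent and Disallow fields described on this page are valid and that `#` starts a comment; `check_robots_txt` is a hypothetical name, and this is not how Robotcop or any listed tool is implemented.

```python
def check_robots_txt(text):
    """Return a list of (line_number, message) warnings for common mistakes."""
    warnings = []
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        if ":" not in line:
            warnings.append((number, "missing colon after field name"))
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        if field.lower() not in ("user-agent", "disallow"):
            # Assumption: only the two fields covered on this page are valid.
            warnings.append((number, "unknown field: " + field))
        elif field.lower() == "disallow" and value == "*":
            warnings.append((number, "use 'Disallow: /' to block everything"))
    return warnings


for problem in check_robots_txt("User-agent: *\nDisallow /\n"):
    print(problem)
```

A sketch like this cannot catch a misspelled user agent name, since any name is syntactically valid; for that you still need your server logs and the robot listings.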

  • Agents and Robots List - WebReference.com: lightly annotated listing of agent and robot software.
  • Search Engine Robots: lists of search engines, agent names and their information links, updated fairly frequently.
Page Updated 2005-09-09

SearchTools.com - Copyright © 1998-2007 Search Tools Consulting
This work is provided under a Creative Commons Sampling Plus 1.0 License.