Search engine robots will check a special file in the root of each server called robots.txt, which is, as you may guess, a plain text file (not HTML). Robots.txt implements the Robots Exclusion Protocol, which allows the web site administrator to define what parts of the site are off-limits to specific robot user agent names. Web administrators can disallow access to cgi, private and temporary directories, for example, because they do not want pages in those areas indexed.
The syntax of this file is obscure to most of us: it tells robots not to look at pages which have certain paths in their URLs. Each section includes the name of the user agent (robot) and the paths it may not follow. There is no way to allow a specific directory, or to specify a kind of file. You should remember that robots may access any directory path in a URL which is not explicitly disallowed in this file: everything not forbidden is OK.
You can usually read this file by just requesting it from the server in a browser (for example, www.searchtools.com/robots.txt). It appears as a simple text page and is easy to read.
This is all documented in the Standard for Robot Exclusion, and all robots should recognize and honor the rules in the robots.txt file.
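The "everything not forbidden is OK" rule amounts to a prefix test on the URL path. The following is only a minimal sketch of that idea, not a full parser: the function name is_allowed and the sample paths are made up for illustration.

```python
def is_allowed(path, disallowed_prefixes):
    """Return True unless the path starts with one of the disallowed prefixes."""
    for prefix in disallowed_prefixes:
        # An empty Disallow value forbids nothing, so it is skipped.
        if prefix and path.startswith(prefix):
            return False
    # Anything not explicitly disallowed is open to the robot.
    return True


# Hypothetical Disallow values for one user agent record.
rules = ["/cgi-bin/", "/tmp/", "/private/"]

print(is_allowed("/private/notes.html", rules))
print(is_allowed("/products/index.html", rules))
```

The first call reports that the path is blocked because it begins with /private/, while the second path matches no prefix and is therefore allowed.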
Entry:
    User-agent: *
    Disallow:
Meaning: The asterisk (*) in the User-agent field is shorthand for "all robots". Because nothing is disallowed, everything is allowed.

Entry:
    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /tmp/
    Disallow: /private/
Meaning: In this example, all robots can visit every directory except the three mentioned.

Entry:
    User-agent: BadBot
    Disallow: /
Meaning: In this case, the BadBot robot is not allowed to see anything. The slash is shorthand for "all directories". The User Agent can be any unique substring, and robots are not supposed to care about capitalization.
Entry:
    User-agent: BadBot
    Disallow: /

    User-agent: *
    Disallow: /private/
Meaning: The blank line indicates a new "record" - a new user agent command. BadBot should just go away. All other robots can see everything except the "private" folder.
Entry:
    User-agent: WeirdBot
    Disallow: /tmp/
    Disallow: /private/
    Disallow: /links/listing.html

    User-agent: *
    Disallow: /tmp/
    Disallow: /private/
Meaning: This keeps the WeirdBot from visiting the listing page in the links directory, the tmp directory and the private directory. All other robots can see everything except the tmp and private directories. If you think this is inefficient, you're right!
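One way to check how such records are read is Python's standard urllib.robotparser module. The sketch below feeds it the WeirdBot example and asks which pages each robot may fetch. The host www.example.com, the robot name OtherBot, and the helper build_parser are placeholders chosen for illustration, and other robots may interpret a file differently from this module.

```python
from urllib.robotparser import RobotFileParser

# The WeirdBot example, with a blank line between the two records.
ROBOTS_TXT = """\
User-agent: WeirdBot
Disallow: /tmp/
Disallow: /private/
Disallow: /links/listing.html

User-agent: *
Disallow: /tmp/
Disallow: /private/
"""


def build_parser(text):
    """Parse robots.txt text that is already in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
base = "http://www.example.com"  # placeholder host

for agent in ("WeirdBot", "OtherBot"):
    for path in ("/links/listing.html", "/tmp/scratch.html", "/index.html"):
        print(agent, path, parser.can_fetch(agent, base + path))
```

With this file, only WeirdBot is refused the listing page; both robots are refused anything under /tmp/ and allowed the home page.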
Bad Examples - Common Wrong Entries
Use one of the robots.txt checkers to see if your file is malformed.

Entry:
    User-agent: *
    Disallow /
Meaning: NO! This entry is missing the colon after the disallow.

Entry:
    User-agent: *
    Disallow: *
Meaning: NO! If you want to disallow everything, use a slash (indicating the root directory).

Entry:
    User-agent: sidewiner
    Disallow: /tmp/
Meaning: NO! Robots will ignore misspelled User Agent names. Check your server logs and the listings of User Agent names.

Entry:
    User-agent: *
    Disallow: /tmp/

    User-agent: Weirdbot
    Disallow: /links/listing.html
    Disallow: /tmp/
Meaning: NO! Robots generally read from top to bottom and stop when they reach something that applies to them. So Weirdbot would stop at the first record, *, instead of seeing its special entry.
Thanks to Enrico Altavilla for pointing out this problem in my own robots.txt file!
Thanks to B. at Ultraseek support for suggesting a "bad examples" section.
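Several of the mistakes above can be caught mechanically. The following is a rough sketch of such a check, not a replacement for the checkers listed on this page: the function name lint_robots_txt and its messages are invented for illustration, and it cannot detect a misspelled User Agent name, because that requires a list of known robots.

```python
def lint_robots_txt(text):
    """Return (line_number, message) pairs for a few common robots.txt mistakes."""
    problems = []
    seen_wildcard = False
    for number, raw in enumerate(text.splitlines(), start=1):
        # Drop comments and surrounding whitespace; skip blank lines.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        if ":" not in line:
            problems.append((number, "missing colon after field name"))
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "disallow" and value == "*":
            problems.append((number, "use 'Disallow: /' to block everything"))
        elif field == "user-agent":
            if value == "*":
                seen_wildcard = True
            elif seen_wildcard:
                # A robot that stops at the first matching record never gets here.
                problems.append((number, "specific robot listed after the * record"))
    return problems


sample = "User-agent: *\nDisallow /\n"
for number, message in lint_robots_txt(sample):
    print(f"line {number}: {message}")
```

Run on the sample, this reports the missing colon on line 2. A file that lists its specific robots first and the * record last produces no warnings.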
The official guidelines were written up in 1996 or so:
- Standard for Robot Exclusion
- Guidelines for Robot Writers
- Web Server Administrator's Guide to the Robots Exclusion Protocol.
Robots.txt Checkers
- Enrico Altavilla's very helpful Robots.txt checker reports even more errors.
- SearchEngineWorld robots.txt syntax checker, Tutorial, and Most Frequent Problems pages
- BotWatch robots.txt syntax checker
- UK Office for Library and Information Networking - WebWatch Robots.txt checker.
- RoboGen visual editor for Robots Exclusion files, allowing users to choose folders and files interactively, manage multiple domains and recognize large numbers of user agents (robot self-identifiers).
Robotcop
This free server module watches for spiders which read pages disallowed in robots.txt, and blocks all further requests from that IP address. It is particularly useful for blocking email address harvesters, while still allowing legitimate search engine spiders. Be sure to double-check your robots.txt file (use one or more of the checkers above), before implementing it, and to watch your server logs carefully. The August 2002 version (0.6) works with Apache 1.3 on FreeBSD and Linux.
Note that your robots.txt file does not have to include complete names or version numbers -- the standard says "A case insensitive substring match of the name without version information is recommended."
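The recommended matching can be pictured as follows. This is only a sketch of one reasonable reading of that sentence: the function name record_applies is made up, and real robots may implement the comparison differently.

```python
def record_applies(record_name, robot_user_agent):
    """Decide whether a User-agent line in robots.txt applies to a robot."""
    if record_name == "*":
        return True  # the asterisk matches every robot
    # Ignore version information such as "/2.0" in the robot's own name.
    product = robot_user_agent.split("/", 1)[0]
    # Case-insensitive substring match of the name.
    return record_name.lower() in product.lower()


print(record_applies("weirdbot", "WeirdBot/2.0"))
print(record_applies("Weird", "WeirdBot/2.0"))
print(record_applies("WierdBot", "WeirdBot/2.0"))
```

The first two calls match, because capitalization and the version number are ignored and a partial name is enough. The third does not, which is why a misspelled name in robots.txt is silently skipped.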
List at robotstxt.org, may not be current. Displays User Agent and host names for webwide search engine robot spiders.
- Agents and Robots List - WebReference.com
- lightly annotated listing of agent and robot software
- Search Engine Robots
- Lists of search engines, agent names and their information links, updated fairly frequently.
There are a few proposed extensions of the Robots.txt standard, but they have been pretty quiet lately:
- Martijn Koster's 1996 RFC Draft Memo on Web Robots Control
- Sean Connor's proposal for An Extended Standard for Robot Exclusion (version 2.0)
- Charles Koller's Robot Exclusion Standard Revisited (1996)
For more information on robots on the SearchTools Site:
- Summary of the most important things about web crawling robots.
- Describes the META Robots tag contents and implications for search indexing robots.
- A list of important items for those creating robots for search indexing.
- Links to free and commercial source code for robot indexing spiders.
- Consultants who can provide services in this area.
- Overview articles and technical discussions of robot crawlers, mainly for search engines.
- Test cases for common robot problems.
Page Updated 2005-09-09