
The Elements of Robots.txt

Robots.txt is the expression of the Robots Exclusion Protocol, an agreement between web site publishers and the writers of search engine and other robots that lets a site keep automated agents away from areas that are inappropriate for robot access.

See also the Robots.txt page and the Robots Meta page

User-agent

The User-agent is the name of the client (browser or robot), sent as part of an HTTP request (as specified in the HTTP RFC). It appears in web site log files as a name and version number, such as Internet Explorer, Opera, or Mozilla (browsers), or Slurp, Googlebot, or MSNBot (search engine robots). In a robots.txt file, directives may be aimed at all robot clients (User-agent: *) or at a specific one (User-agent: Slurp). In addition to browsers and the three biggest webwide search engines, many other robots may request content from a site. (See links to listings of User-agents.)
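
For example, a robots.txt file can contain separate sections, each introduced by a User-agent line; the directives that follow (described below) apply only to the named robot. The paths in this sketch are hypothetical:

User-agent: *
Disallow: /private/

User-agent: Slurp
Disallow: /drafts/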

Disallow

This directive indicates that robots should not access the specified directory, subdirectory, or file. The path is expressed as either a full URL path from the root directory or a right-truncated string (in the original REP); the new REP also supports wildcards in URL paths.

User-Agent: googlebot
Disallow: /corrected-pages-old-versions/

More examples

Allow

This directive specifically allows robots to follow certain paths when crawling. This means you can Disallow a section of a site and still Allow a specific subsection (or the reverse: Allow a section and Disallow a single file within it). Here's an example that combines these concepts, applied to all robot crawlers:

User-Agent: *
Allow: /products/specsheets/
Disallow: /products/printable/
Allow: /products/printable/cartoons/

Allow: /forum/
Disallow: /forum/calendar_week.asp

While the previous robot exclusion protocol assumed that anything not disallowed was allowed, the new rules make this more explicit. All three major web search engines plan to compare each URL to all directives and apply the directive with the longest matching path (such as the "cartoons" rule in the example above); they indicate that the order of directives in the robots.txt file doesn't matter.
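
To illustrate longest-match behavior with the example above (the page names here are hypothetical):

/products/printable/cartoons/strip1.html
    matches both Disallow: /products/printable/ and Allow: /products/printable/cartoons/
    the Allow path is the longer match, so the page may be crawled

/products/printable/pricelist.html
    matches only Disallow: /products/printable/, so the page is excluded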

Wildcards

Wildcards, typed as asterisks (*), match any number of characters (including zero) in URL paths. They make it easy to use patterns to direct the robots to the parts of a site that should be indexed, while keeping them away from areas that should not. The $ character specifies that the pattern must match at the end of the file path.

Disallow: *.doc$
Disallow: /research/*/old_version/*
Allow: /research/findings/*

Note that wildcards can be tricky: a pattern that looks specific may unexpectedly match other strings (a rule written for "path", for example, also matches "pathetic"). How simple wildcards are to use depends a great deal on your site's URL structure and information architecture. SearchTools.com will be setting up a robots wildcard test suite and will report on the results.
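
For example, a hypothetical directive such as:

Disallow: /print

blocks not only /print/ but also /printers/ and /printable/brochure.html. Adding a trailing slash or anchoring the pattern with $ (Disallow: /print/ or Disallow: /print$) keeps the rule from matching more than intended.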

Sitemap location

Sitemaps are more than just lists of URLs in XML format. They can contain information about specific pages, such as the last-modified date (helpful if your web server doesn't keep proper track of content change dates) and each page's expected change frequency. A priority tag tells the crawler which pages you think are most important, although it's unlikely that any of the larger search engines will use it for relevance ranking or even recrawl control. Sitemaps.org has the complete schema and details about the options.

Sitemap: http://example.com/mainsitemap.xml
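
A minimal sketch of a sitemap entry using those optional tags (the URL and values here are hypothetical; see Sitemaps.org for the authoritative schema):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://example.com/products/specsheets/widget.html</loc>
    <lastmod>2008-06-01</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>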

Yahoo and MS Live Robots Only

Crawl-Delay

This directive allows a site to limit the robot crawl rate. It's useful if the site cannot handle multiple requests per second, perhaps because robot crawling is slowing down transactions for customers or other users. The delay can be set separately in each User-agent section; the value is an amount of time in seconds, with at most one decimal place. So to tell Yahoo's Slurp to send only one request every three seconds, use the number 3 (Yahoo instructions)

User-agent: Slurp
Crawl-delay: 3.0

To tell Microsoft Live Search to send at most two requests per second, use 0.5 (MSN instructions)

User-agent: MSNBot
Crawl-delay: 0.5

For more information about robots on the SearchTools Site:

Robots Information
Introduction to web crawling robots for search indexing and other purposes
Robot Exclusion Protocol (REP)
Information on the original protocol and the June 2008 search engines extensions
Elements of Robots.txt
Detailed description of the robots.txt directives and path options
Robots.txt Details
Practical notes on implementing robots.txt
META Robots Tag Page
Describes the META Robots tag contents and implications for search indexing robots.
Indexing Robot Checklist
A list of important items for those creating robots for search indexing.
List of Robot Source Code
Links to free and commercial source code for robot indexing spiders
List of Robot Development Consultants
Consultants who can provide services in this area.
Articles and Books on Robots and Spiders
Overview articles and technical discussions of robot crawlers, mainly for search engines.
SearchTools Robots Testing
Test cases for common robot problems.

Page Updated: 2008-06-13




SearchTools.com - Copyright © 2008-2009 Search Tools Consulting.
This work is licensed under a Creative Commons Attribution-Share Alike 3.0 Unported License.