The Elements of Robots.txt
Robots.txt is the expression of the Robots Exclusion Protocol, an agreement between web site publishers and the writers of search engine and other robots that automated clients will avoid certain areas of a site which may be inappropriate for automated access.
See also the Robots.txt page and the Robots Meta page
User-agent
The User-agent is the name of the client (browser or robot), sent as part of an HTTP request (as defined in the HTTP RFC). It appears in web site log files with a name and version number, such as Internet Explorer, Opera, or Mozilla (browsers), or Slurp, Googlebot, or MSNBot (search engine robots). In a robots.txt file, directives may be aimed at all robot clients (User-agent: *) or at a specific one (User-agent: slurp). In addition to browsers and the three biggest webwide search engines, many other robots may request content from a site. (See links to listings of User-agents)
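As a small illustration (this sketch is not part of the original article), Python's standard-library robots.txt parser can show how per-User-agent groups apply to different clients; the rules, robot names, and paths below are made up:

from urllib import robotparser

# Hypothetical robots.txt: one group for all robots, one just for Slurp.
rules = """\
User-agent: *
Disallow: /private/

User-agent: Slurp
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("Googlebot", "/private/report.html"))  # False: covered by the * group
print(rp.can_fetch("Googlebot", "/public/index.html"))    # True
print(rp.can_fetch("Slurp", "/public/index.html"))        # False: the Slurp group blocks everything

A robot that does not match any named group falls back to the User-agent: * group, as Googlebot does here.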
Disallow
This directive indicates that robots should not access the specified directory, subdirectory or file. In the original REP the value is either a full URL path from the root directory or a right-truncated prefix; the newer REP extensions also support wildcards in URL paths.
User-Agent: googlebot
Disallow: /corrected-pages-old-versions/
Allow
This directive specifically allows robots to follow certain paths when crawling. This means you can Allow a section of a site, and Disallow a specific subsection. Here's an example that combines these concepts, applied to all robot crawlers:
User-Agent: *
Allow: /products/specsheets/
Disallow: /products/printable/
Allow: /products/printable/cartoons/
Allow: /forum/
Disallow: /forum/calendar_week.asp
While the previous robot exclusion protocol assumed that anything not disallowed was allowed, the new rules make this explicit. All three major web search engines plan to compare URLs against all directives and apply the directive with the longest matching path (such as "cartoons" in the example above); they indicate that the order of directives in the robots.txt file does not matter.
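The longest-match rule can be sketched in a few lines of Python (an illustration, not any engine's actual code); it uses plain prefix matching and assumes that ties go to the less restrictive Allow directive:

# Rules from the example above, as (directive, path) pairs.
rules = [
    ("Allow", "/products/specsheets/"),
    ("Disallow", "/products/printable/"),
    ("Allow", "/products/printable/cartoons/"),
    ("Allow", "/forum/"),
    ("Disallow", "/forum/calendar_week.asp"),
]

def is_allowed(path):
    # Keep every rule whose path is a prefix of the requested path,
    # then apply the one with the longest path; no match means allowed.
    matches = [(len(p), d) for d, p in rules if path.startswith(p)]
    if not matches:
        return True
    # On equal lengths, prefer Allow (assumed tie-breaking).
    longest = max(matches, key=lambda m: (m[0], m[1] == "Allow"))
    return longest[1] == "Allow"

print(is_allowed("/products/printable/cartoons/strip.html"))  # True: the Allow rule is longer
print(is_allowed("/products/printable/manual.pdf"))           # False: only the Disallow rule matches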
Wildcards
Wildcards, typed as asterisks (*), are used to specify any number of characters (including zero) in URL paths. They make it easy to use patterns to direct the robots to the parts of a site that should be indexed, but keep them away from areas that should not. The $ character specifies that the matched pattern must fall at the end of the file path.
Disallow: *.doc$
Disallow: /research/*/old_version/*
Allow: /research/findings/*
Note that wildcards can be tricky, as some strings that seem obvious may unexpectedly match other strings (e.g. "path" and "pathetic"). Simplicity in using wildcards depends a great deal on your site URL structure and information architecture. SearchTools.com will be setting up a robots wildcard test suite and will report on the results.
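One way to see how * and $ behave is to translate a pattern into a regular expression, as in this Python illustration (a sketch of the matching semantics described above, not any engine's implementation):

import re

def robots_pattern_to_regex(pattern):
    # * matches any run of characters (including none); a trailing $
    # anchors the pattern to the end of the URL path.
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))

doc_rule = robots_pattern_to_regex("*.doc$")
print(bool(doc_rule.match("/research/report.doc")))   # True
print(bool(doc_rule.match("/research/report.docx")))  # False: $ requires the path to end in .doc

old_rule = robots_pattern_to_regex("/research/*/old_version/*")
print(bool(old_rule.match("/research/2007/old_version/intro.html")))  # True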
Sitemap location
Sitemaps are more than just lists of URLs in XML format. They can contain information about specific pages, such as the last modified date (helpful if your web server doesn't keep proper track of content change dates) and that page's expected change frequency. A priority tag tells the crawler which pages you think are most important, although it's unlikely that any of the larger search engines will use it for relevance ranking or even recrawl control. Sitemaps.org has the complete schema and details about the options.
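As a small illustration (not from the original article; the URL and dates are made up), a one-entry sitemap can be written with Python's standard library; see Sitemaps.org for the full schema:

import xml.etree.ElementTree as ET

# One <url> entry with the optional lastmod, changefreq and priority tags.
urlset = ET.Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
url = ET.SubElement(urlset, "url")
ET.SubElement(url, "loc").text = "http://example.com/products/specsheets/"
ET.SubElement(url, "lastmod").text = "2008-06-13"
ET.SubElement(url, "changefreq").text = "monthly"
ET.SubElement(url, "priority").text = "0.8"

ET.ElementTree(urlset).write("mainsitemap.xml", encoding="utf-8", xml_declaration=True)

The Sitemap directive in robots.txt then tells crawlers where to find the finished file: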
Sitemap: http://example.com/mainsitemap.xml
Yahoo and MS Live Robots Only
Crawl-Delay
This directive allows a site to limit the robot crawl rate. It is useful if the site cannot handle multiple requests per second, for example when rapid robot crawling slows down transactions for customers or other users. It can be set separately in each User-agent section, and the value is a time in seconds, with up to one decimal place. So to tell Yahoo to send only one request every three seconds, use the value 3 (Yahoo instructions):
User-agent: Slurp
Crawl-delay: 3.0
To tell Microsoft Live Search to send two requests per second, use 0.5 (MSN instructions):
User-agent: MSNBot
Crawl-delay: 0.5
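A polite crawler can read the delay and pause between requests, as in this Python sketch (an illustration; the robot name and paths are hypothetical, and note that the standard-library parser only recognizes whole-second Crawl-delay values):

import time
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse("""\
User-agent: MSNBot
Crawl-delay: 1
""".splitlines())

delay = rp.crawl_delay("MSNBot") or 1.0      # fall back to a polite default
for path in ["/page1.html", "/page2.html"]:  # hypothetical URLs
    if rp.can_fetch("MSNBot", path):
        # ... fetch the page here ...
        time.sleep(delay)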
For more information about robots on the SearchTools Site:
- Robots Information: Introduction to web crawling robots for search indexing and other purposes
- Robot Exclusion Protocol (REP): Information on the original protocol and the June 2008 search engine extensions
- Elements of Robots.txt: Detailed description of the robots.txt directives and path options
- Robots.txt Details: Practical notes on implementing robots.txt
- META Robots Tag Page: Describes the META Robots tag contents and implications for search indexing robots.
- Indexing Robot Checklist: A list of important items for those creating robots for search indexing.
- List of Robot Source Code: Links to free and commercial source code for robot indexing spiders
- List of Robot Development Consultants: Consultants who can provide services in this area.
- Articles and Books on Robots and Spiders: Overview articles and technical discussions of robot crawlers, mainly for search engines.
- SearchTools Robots Testing: Test cases for common robot problems.
Page Updated: 2008-06-13
SearchTools.com - Copyright © 2008-2009 Search Tools Consulting.
This work is licensed under a Creative Commons Attribution-Share Alike 3.0 Unported License.