Robots.txt
Robots.txt is a standard file that lets webmasters control which directories are available to web robots. For more information, see the Robots.txt Guide Page.
To test how well a robot obeys Robots.txt, we made a link to a page that our robots.txt file indicates should not be indexed. In this case it's the robots-test file in the /test/robots/disallow subdirectory. The robots.txt file for this site includes these lines:
User-agent:*
Disallow: /test/robots/disallow/
Any robot that indexes the pages in this directory is disobeying this rule. The test page contains the term R Test 101 (without spaces), so any search engine that returns a result for that term has indexed the page and is disobeying the robots.txt directive.
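As a rough illustration of what a cooperating robot does with this rule, the sketch below feeds the two lines to Python's standard urllib.robotparser module and asks whether a path may be fetched. The robot name ExampleBot and the file name robots-test.html are assumptions made for the example; they are not taken from this site.

```python
from urllib.robotparser import RobotFileParser

# The rule quoted above, as it would appear in the site's robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /test/robots/disallow/
"""


def is_allowed(path, user_agent="ExampleBot"):
    """Return True if a cooperating robot may fetch the given path.

    "ExampleBot" is a hypothetical robot name used only for illustration.
    """
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    # The first path is a hypothetical file inside the disallowed directory.
    for path in ("/test/robots/disallow/robots-test.html", "/test/robots/"):
        print(path, is_allowed(path))
```

A robot that runs a check like this before each request and skips every path for which it returns False would never reach the test page.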
Note that robots.txt controls apply only to cooperating robots: malicious ones may ignore the disallows or even use them as pointers to interesting data. The only way to be sure that robots will not index a page is to use access controls; see our password protection tests.
Robots META Tag
In addition to server-wide robot control, web page creators can also specify that certain pages should not be indexed by search engine robots, or that the links on the page should not be followed by robots, using the Robots META tag.
The following pages test whether search indexing robots correctly obey the commands in the Robots META tag.
- NoIndex: this page should not be indexed (code R Test 102), but the robot should follow the link from it to the meta-follow-target page (R Test 103).
- NoFollow: this page should be indexed, but the robot should not follow the link on it to the meta-nofollow-target page (R Test 104)
- NoIndex and NoFollow: this page should not be indexed (R Test 105), and the robot should not follow the link from it to the meta-follow-target2 page either (R Test 106).
- NoArchive: this page should be indexed by search engines, but the "cache" link should not be displayed. (R Test 1100)
- NoSnippet: this page should be indexed, but the way I read the announcement, there should be nothing in the search engine results listing: no match terms from the context, meta description tag content, or ODP description (R Test 1101). Having the description tag would be great, so I'm testing it too (R Test 11015).
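To make the directives above concrete, here is a minimal sketch of how a robot might read the Robots META tag from a page, using Python's standard html.parser module. The class and function names are invented for this example, and the sketch only collects the comma-separated values; it does not claim to reproduce how any particular search engine interprets them.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").strip().lower() != "robots":
            return
        # The content attribute is a comma-separated list of directives.
        for token in attr.get("content", "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def robots_meta(html):
    """Return the set of lowercase Robots META directives found in html."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


if __name__ == "__main__":
    page = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
    found = robots_meta(page)
    print("index allowed:", "noindex" not in found)
    print("follow allowed:", "nofollow" not in found)
```

A robot would then skip indexing when "noindex" is present, skip the page's links when "nofollow" is present, and so on for "noarchive" and "nosnippet".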
X-Robots-Tag
The X-Robots-Tag is a relatively new concept. It carries the same values as the Robots META tag and can be inserted into an HTTP response header by a script, application, CGI, Apache .htaccess, or any other automated web response tool. This means a site can control robots' access to and indexing of non-HTML documents, and can automate instructions for all pages without changing their content. For more information, see my X-Robots-Tag page.
- NoIndex: this page should not be indexed (code R Test 1102), but the robot should follow the link from it to the meta-follow-target page (R Test 1103).
- NoFollow: this page should be indexed, but the robot should not follow the link on it to the meta-nofollow-target page (R Test 1104)
- NoIndex and NoFollow: this page should not be indexed (R Test 105), and the robot should not follow the link from it to the meta-follow-target2 page either (R Test 1106).
- NoArchive: this page should be indexed by search engines, but the "cache" link should not be displayed. (R Test 1107)
- NoSnippet: this page should be indexed, but the way I read the announcement, there should be nothing in the search engine results listing: no match terms from the context, meta description tag content, or ODP description (R Test 1108). Having the description tag would be great, so I'm testing it too (R Test 11085).
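Because the X-Robots-Tag travels in the HTTP response header rather than in the page, a robot has to inspect the headers before deciding what to do with the body. The sketch below shows one way that check could look in Python; the function name is invented for this example, and it assumes the headers arrive as (name, value) pairs, which is the shape returned by response.headers.items() in urllib.request.

```python
def x_robots_directives(headers):
    """Return the lowercase X-Robots-Tag directives found in the headers.

    headers is an iterable of (name, value) pairs from an HTTP response.
    Header names are compared case-insensitively, and several
    X-Robots-Tag headers on one response are merged into one set.
    """
    directives = set()
    for name, value in headers:
        if name.strip().lower() != "x-robots-tag":
            continue
        # Like the META tag, the value is a comma-separated list.
        for token in value.split(","):
            token = token.strip().lower()
            if token:
                directives.add(token)
    return directives


if __name__ == "__main__":
    # Hypothetical response headers for a non-HTML document.
    sample = [
        ("Content-Type", "application/pdf"),
        ("X-Robots-Tag", "noindex, nofollow"),
    ]
    found = x_robots_directives(sample)
    print("index allowed:", "noindex" not in found)
    print("follow allowed:", "nofollow" not in found)
```

The same directive names apply as for the Robots META tag, so the result can be handled by whatever logic the robot already uses for that tag.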
Page Modified: 2011-01-13