SearchTools.com tests how well search indexing robots handle robot rules and complex linking. Many robots (also known as crawlers and spiders) are easily confused by anything beyond a simple URL, so these tests help us identify the more sophisticated ones.
In addition, these tests will tell us how many robots can handle text in ALT attributes and comments, HTML header tags such as Meta Keywords, and more.
To try out this system, we've coded each page with "RTest": more specifically, "RTestGood" for pages that should be indexed successfully and "RTestProblem" for pages that should not be indexed.
NOTE: I know many of these tests are now funky, and have been since I changed servers (if not before). I will be updating them as I have time and energy. Please give me suggestions and comments and offers of help by replying to this blog entry. - Editor
Following Robot Standards
- Robots Test - test whether indexing robots honor robots.txt and the META Robots tag
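As a rough illustration of what "honoring" these two mechanisms involves, here is a minimal Python sketch using only the standard library. The sample robots.txt rules, the `/problem/` path, the `TestBot` user agent, and the function names are all assumptions made up for this example; they are not the actual rules or pages used in the Robots Test.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only, not the test site's real robots.txt.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /problem/
"""


def allowed_by_robots_txt(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


class MetaRobotsParser(HTMLParser):
    """Collects the directives found in <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        # HTMLParser lowercases tag and attribute names, but not values.
        if (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


def meta_robots_directives(html):
    """Return the set of META Robots directives (e.g. 'noindex') in a page."""
    parser = MetaRobotsParser()
    parser.feed(html)
    parser.close()
    return parser.directives
```

A well-behaved robot would check `allowed_by_robots_txt` before requesting a page, then skip indexing when `"noindex"` is in the returned directives and skip the page's links when `"nofollow"` is.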
Following Links
- Dates - examples of problem dates (old, future, dynamic) and some attempts to insert correct dates for search indexing.
- Frames - check how well indexing robots can index framed documents and noframes, and how they display them when found
- Image Maps - some indexing robots will not recognize links in client-side image maps (server-side maps are even worse, and no one will test them for links).
- JavaScript Menu - see if indexing robots recognize JavaScript href menus or follow noscript links
- JavaScript Document.Write - test whether indexing robots can handle complex JavaScript
- Redirect - see how well the indexing robots follow server redirects and META Refresh redirect links.
- Beyond ".html" - will search robots follow links to text pages with different file suffixes, such as .txt, .asp, .cf, .pl, .ssi, .shtml, and .xml?
- Non-text pages - testing whether indexing robots will index binary files such as Acrobat and Microsoft Word (.pdf, .doc, .xls, etc.)
- Non-alphabetic Characters in URLs - links to pages with characters such as !, (), and ~.
- Directory Listings - links to files in a folder automatically generated by the server.
- Relative Links - following both standard and strange relative links.
- Directory Link Depth - how deep into a site will an indexer go? Does it matter whether the directory name is different or the same?
- Protected pages - some pages may be open to search indexing robots if the search admin gives them the right password (realm: protectallow, user name: robot, password: allow), while others are disallowed and should never be indexed.
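For the Relative Links and Directory Link Depth items above, the following Python sketch shows one way a robot might resolve a link against the page it was found on and measure how many directories deep the result sits. The function names and the `example.com` URLs are hypothetical, and counting depth by path segments is just one reasonable definition, not necessarily the one any particular indexer uses.

```python
from urllib.parse import urldefrag, urljoin, urlparse


def resolve_link(page_url, href):
    """Resolve href (relative or absolute) against the page it appeared on.

    The #fragment is dropped because it points inside a page,
    not at a separate document to index.
    """
    absolute = urljoin(page_url, href)
    return urldefrag(absolute).url


def directory_depth(url):
    """Count the directories between the site root and the target document."""
    path = urlparse(url).path
    segments = [segment for segment in path.split("/") if segment]
    # A trailing slash means every segment is a directory;
    # otherwise the last segment is treated as the file name.
    if path.endswith("/") or not segments:
        return len(segments)
    return len(segments) - 1


if __name__ == "__main__":
    base = "http://example.com/a/b/page.html"  # hypothetical page URL
    for href in ("../c.html", "sub/d.html", "/top.html", "page.html#top"):
        target = resolve_link(base, href)
        print(href, "->", target, "depth", directory_depth(target))
```

A robot with a depth limit could compare `directory_depth` against its configured maximum before queuing the resolved URL.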
Indexing Text
- Image Alt Tag Test - see which indexing robots index text in Alt tags for better descriptions of images.
- Comment Test - test if indexing robots index text in comments or follow links in comments.
- Extended Character Codes Test - examples and text of non-English characters in the Roman character set, such as letters with diacritics.
- Meta Tag Data Test - some engines can index and retrieve words in the HTML Meta tags, Dublin Core, and more.
- Detecting Duplicate Pages - some indexing systems will recognize duplicate pages and only display one copy. The trick is finding exact duplicates, rather than those with minor but important differences.
- MP3 File MetaData - for a summary of the issues, see our MP3 Search Report.
- Anchors - testing whether external anchor text is used to index a target page, whether text in anchors is ranked higher than other instances of that text, and whether there's any way to see the closest anchor text as part of search results.
- NoIndex tags
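To make the Alt tag and Comment tests above more concrete, here is a small Python sketch of how an indexer could pull out image alt text and comment text, which are not part of a page's visible body text. The class name, function name, and sample markup are assumptions for illustration; real indexing robots differ in whether and how they do this.

```python
from html.parser import HTMLParser


class HiddenTextCollector(HTMLParser):
    """Gathers image alt text and HTML comment text from a page."""

    def __init__(self):
        super().__init__()
        self.alt_texts = []
        self.comments = []

    def handle_starttag(self, tag, attrs):
        # Only <img> alt attributes are collected in this sketch.
        if tag == "img":
            alt = dict(attrs).get("alt")
            if alt and alt.strip():
                self.alt_texts.append(alt.strip())

    def handle_comment(self, data):
        text = data.strip()
        if text:
            self.comments.append(text)


def collect_hidden_text(html):
    """Return (alt_texts, comments) found in the given HTML string."""
    collector = HiddenTextCollector()
    collector.feed(html)
    collector.close()
    return collector.alt_texts, collector.comments


if __name__ == "__main__":
    # Hypothetical sample markup, not taken from the test pages.
    sample = '<p>Visible <img src="logo.gif" alt="alt text sample"><!-- comment sample --></p>'
    print(collect_hidden_text(sample))
```

An indexer that adds these strings to its index would match searches for words appearing only in alt text or comments; one that ignores them would not, which is the difference these tests are meant to expose.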
Retrieval and Relevance
- Relevance Ranking - looking at how search engines perform relevance ranking -- number of matches, length of file, position on page, meta tags, title and header tags, etc.
Coming Soon
- DHTML Layers
- Virtual Hosts
- Punctuation in URL - check whether the indexing robots follow links which include ? and $ in the URL.
- Dynamic page and database interaction
- CGI pages
- Session keys
- Cookies
- Database
Comment on These Tests
Page Created: 2005-05-11