
Checking Links and Pages


Before you install any search engine with an indexing spider, you must make sure it can find the pages on your site. The good news is that cleaning up your links will improve your site's visibility to the large public search engines (such as AltaVista, Google, HotBot and Infoseek), and make it easier for you to run an automated site mapper.

Robot Spider Compatibility

Indexing spiders follow links from a starting page: use your home page if it has good text links, or a site map page otherwise.

Whole sites: Robots.txt

The first thing to check is the "robots.txt" file. This is a standard file for web servers that sits at the root of your site and tells robots which areas are off-limits, either the whole site or specific directories (though compliance is voluntary). If you run your own server, you control this file; otherwise your host server administrator controls it.

You want to make sure that this file exists, and that it allows at least your indexing spider to access your directories. You may need to negotiate with your web hosting provider on this point, as this file must be stored in the root folder of the web host.
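
For example, a robots.txt that leaves the site open to all robots but fences off one directory looks like this (a minimal sketch; the directory name is hypothetical):

    # Rules for all robots (compliance is voluntary)
    User-agent: *
    # Keep spiders out of one private directory;
    # everything else remains open for indexing
    Disallow: /private/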

For more information on this topic, see About Robots.txt

Individual Pages: META ROBOTS tag

The other way that page designers can control robots and spiders is by using META ROBOTS tags. These are particularly useful if you have a hosted site and don't want to bother your server administrator.

For example, if you have a directory listing or site map page, you can tell the spiders to follow the links but not index the text on the page by placing the following information into the HTML header: <meta name="robots" content="noindex,follow">. If you have pages with useful data but inappropriate links, such as a web calendar page with duplicate links to other calendar pages, use <meta name="robots" content="index,nofollow">.
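
As a sketch of where the tag goes (the page title is hypothetical), the header of such a site map page might look like this:

    <head>
      <title>Site Map</title>
      <!-- spiders may follow the links but should not index the text -->
      <meta name="robots" content="noindex,follow">
    </head>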

For more information, see About Meta Robots tags.

Good Links and Bad Links

Indexing spiders tend to be pretty dumb. They know about simple HREF links, but get lost on anything more complex. Spiders and robots will probably not follow links in:

  • image maps (especially server-side image maps)
  • redirect and META Refresh tags
  • Framesets
  • DHTML layers
  • ActiveX controls
  • JavaScript menus and pages
  • Java pages and site maps
  • Flash or Shockwave (even if you use the options to generate HTML text and links)

Check Your Links

To give yourself a spider's-eye view, try a text browser such as Lynx, or a graphical browser with images and JavaScript turned off and no plug-ins: what you see then is close to what the spiders see.

Don't rely on your content-management system to check local links: it knows too much about the structure of your site and the special formats you use.

To make sure all your local links work, run a link-checking robot such as LinkScan for Windows or Big Brother for Mac & Unix, or use a service such as NetMechanic. If these services can follow the links, there's a good chance that your search indexing robot can do the same.
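
If you are curious what such a link-checking robot actually does, here is a minimal sketch in Python (the starting URL is a placeholder, and this is an illustration, not one of the tools above). It follows only plain HREF links, which is exactly why the complex link types listed earlier get missed:

    #!/usr/bin/env python3
    # Minimal link-checking spider: follows only plain <a href> links,
    # just as a simple indexing robot would.
    from html.parser import HTMLParser
    from urllib.error import HTTPError, URLError
    from urllib.parse import urldefrag, urljoin, urlparse
    from urllib.request import urlopen

    START = "http://www.example.com/"  # placeholder starting page

    class HrefCollector(HTMLParser):
        """Collect the HREF attribute of every plain <a> tag."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(start):
        site = urlparse(start).netloc
        seen, queue = set(), [start]
        while queue:
            url, _ = urldefrag(queue.pop())   # drop #fragments
            if url in seen or urlparse(url).netloc != site:
                continue                      # stay on our own site
            seen.add(url)
            try:
                with urlopen(url) as page:
                    if "text/html" not in page.headers.get("Content-Type", ""):
                        continue              # don't parse images, PDFs, etc.
                    html = page.read().decode("utf-8", errors="replace")
            except (HTTPError, URLError) as err:
                print(f"BROKEN  {url}  ({err})")
                continue
            parser = HrefCollector()
            parser.feed(html)
            queue.extend(urljoin(url, link) for link in parser.links)
        print(f"Checked {len(seen)} pages on {site}")

    if __name__ == "__main__":
        crawl(START)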

Solution: Supplement Complex Links

If you find you have problems, there are two ways around bad links: both require work, but they will make the indexing spiders happy.

  • Alternate Navigation: add alternate links in <NOSCRIPT> and <NOFRAMES> tags, lists of the links from image maps, simple alternate pages for DHTML and Java pages, and so on (see the sketch after this list). This should work for all kinds of robots and spiders.

  • Site Page Listing: make a site map or a page with links to every page on your site. This is hard to maintain and keep synchronized with your other changes. You can't use a site mapper application that relies on a link-following robot, because it will have the same problems that the search engine spiders have.
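
To illustrate the first approach (the file names are hypothetical), a JavaScript menu can be paired with a plain-link fallback that any robot, spider, or text browser can follow:

    <!-- the scripted menu, which spiders cannot follow -->
    <script type="text/javascript" src="menu.js"></script>
    <!-- plain HREF fallback for robots, spiders and text browsers -->
    <noscript>
      <a href="products.html">Products</a> |
      <a href="support.html">Support</a> |
      <a href="sitemap.html">Site Map</a>
    </noscript>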

Five Advantages for the Price of One

The good news is that all this work will pay off in five ways:

  1. Your search engine robot spider can find your pages.
  2. The robot spiders for the webwide public search engines such as HotBot, AltaVista and Google can find your pages.
  3. Robot-based link checkers can check your links.
  4. Robot-based site map creators can find your pages to make a map.
  5. Your site is now accessible to blind and visually-disabled web surfers (as described in the Web Accessibility Initiative), and to those using text browsers or small devices such as PDAs.

Search Services and Complex Links

For each service, the list below shows whether it obeys robots.txt and the robots meta tag, and whether its spider can follow client-side image maps, re-directs, META Refresh tags, frames, and dynamic pages with "?" in the URL.

  • Atomz: robots.txt yes; robots meta tag no; image maps yes; re-directs yes; meta refresh yes; frames yes; dynamic "?" pages yes. Notes: you must check the "clear index cache" checkbox to have it read robots.txt again; it can index PDF files.

  • FreeFind: robots.txt yes; robots meta tag no; image maps yes; re-directs yes; meta refresh yes (and indexes the source page); frames yes; dynamic "?" pages no. Notes: has custom tags to indicate when not to index.

  • Google: robots.txt yes; robots meta tag yes; image maps yes; re-directs yes; meta refresh yes (and indexes the source page if the delay is more than 10 seconds); frames yes; dynamic "?" pages no.

  • IndexMySite: robots.txt yes (optional); robots meta tag no; image maps no; re-directs yes; meta refresh no; frames yes; dynamic "?" pages yes.

  • siteLevel: robots.txt yes; robots meta tag no; image maps yes; re-directs yes; meta refresh yes; frames yes; dynamic "?" pages yes. Notes: ignores the meta robots tag; the search administrator can exclude and include paths.

  • MiniSearch: robots.txt yes; robots meta tag yes; image maps yes; re-directs no; meta refresh no; frames yes.

  • MondoSearch: robots.txt yes; robots meta tag yes; image maps yes; re-directs yes; meta refresh yes; frames yes; dynamic "?" pages yes. Notes: stores the frame context and shows pages framed; allows the search administrator to override robots.txt if necessary.

  • PicoSearch: robots.txt yes; robots meta tag yes; image maps yes; re-directs no; meta refresh no; frames yes; dynamic "?" pages yes.

  • NetCreations PinPoint: robots.txt yes; robots meta tag yes; image maps no; re-directs no; meta refresh no; frames yes.

  • SiteMiner: robots.txt yes; robots meta tag obeys noindex but not nofollow; image maps yes; re-directs yes; meta refresh yes; frames no; dynamic "?" pages yes.

  • Webinator: robots.txt yes; robots meta tag no; image maps yes; re-directs yes; meta refresh yes; frames yes.
