Before you install any search engine with an indexing spider, you must
make sure it can find the pages on your site. The good news is that cleaning
up your links will improve your site's accessibility to the large public search
engines (such as AltaVista, Google, HotBot and Infoseek), and make it
easier for you to run an automated site mapper.
Robot Spider Compatibility
Indexing spiders follow links from a starting page, so use your
home page if it has good text links, or a site map page otherwise.
The first thing to check is the "robots.txt" file. This
is a standard file for web servers that sits at the root of your site
and excludes robots that are not welcome on the site, or in certain
specific directories (though compliance is voluntary). If you run your own
server, you control this file; otherwise your host's server administrator
controls it.
You want to make sure that this file exists, and that it allows at
least your indexing spider to access your directories. You may need
to negotiate with your web hosting provider on this point, as this
file must be stored in the root folder of the web host.
For more information on this topic, see About Robots.txt.
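As a quick sanity check, robot-exclusion rules can be tested with the parser in Python's standard library. This is only a sketch: the spider name "MySiteSpider" and the example rules are hypothetical.

```python
from urllib import robotparser

# Hypothetical robots.txt contents: every robot is excluded
# from the /private/ directory; the rest of the site is open.
rules = [
    "User-agent: *",
    "Disallow: /private/",
]

rp = robotparser.RobotFileParser()
rp.parse(rules)

# Check whether a spider named "MySiteSpider" may fetch these pages.
print(rp.can_fetch("MySiteSpider", "http://www.example.com/index.html"))          # True
print(rp.can_fetch("MySiteSpider", "http://www.example.com/private/notes.html"))  # False
```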
The other way that page designers can control robots and spiders
is by using the META ROBOTS tag. This is particularly useful if
you have a hosted site and don't want to bother your server administrator.
For example, if you have a directory listing or site map page, you
can tell the spiders to follow the links but not index the text on
the page by placing the following information into the HTML header:
<meta name="robots" content="noindex,follow">.
If you have pages with useful data but inappropriate links, such as
a web calendar page with duplicate links to other calendar pages,
use <meta name="robots" content="index,nofollow">.
For more information, see
About Meta Robots tags.
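For example, a site map page that should pass link value to spiders without itself appearing in search results might begin like this (a sketch; the title and comment are illustrative):

```html
<html>
<head>
<title>Site Map</title>
<!-- let spiders follow the links, but not index this page's text -->
<meta name="robots" content="noindex,follow">
</head>
<body>
<!-- ... links to every page on the site ... -->
</body>
</html>
```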
Good Links and Bad Links
Indexing spiders tend to be pretty dumb. They understand simple
HREF links, but get lost on anything more complex. Spiders and
robots will probably not follow links in:
- image maps (especially server-side image maps)
- redirect and META Refresh tags
- Framesets
- DHTML layers
- ActiveX controls
- JavaScript menus and pages
- Java pages and site maps
- Flash or Shockwave (even if you use the options to generate
HTML text and links)
Check Your Links
To give yourself a spider's-eye view, try a text browser such as Lynx,
or a graphical browser with images and JavaScript turned off and no
plug-ins: this will give you a good view of what the spiders see.
Don't rely on your content-management system to check local links:
it knows too much about the structure of your site and the special formats
you use.
To make sure all your local links work, run a link-checking robot such
as LinkScan
for Windows or Big
Brother for Mac & Unix, or use a service such as NetMechanic.
If these services can follow the links, there's a good chance that your
search indexing robot can do the same.
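To see roughly what such a robot does, here is a minimal sketch in Python using only the standard library (the page content and URLs are made up): it collects plain HREF links, which is about all an indexing spider can reliably follow.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects the href targets of plain <a> tags -- the only kind of
    link most indexing spiders can reliably follow."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_links(base_url, html):
    # Parse the page and resolve each link against the page's URL,
    # the way a link-checking robot would before fetching it.
    parser = LinkExtractor()
    parser.feed(html)
    return [urljoin(base_url, href) for href in parser.links]

page = '<p><a href="/about.html">About</a> <a href="news/">News</a></p>'
print(extract_links("http://www.example.com/", page))
```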
Solution: Supplement Complex Links
If you find you have problems, there are two ways around bad links:
both require work, but they will make the indexing spiders happy.
- Alternate Navigation: add alternate links in <NOSCRIPT>
and <NOFRAMES> tags, lists of the links from image maps, simple
alternate pages for DHTML and Java pages, etc. This should work for
all kinds of robots and spiders.
- Site Page Listing: make a site map or a page with links to
every page on your site. This is hard to maintain and synchronize
with your other changes. You can't use a site mapper application that
relies on a link-following robot, because it will have the same problems
that the search engine spiders have.
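For example, a page whose navigation lives in a JavaScript menu can repeat the links in a NOSCRIPT block (a sketch; the file names are made up):

```html
<script src="menu.js"></script>
<noscript>
<!-- plain HREF links that spiders and text browsers can follow -->
<a href="products.html">Products</a>
<a href="support.html">Support</a>
<a href="contact.html">Contact</a>
</noscript>
```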
Five Advantages for the Price of One
The good news is that all this work will pay off in five ways:
- Your search engine robot spider can find your pages
- The robot spiders for the webwide public search engines such
as HotBot, AltaVista and Google can find your pages
- Robot-based link checkers can check your links
- Robot-based site map creators can find your pages to make a
map
- Your site is now accessible to blind and visually-disabled web
surfers (as described in the Web
Accessibility Initiative), and to those using text-only browsers
and devices such as PDAs.
Search Services and Complex Links
| Name | robots.txt | robots meta tag | Client image maps | Redirects | Meta refresh | Frames | Dynamic pages with "?" | Notes |
|------|------------|-----------------|-------------------|-----------|--------------|--------|------------------------|-------|
| Atomz | yes | no | yes | yes | yes | yes | yes | Must check the "clear index cache" checkbox to have it read robots.txt again. Can index PDF files. |
| FreeFind | yes | no | yes | yes | yes, and indexes source | yes | no | Has custom tags to indicate when not to index. |
| Google | yes | yes | yes | yes | yes, and indexes source if more than 10 seconds | yes | no | |
| IndexMySite | yes, optional | no | no | yes | no | yes | yes | |
| siteLevel | yes | no | yes | yes | yes | yes | yes | Ignores meta robots tag; search admin can exclude & include paths. |
| MiniSearch | yes | yes | yes | no | no | yes | | |
| MondoSearch | yes | yes | yes | yes | yes | yes | yes | Stores the frame context, shows pages framed. Allows search administrator to override robots.txt if necessary. |
| PicoSearch | yes | yes | yes | no | no | yes | yes | |
| NetCreations PinPoint | yes | yes | no | no | no | yes | | |
| SiteMiner | yes | noindex yes, nofollow no | yes | yes | yes | no | yes | |
| Webinator | yes | no | yes | yes | yes | yes | | |