Adventures with Web Search Indexing Robots

How Web and Intranet Search Engines Follow Links to Build Indexes

Presentation to the Information Access Seminar, UC Berkeley SIMS (School of Information Management and Systems), in November 1999.


Avi Rappoport
Search Tools Consulting


see also: Robots pages

Defining Search Indexing Robots

Following Links

What Robots Extract and Index

  1. Individual words (searchable as keywords)

    1. Position on page (higher may be more important)

    2. Paragraph tag (H1 may be more important)


    4. Alt text

  2. Title

  3. Meta Keywords

  4. Meta Descriptions

  5. Other Metadata (sophisticated engines only)
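The extraction steps listed above can be sketched with Python's standard `html.parser` module. This is only an illustration, not how any particular search engine works: the class name, its field names, and the sample page are all assumptions made for the example.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class IndexableTextParser(HTMLParser):
    """Collects the items a simple indexing robot might keep (illustrative)."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}       # e.g. {"keywords": "...", "description": "..."}
        self.headings = []   # (tag, text) pairs for H1-H6
        self.alt_text = []   # ALT attributes of images
        self.words = []      # body words in page order (index = position)
        self._inside = None  # title or heading tag currently open, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name") and attrs.get("content"):
            self.meta[attrs["name"].lower()] = attrs["content"]
        elif tag == "img" and attrs.get("alt"):
            self.alt_text.append(attrs["alt"])
        elif tag == "title" or tag in HEADING_TAGS:
            self._inside = tag

    def handle_endtag(self, tag):
        if tag == self._inside:
            self._inside = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._inside == "title":
            self.title += text
            return
        if self._inside in HEADING_TAGS:
            self.headings.append((self._inside, text))
        self.words.extend(text.split())


# A made-up page, used only to exercise the parser.
page = """<html><head><title>Orchard News</title>
<meta name="Keywords" content="apples, pears">
<meta name="description" content="News from the orchard.">
</head><body><h1>Harvest dates</h1>
<p>Picking starts soon.</p><img src="tree.gif" alt="apple tree"></body></html>"""

parser = IndexableTextParser()
parser.feed(page)
parser.close()
print(parser.title, parser.meta, parser.headings, parser.alt_text, parser.words)
```

A real robot would go on to weight these fields differently, for example by treating title and heading words as more important than body words.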

Breadth-First Crawling

breadth-first diagram
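As a rough sketch of breadth-first crawling, the function below visits every page linked from the start page before going one level deeper. The link graph is a hypothetical in-memory dictionary standing in for real HTTP fetches, and the function and page names are assumptions for illustration.

```python
from collections import deque


def crawl_breadth_first(start, links):
    """Return pages in breadth-first visiting order.

    `links` maps each page to the list of pages it links to.
    """
    order = []
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()          # oldest discovered page first
        order.append(page)
        for target in links.get(page, []):
            if target not in seen:      # never queue the same page twice
                seen.add(target)
                queue.append(target)
    return order


# Hypothetical site: the home page links to two sections, each with subpages.
SITE = {
    "index.html": ["a.html", "b.html"],
    "a.html": ["a1.html", "a2.html"],
    "b.html": ["b1.html", "index.html"],
}

bfs_order = crawl_breadth_first("index.html", SITE)
print(bfs_order)
```

With this graph, both section pages are visited before any of their subpages.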

Depth-First Crawling

depth-first diagram
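Depth-first crawling can be sketched the same way, with a stack in place of a queue: the robot follows the first link on a page all the way down before returning to the next link. The link graph and names here are hypothetical and stand in for real page fetches.

```python
def crawl_depth_first(start, links):
    """Return pages in depth-first visiting order.

    `links` maps each page to the list of pages it links to.
    """
    order = []
    seen = set()
    stack = [start]
    while stack:
        page = stack.pop()              # most recently discovered page first
        if page in seen:
            continue
        seen.add(page)
        order.append(page)
        # Push in reverse so the first link on the page is followed first.
        stack.extend(reversed(links.get(page, [])))
    return order


# Hypothetical site: the home page links to two sections, each with subpages.
SITE = {
    "index.html": ["a.html", "b.html"],
    "a.html": ["a1.html", "a2.html"],
    "b.html": ["b1.html", "index.html"],
}

dfs_order = crawl_depth_first("index.html", SITE)
print(dfs_order)
```

With this graph, the whole first section is visited before the second section page is opened.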

Other Indexing Issues

Spidering Depth

Server Load

Password-Protected Pages

Encrypted Data

User Agents in Web Server Logs

SearchEngineWatch SpiderSpotting Chart



Robots META Tag

In the HTML <HEAD> sections of individual pages 




Do not index, but follow links

<meta name="ROBOTS" content="NOINDEX">

Use this for pages with many links on them, but not much useful data. Because "follow" is the default, you don't have to include it.

Index, but do not follow links

<meta name="ROBOTS" content="NOFOLLOW">

Use this for pages which have useful content but links which may be irrelevant or obsolete.

Do not index or follow links

<meta name="ROBOTS" content="NOINDEX,NOFOLLOW"> 

This is for pages which should not be indexed at all. If you put this tag in every page, robots that honor it should not index any of the site.

Index and follow links

<meta name="ROBOTS" content="INDEX,FOLLOW">

This is the default behavior: you don't have to include this tag.
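The four cases above reduce to two flags, index and follow, which both default to true. A minimal sketch of how a robot might read the tag follows; the class and function names are assumptions, and only the four values discussed here are handled.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Reads the ROBOTS META tag; both flags default to True."""

    def __init__(self):
        super().__init__()
        self.index = True
        self.follow = True

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag != "meta" or (attrs.get("name") or "").lower() != "robots":
            return
        for directive in (attrs.get("content") or "").lower().split(","):
            directive = directive.strip()
            if directive == "noindex":
                self.index = False
            elif directive == "nofollow":
                self.follow = False
            # "index" and "follow" restate the defaults, so nothing changes.


def robots_directives(html_text):
    """Return (index, follow) for one page of HTML."""
    parser = RobotsMetaParser()
    parser.feed(html_text)
    parser.close()
    return parser.index, parser.follow


print(robots_directives('<meta name="ROBOTS" content="NOINDEX">'))
print(robots_directives('<meta name="ROBOTS" content="NOINDEX,NOFOLLOW">'))
```

A page with no ROBOTS tag comes back as index and follow both true, matching the default behavior.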


Details of Following Links and Crawling Sites

Clients that can't read complex links

Image Maps

<map name="client-side-map">
  <area shape="circle" coords="91,49,38">
  <area shape="rect" coords="180,19,293,78">
</map>


<a href="map/"><img src="img/usa.gif" ISMAP></a>

Frames and Framesets

<FRAMESET cols="160,*" border=0>
  <FRAME src="/main.html" scrolling="auto" border=10 NAME="main">
  <FRAME src="right.html" scrolling="auto" border=10 NAME="right">
  <FRAME src="bottom.html" scrolling="no" border=0 marginwidth=0 
                          marginheight=0 NAME="bottom">
  <NOFRAMES>
  <BODY><H1>Welcome to our Site!</H1>
     <H2><A HREF="">Main Page</A></H2>
     <H2>Other Pages</H2>
     <UL>
       <LI><A HREF="trees.html">Trees</A></LI>
       <LI><A HREF="flowers.html">Flowers</A></LI>
       <LI><A HREF="mammals.html">Mammals</A></LI>
       <LI><A HREF="insects.html">Insects</A></LI>
       <LI><A HREF="fish.html">Fish</A></LI>
     </UL>
     <P><A HREF="home.html">Home</A></P>
     <P><A HREF="about.html">About Us</A></P>
  </BODY>
  </NOFRAMES>
</FRAMESET>
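A robot that copes with image maps and frames has to look beyond ordinary anchor tags. The sketch below, which uses Python's standard `html.parser`, collects link targets from A, AREA, and FRAME tags; the class name, the tag table, and the sample markup are assumptions for illustration, not a description of any specific robot.

```python
from html.parser import HTMLParser

# Which attribute carries the link target for each tag the robot understands.
LINK_ATTRIBUTES = {"a": "href", "area": "href", "frame": "src"}


class LinkExtractor(HTMLParser):
    """Collects link targets from anchors, image-map areas, and frames."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        wanted = LINK_ATTRIBUTES.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value:   # skip empty or missing targets
                self.links.append(value)


# Made-up markup combining the three link forms.
sample = """<FRAMESET cols="160,*">
  <FRAME src="main.html" NAME="main">
  <FRAME src="right.html" NAME="right">
</FRAMESET>
<MAP NAME="client-side-map">
  <AREA SHAPE="rect" COORDS="0,0,10,10" HREF="map/west.html">
</MAP>
<A HREF="trees.html">Trees</A>"""

extractor = LinkExtractor()
extractor.feed(sample)
extractor.close()
print(extractor.links)
```

A robot that only recognizes the `a` entry in the table would miss the frame and image-map targets entirely, which is why plain links remain the safest way to expose pages.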


JavaScript Pages

<SCRIPT LANGUAGE="JavaScript">
<!--
     document.writeln ("<H3>JavaScript Example</H3>\
     <P>This text is generated by a JavaScript. If you can see it, you are using a \
     JavaScript-compatible \
     browser or other HTTP client that contains a JavaScript interpreter. For \
     more information, see our <A HREF='tech/javascript/'> \
      JavaScript Pages</A>. </P>");
     // -->
</SCRIPT>
<NOSCRIPT>
    <P>This is only visible to browsers that cannot interpret JavaScript. 
    For more information, see our <A HREF="tech/javascript/"> 
    JavaScript Pages</A>. </P>
</NOSCRIPT>

In the NOSCRIPT section of the code, the link is available for robots and non-JavaScript browsers.

JavaScript Menus and Lists

  <H3>Fruit We Grow</H3>
  <FORM NAME="links">
     <P><SELECT NAME="select">
       <OPTION VALUE="apples.html" selected>apples</OPTION>
       <OPTION VALUE="../berries/strawberries.html">strawberries</OPTION>
       <OPTION VALUE="../berries/boysenberries.html">boysenberries</OPTION>
       <OPTION VALUE="/exotics/kiwi.html">kiwifruit</OPTION>
       <OPTION VALUE="">more fruit</OPTION>
     </SELECT></P>
  </FORM>
  <NOSCRIPT>
     <P>If you don't have JavaScript, please use these links</P>
     <UL>
       <LI><A HREF="apples.html">apples</A></LI>
       <LI><A HREF="../berries/strawberries.html">strawberries</A></LI>
       <LI><A HREF="../berries/boysenberries.html">boysenberries</A></LI>
       <LI><A HREF="/exotics/kiwi.html">kiwifruit</A></LI>
       <LI><A HREF="">more fruit</A></LI>
     </UL>
  </NOSCRIPT>


Redirected Files

automatic transfer from one URL to another

Server Redirect Files

  HTTP/1.1 302 Found
  Location:


META Refresh Redirection

client-based, META tag within a file

  <META http-equiv="Refresh" content="10; URL=target.html">
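A robot that wants to follow this kind of redirect has to pull the delay and the target out of the `content` attribute itself. The following is a minimal sketch under the assumption that the value has the usual `seconds; URL=target` shape; the function name and the regular expression are illustrative only.

```python
import re

# Seconds, optionally followed by "; URL=target" (case-insensitive).
REFRESH_PATTERN = re.compile(
    r"^\s*(\d+)\s*(?:;\s*url\s*=\s*(.+?))?\s*$", re.IGNORECASE
)


def parse_meta_refresh(content):
    """Return (delay_seconds, target) or None if the value is not understood.

    `target` is None when the tag only reloads the current page.
    """
    match = REFRESH_PATTERN.match(content)
    if match is None:
        return None
    delay = int(match.group(1))
    target = match.group(2)
    if target:
        target = target.strip("'\"")   # some pages quote the URL
    return delay, target


print(parse_meta_refresh("10; URL=target.html"))
```

The target is usually a relative URL, so it still has to be resolved against the address of the page that contained the tag.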

New Standards

Non-Standard File Names

Names with Extensions Other Than .html and .htm

Punctuation in URLs

File names can include letters, numbers, . (period), ~ (tilde), - (dash), and _ (underscore).

Robots should also handle file paths with the delimiters used in standard URLs:

Some URLs, usually those involved with dynamic data, have other punctuation marks. These are:

Relative Links


originating file


points to

simple file name: start at this level and find another document

<A HREF="three.html">

name followed by a / (slash) and a file name: start at this level, go into the directory, and open the document

<A HREF="level2/six.html">

/ (slash): start at the top level of this host

<A HREF="/five.html">

../ (dot, dot, slash): start at the next level up, in the parent directory

<A HREF="../four.html">

./ (dot, slash): start at the current directory (this may be a typo)

<A HREF="./three.html">
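The resolution rules above match what Python's standard `urllib.parse.urljoin` does, so they can be checked directly. The originating URL below is a hypothetical example (the host and the `level1` directory are assumptions); the relative links are the ones from the examples above.

```python
from urllib.parse import urljoin

# Hypothetical originating file, one directory below the top of the host.
ORIGIN = "http://www.example.com/level1/one.html"

RELATIVE_LINKS = [
    "three.html",        # simple file name
    "level2/six.html",   # directory name, slash, file name
    "/five.html",        # leading slash: top level of this host
    "../four.html",      # parent directory
    "./three.html",      # current directory
]

resolved = {link: urljoin(ORIGIN, link) for link in RELATIVE_LINKS}

for link, absolute in resolved.items():
    print(f"{link:18} -> {absolute}")
```

Note that `three.html` and `./three.html` resolve to the same address, so a robot should treat them as one page.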


The Site Map Solution, and its Limitations

Dynamic Data

Indexing Dynamic Data

Dynamic Web Applications and Black Holes

Detecting Duplicate Pages
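One simple way a robot might detect duplicates is to fingerprint each page's content and compare fingerprints, so that the same document reached through two URLs is indexed once. This is a sketch of that one approach, not a claim about how any given engine does it; the function names, the normalization (whitespace and case), and the sample pages are assumptions.

```python
import hashlib


def content_fingerprint(html_text):
    """Hash the page after collapsing whitespace and lowercasing."""
    normalized = " ".join(html_text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_duplicates(pages):
    """Map each duplicate URL to the first URL seen with the same content.

    `pages` maps URL to page text, in crawl order.
    """
    first_seen = {}
    duplicates = {}
    for url, text in pages.items():
        key = content_fingerprint(text)
        if key in first_seen:
            duplicates[url] = first_seen[key]
        else:
            first_seen[key] = url
    return duplicates


# Made-up pages: the first two differ only in trailing whitespace.
PAGES = {
    "http://www.example.com/": "<p>Hello</p>",
    "http://www.example.com/index.html": "<p>Hello</p>\n",
    "http://www.example.com/other.html": "<p>Other</p>",
}

duplicates = find_duplicates(PAGES)
print(duplicates)
```

An exact hash only catches pages that are identical after normalization; pages that differ by a date stamp or a counter would need a fuzzier comparison.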

Re-crawling a Site to Update the Index

Tracking Robot Spiders



Useful Links

Robots pages

Web Robots pages

HTML Author's Guide to the Robots META tag

HTML 4.0 robots appendix

Robots Mailing List

To subscribe, send email to LISTSERV@MAIL2.MCCMEDIA.COM with this line in the message text: SUBSCRIBE robots

UKOLN Robots.txt checker

BotWatch robots.txt checker

BotSpot robots pages

SearchEngineWatch SpiderSpotting Chart

W3C Web Content Accessibility Guidelines
