Adventures with Web Search Indexing Robots

How Web and Intranet Search Engines Follow Links to Build Indexes

Presentation to the Information Access Seminar, UC Berkeley SIMS (School of Information Management and Systems) in November 1999.

 

Avi Rappoport
Search Tools Consulting, SearchTools.com

 

see also: SearchTools.com Robots pages


Defining Search Indexing Robots


Following Links


What Robots Extract and Index

  1. Individual words (searchable as keywords)

    1. Position on page (higher may be more important)

    2. Paragraph tag (H1 may be more important)

    3. Comments

    4. Alt text

  2. Title

  3. Meta Keywords

  4. Meta Descriptions

  5. Other Metadata (sophisticated engines only)


Breadth-First Crawling

breadth-first diagram
 


Depth-First Crawling

depth-first diagram
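
As a rough illustration of the two crawl orders named above, here is a minimal Python sketch that walks a toy link graph. The SITE dictionary and the crawl function are hypothetical, made up for this example; a real robot would fetch pages over HTTP and extract links rather than read them from a table.

```python
from collections import deque

# Hypothetical site: each page maps to the links found on it, in page order.
SITE = {
    "/": ["/a.html", "/b.html"],
    "/a.html": ["/a1.html", "/a2.html"],
    "/b.html": ["/b1.html"],
    "/a1.html": [],
    "/a2.html": [],
    "/b1.html": ["/"],  # link back to the home page
}


def crawl(start, breadth_first=True):
    """Return pages in the order a robot would visit them."""
    frontier = deque([start])
    seen = {start}  # never queue the same page twice
    order = []
    while frontier:
        # A queue (take from the left) gives breadth-first order;
        # a stack (take from the right) gives depth-first order.
        page = frontier.popleft() if breadth_first else frontier.pop()
        order.append(page)
        links = SITE.get(page, [])
        # For the stack, push links reversed so the first link is followed first.
        for link in (links if breadth_first else reversed(links)):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order


print(crawl("/"))                       # breadth-first order
print(crawl("/", breadth_first=False))  # depth-first order
```

The only difference between the two orders in this sketch is which end of the frontier the next page is taken from. The seen set keeps the link back to the home page from causing a loop.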


Other Indexing Issues

Spidering Depth

Server Load

Password-Protected Pages

Encrypted Data


User Agents in Web Server Logs

SearchEngineWatch SpiderSpotting Chart

www.searchenginewatch.com/webmasters/spiderchart.html


Robots.txt

 


Robots META Tag

In the HTML <HEAD> sections of individual pages 

Task: Do not index, but follow links
Entry: <meta name="ROBOTS" content="NOINDEX">
Notes: Use this for pages with many links on them, but not much useful data. Because "follow" is the default, you don't have to include it.

Task: Index, but do not follow links
Entry: <meta name="ROBOTS" content="NOFOLLOW">
Notes: Use this for pages which have useful content but links which may be irrelevant or obsolete.

Task: Do not index or follow links
Entry: <meta name="ROBOTS" content="NOINDEX,NOFOLLOW">
Notes: This is for pages which should not be indexed at all. If you put that in every page, the site should not be indexed.

Task: Index and follow links
Entry: <meta name="ROBOTS" content="INDEX,FOLLOW">
Notes: This is the default behavior: you don't have to include this tag.
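
The four cases above can be sketched as a small Python function that turns the content value of a ROBOTS META tag into two decisions. The function name parse_robots_meta is made up for this example, and the sketch assumes only the four directives listed here; real robots may recognize others.

```python
def parse_robots_meta(content):
    """Return (index, follow) for the content value of a ROBOTS META tag."""
    index, follow = True, True  # defaults when a directive is absent
    for directive in content.split(","):
        directive = directive.strip().upper()  # values are not case-sensitive
        if directive == "NOINDEX":
            index = False
        elif directive == "NOFOLLOW":
            follow = False
        elif directive == "INDEX":
            index = True
        elif directive == "FOLLOW":
            follow = True
    return index, follow


for value in ("NOINDEX", "NOFOLLOW", "NOINDEX,NOFOLLOW", "INDEX,FOLLOW"):
    print(value, "->", parse_robots_meta(value))
```

Starting from index and follow both set to true mirrors the note that "INDEX,FOLLOW" is the default and need not be written out.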

 


Details of Following Links and Crawling Sites


Clients that can't read complex links


Image Maps

<map name="client-side-map">
  <area shape="circle" coords="91,49,38"
    href="client-map-circle.html">
  <area shape="rect" coords="180,19,293,78"
    href="client-map-rect.html">
</map>

 

<a href="map/usa.map"><img src="img/usa.gif" ISMAP></a>


Frames and Framesets


<FRAMESET cols="160,*" border=0>
  <FRAME src="/main.html" scrolling="auto" border=10 NAME="main">
  <FRAME src="right.html" scrolling="auto" border=10 NAME="right">
  <FRAME src="bottom.html" scrolling="no" border=0 marginwidth=0 
                          marginheight=0 NAME="bottom">
</FRAMESET>

<NOFRAMES>
  <BODY><H1>Welcome to our Site!</H1>
     <H2><A HREF="http://www.domain.com/main.html">Main Page</A></H2>
     <H2>Other Pages</H2>
     <UL>
       <LI><A HREF="trees.html">Trees</A></LI>
       <LI><A HREF="flowers.html">Flowers</A></LI>
       <LI><A HREF="mammals.html">Mammals</A></LI>
       <LI><A HREF="insects.html">Insects</A></LI>
       <LI><A HREF="fish.html">Fish</A></LI>
     </UL>
     <P><A HREF="home.html">Home</A></P>
     <P><A HREF="about.html">About Us</A></P>
  </BODY>
</NOFRAMES>

 


JavaScript Pages


<SCRIPT LANGUAGE="JavaScript">
     <!-- 
     document.writeln ("<H3>JavaScript Example</H3>\
     <P>This text is generated by a JavaScript. If you can see it, you are using a \
     JavaScript-compatible \
     browser or other HTTP client that contains a JavaScript interpreter. For \
     more information, see our <A HREF='tech/javascript/'> \
      JavaScript Pages</A>. </P> \
     ")
     // -->
</SCRIPT>

<NOSCRIPT>
    <P>This is only visible to browsers that cannot interpret JavaScript. 
    For more information, see our <A HREF="tech/javascript/"> 
    JavaScript Pages</A>. </P>
</NOSCRIPT>


In the NOSCRIPT section of the code, the link is available for robots and non-JavaScript browsers.
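
To illustrate why the NOSCRIPT link matters, here is a minimal Python sketch of a link extractor that, like a robot without a JavaScript interpreter, reads only the HTML markup. It uses the standard html.parser module, which treats the body of a SCRIPT element as plain text. The class and function names are made up for this example, and the sketch looks only at A tags.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect HREF values from A tags, without running any scripts."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # html.parser reports tag and attribute names in lowercase.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return the HREF values a non-JavaScript client would find."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


page = """<SCRIPT LANGUAGE="JavaScript">
document.writeln("<A HREF='script-only.html'>written by script");
</SCRIPT>
<NOSCRIPT><P><A HREF="tech/javascript/">JavaScript Pages</A></P></NOSCRIPT>"""

print(extract_links(page))
```

In this sketch the link written by document.writeln is never seen, because the parser does not look for tags inside the script body. Only the NOSCRIPT link is collected.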


JavaScript Menus and Lists


  <H3>Fruit We Grow</H3>
  
  <FORM NAME="links">
     <P><SELECT NAME="select"
       onChange="if(options[selectedIndex].value)
       window.location.href=(options[selectedIndex].value)">

       <OPTION VALUE="apples.html" selected>apples</OPTION>
       <OPTION VALUE="../berries/strawberries.html">strawberries</OPTION>
       <OPTION VALUE="../berries/boysenberries.html">boysenberries</OPTION>
       <OPTION VALUE="/exotics/kiwi.html">kiwifruit</OPTION>
       <OPTION VALUE="http://www.domain.com/fruitonline.html">more fruit</OPTION>
     </SELECT>
     </P>
     <NOSCRIPT>
	<P>If you don't have JavaScript, please use these links</P>
     <UL>
       <LI><A HREF="apples.html">apples</A></LI>
       <LI><A HREF="../berries/strawberries.html">strawberries</A></LI>
       <LI><A HREF="../berries/boysenberries.html">boysenberries</A></LI>
       <LI><A HREF="/exotics/kiwi.html">kiwifruit</A></LI>
       <LI><A HREF="http://www.domain.com/fruitonline.html">more fruit</A></LI>
     </UL>
     </NOSCRIPT>
   </FORM>

 


Redirected Files

automatic transfer from one URL to another

Server Redirect Files


  HTTP/1.1 302 Found
  Location: http://www.domain.com/new/specials.html


 

META Refresh Redirection

client-based, META tag within a file


  <META http-equiv="Refresh" content="10; URL=target.html">
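
A robot that wants to follow this kind of redirect has to pull the delay and the target out of the content value itself. The following Python sketch shows one way that might look. The function name parse_meta_refresh and the base URL are assumptions made for this example, and the sketch handles only the simple "seconds; URL=target" form shown above.

```python
from urllib.parse import urljoin


def parse_meta_refresh(content, base_url):
    """Return (delay_seconds, absolute_target) from a Refresh content value.

    The target is None when the value holds only a delay.
    """
    delay_part, _, rest = content.partition(";")
    delay = int(delay_part.strip())  # raises ValueError on a malformed delay
    rest = rest.strip()
    target = None
    if rest.lower().startswith("url"):
        _, _, value = rest.partition("=")
        # Resolve the target against the page that contains the META tag.
        target = urljoin(base_url, value.strip().strip("'\""))
    return delay, target


# Hypothetical location of the page carrying the tag.
base = "http://www.domain.com/old/page.html"
print(parse_meta_refresh("10; URL=target.html", base))
```

Because the target may be a relative URL, the sketch resolves it against the address of the page that contains the tag before handing it back to the crawler.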



New Standards


Non-Standard File Names

Names with Extensions Other Than .html and .htm

Punctuation in URLs

File names in URLs can include letters, numbers, . (period), ~ (tilde), - (dash), and _ (underscore).

Robots should also handle file paths with the delimiters used in standard URLs:

Some URLs, usually those involved with dynamic data, have other punctuation marks. These are:


Relative Links

Type: simple file name: start at this level and find another document
Originating file: /test/two.html
Link: <A HREF="three.html">
Points to: /test/three.html

Type: name followed by a / (slash) and a file name: start at this level, go into the directory, and open the document
Originating file: /level1/four.html
Link: <A HREF="level2/six.html">
Points to: /level1/level2/six.html

Type: / (slash): start at the top level of this host
Originating file: /test/two.html
Link: <A HREF="/five.html">
Points to: /five.html

Type: ../ (dot, dot, slash): start at the next level up, in the parent directory
Originating file: /level1/level2/six.html
Link: <A HREF="../four.html">
Points to: /level1/four.html

Type: ./ (dot, slash): start at the current directory (this may be a typo)
Originating file: /test/two.html
Link: <A HREF="./three.html">
Points to: /test/three.html
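
The relative link cases above can be checked with the urljoin function from Python's standard urllib.parse module, which resolves a link against the URL of the originating file. This is only a sketch: the host name is the placeholder used elsewhere in these notes, and the resolve helper is made up for this example.

```python
from urllib.parse import urljoin

HOST = "http://www.domain.com"  # placeholder host

# (originating file, link as written in HREF, path it should point to)
EXAMPLES = [
    ("/test/two.html", "three.html", "/test/three.html"),
    ("/level1/four.html", "level2/six.html", "/level1/level2/six.html"),
    ("/test/two.html", "/five.html", "/five.html"),
    ("/level1/level2/six.html", "../four.html", "/level1/four.html"),
    ("/test/two.html", "./three.html", "/test/three.html"),
]


def resolve(originating_path, link):
    """Resolve a link found in originating_path to an absolute URL."""
    return urljoin(HOST + originating_path, link)


for origin, link, expected in EXAMPLES:
    print(origin, link, "->", resolve(origin, link))
```

A robot that resolves every extracted link this way before adding it to its list of pages to visit would treat "three.html" and "./three.html" as the same page.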



The Site Map Solution, and its Limitations


Dynamic Data


Indexing Dynamic Data


Dynamic Web Applications and Black Holes


Detecting Duplicate Pages


Re-crawling a Site to Update the Index


Tracking Robot Spiders


Conclusions

 


Useful Links

SearchTools.com Robots pages

Web Robots pages

HTML Author's Guide to the Robots META tag

HTML 4.0 robots appendix

Robots Mailing List

Send email to LISTSERV@MAIL2.MCCMEDIA.COM with this line in the message body: SUBSCRIBE robots

UKOLN Robots.txt checker

BotWatch robots.txt checker

BotSpot robots pages

SearchEngineWatch SpiderSpotting Chart

W3C Web Content Accessibility Guidelines

