Individual words (searchable as keywords)
Position on page (higher may be more important)
Paragraph tag (H1 may be more important)
Comments
Alt text
Title
Meta Keywords
Meta Descriptions
Other Metadata (sophisticated engines only)
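A minimal page showing where each of these elements lives (the names and values here are illustrative):

```html
<HTML>
<HEAD>
<TITLE>Orchard Catalog</TITLE>
<META name="keywords" content="apples, strawberries, kiwifruit">
<META name="description" content="A catalog of the fruit we grow.">
</HEAD>
<BODY>
<!-- a comment: some engines index this text too -->
<H1>Fruit We Grow</H1>
<P>Individual words in the body text are searchable as keywords.</P>
<IMG src="img/apples.gif" alt="basket of apples">
</BODY>
</HTML>
```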
Breadth-First Spidering
retrieve all the pages around the starting point before following links further away from the start (sketched below)
distributes the server load among hosts
Depth-First Spidering
follows each chain of links from the first link to its end before backing up to try the next
may be easier to code
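A sketch of the breadth-first strategy in Python; `get_links` is a hypothetical helper that fetches a page and returns the URLs it links to:

```python
from collections import deque

def crawl_breadth_first(start_url, get_links, max_pages=100):
    # A FIFO queue visits pages near the starting point first, which
    # also spreads requests across hosts instead of drilling into one.
    queue = deque([start_url])
    visited = set()
    while queue and len(visited) < max_pages:
        url = queue.popleft()  # popping the newest entry instead would give depth-first
        if url in visited:
            continue
        visited.add(url)
        queue.extend(get_links(url))
    return visited
```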
Spidering Depth
the home page is on level 0
some engines only spider a couple of levels deep
this can be a problem if important data lies several levels down
Server Load
some robots issue requests in rapid succession, which can overload the server
updating - a HEAD request fetches only the HTTP response headers for a page; compare the date and size against the indexed copy
updating - If-Modified-Since (conditional GET) - the server returns the page only if it has changed since the given date (both sketched below)
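Both update checks, sketched with Python's `requests` library (the URL and date are placeholders):

```python
import requests

url = "http://www.domain.com/specials.html"

# HEAD: fetch only the response headers, then compare date and size
# with what the index recorded last time.
head = requests.head(url)
print(head.headers.get("Last-Modified"), head.headers.get("Content-Length"))

# Conditional GET: the server sends the page only if it changed after
# the given date; otherwise it answers 304 Not Modified.
resp = requests.get(url, headers={
    "If-Modified-Since": "Sat, 01 Jan 2000 00:00:00 GMT"})
if resp.status_code == 304:
    print("unchanged - keep the indexed copy")
else:
    print("changed - re-index", len(resp.content), "bytes")
```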
Password-Protected Pages
leave them out of the search
may want to make a special search engine for private information only
a local search engine can show subscription information for restricted pages
Encrypted Data
HTTPS - encryption and authentication for banking information, patient records, student grades
public search engine robots can't deal with encrypted pages (which is good)
local search engines may want to index them (carefully); the index itself must then be secured
SearchEngineWatch SpiderSpotting Chart
www.searchenginewatch.com/webmasters/spiderchart.html
| Task | Entry | Notes |
|---|---|---|
| Allow robots complete access to the server | `User-agent: *`<br>`Disallow:` | `*` means all user agents (robots). Because nothing is disallowed, everything is allowed. |
| Exclude all robots from part of the server | `User-agent: *`<br>`Disallow: /temp/`<br>`Disallow: /private/` | `*` means all user agents. The robots should not visit any pages in these directories. |
| Exclude a single robot | `User-agent: BadBot`<br>`Disallow: /`<br><br>`User-agent: *`<br>`Disallow:` | In this case, the BadBot robot is not allowed to see anything. All other agents (`*`) can see everything. |
| Exclude a robot from a single file | `User-agent: WeirdBot`<br>`Disallow: /links/listing.html`<br><br>`User-agent: *`<br>`Disallow: /temp/`<br>`Disallow: /private/` | This keeps WeirdBot from visiting the listing page in the links directory, while all other robots can see everything except the temp and private directories. |
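How a robot applies these rules, using Python's standard `urllib.robotparser` and assuming www.domain.com serves the "exclude a single robot" file above:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("http://www.domain.com/robots.txt")
rp.read()  # fetch and parse the file

print(rp.can_fetch("BadBot", "http://www.domain.com/any/page.html"))   # False
print(rp.can_fetch("NiceBot", "http://www.domain.com/any/page.html"))  # True
```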
In the HTML <HEAD> sections of individual pages
| Task | Entry | Notes |
|---|---|---|
| Do not index, but follow links | `<meta name="ROBOTS" content="NOINDEX">` | Use this for pages with many links on them, but not much useful data. Because FOLLOW is the default, you don't have to include it. |
| Index, but do not follow links | `<meta name="ROBOTS" content="NOFOLLOW">` | Use this for pages that have useful content but links that may be irrelevant or obsolete. |
| Do not index or follow links | `<meta name="ROBOTS" content="NOINDEX,NOFOLLOW">` | This is for pages that should not be indexed at all. If you put this in every page, the site should not be indexed. |
| Index and follow links | `<meta name="ROBOTS" content="INDEX,FOLLOW">` | This is the default behavior: you don't have to include this tag. |
For framed sites, add the tag on both the FRAMESET page and the individual FRAME pages.
INDEX,NOFOLLOW is particularly useful for indexing local data only.
Image Maps
Robots can follow the HREF links in a client-side image map's AREA tags:

```html
<map name="client-side-map">
  <area shape="circle" coords="91,49,38" href="client-map-circle.html">
  <area shape="rect" coords="180,19,293,78" href="client-map-rect.html">
</map>
```

A server-side map only sends click coordinates to the server, so robots cannot follow it; provide alternate text links:

```html
<a href="map/usa.map"><img src="img/usa.gif" ISMAP></a>
```
The NOFRAMES section gives robots and non-frames browsers access to the framed pages through ordinary links:

```html
<FRAMESET cols="160,*" border=0>
  <FRAME src="/main.html" scrolling="auto" border=10 NAME="main">
  <FRAME src="right.html" scrolling="auto" border=10 NAME="right">
  <FRAME src="bottom.html" scrolling="no" border=0 marginwidth=0 marginheight=0 NAME="bottom">
</FRAMESET>
<NOFRAMES>
<BODY>
  <H1>Welcome to our Site!</H1>
  <H2><A HREF="http://www.domain.com/main.html">Main Page</A></H2>
  <H2>Other Pages</H2>
  <UL>
    <LI><A HREF="trees.html">Trees</A></LI>
    <LI><A HREF="flowers.html">Flowers</A></LI>
    <LI><A HREF="mammals.html">Mammals</A></LI>
    <LI><A HREF="insects.html">Insects</A></LI>
    <LI><A HREF="fish.html">Fish</A></LI>
  </UL>
  <P><A HREF="home.html">Home</A></P>
  <P><A HREF="about.html">About Us</A></P>
</BODY>
</NOFRAMES>
```
```html
<SCRIPT LANGUAGE="JavaScript">
<!--
document.writeln("<H3>JavaScript Example</H3> \
<P>This text is generated by a JavaScript. If you can see it, you are using a \
JavaScript-compatible browser or other HTTP client that contains a JavaScript \
interpreter. For more information, see our \
<A HREF='tech/javascript/'>JavaScript Pages</A>.</P>")
// -->
</SCRIPT>
<NOSCRIPT>
<P>This is only visible to browsers that cannot interpret JavaScript.
For more information, see our
<A HREF="tech/javascript/">JavaScript Pages</A>.</P>
</NOSCRIPT>
```
In the NOSCRIPT section of the code, the link is available for robots and non-JavaScript browsers.
```html
<H3>Fruit We Grow</H3>
<FORM NAME="links">
<P><SELECT NAME="select"
    onChange="if(options[selectedIndex].value) window.location.href=(options[selectedIndex].value)">
  <OPTION VALUE="apples.html" selected>apples</OPTION>
  <OPTION VALUE="../berries/strawberries.html">strawberries</OPTION>
  <OPTION VALUE="../berries/boysenberries.html">boysenberries</OPTION>
  <OPTION VALUE="/exotics/kiwi.html">kiwifruit</OPTION>
  <OPTION VALUE="http://www.domain.com/fruitonline.html">more fruit</OPTION>
</SELECT></P>
<NOSCRIPT>
<P>If you don't have JavaScript, please use these links</P>
<UL>
  <LI><A HREF="apples.html">apples</A></LI>
  <LI><A HREF="../berries/strawberries.html">strawberries</A></LI>
  <LI><A HREF="../berries/boysenberries.html">boysenberries</A></LI>
  <LI><A HREF="/exotics/kiwi.html">kiwifruit</A></LI>
  <LI><A HREF="http://www.domain.com/fruitonline.html">more fruit</A></LI>
</UL>
</NOSCRIPT>
</FORM>
```
Redirection: automatic transfer from one URL to another
Server Redirect Files
```
HTTP/1.1 302 Found
Location: http://www.domain.com/new/specials.html
```
most clients can handle this and will use the new URL
make sure there's also an ordinary link to the new URL for clients and robots that don't follow redirects
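What a robot sees when it requests the old URL, sketched with Python's `requests` library (the old URL here is a hypothetical placeholder):

```python
import requests

# Don't follow the redirect automatically, so the robot can record
# the new location itself.
resp = requests.get("http://www.domain.com/old/specials.html",
                    allow_redirects=False)
if resp.status_code in (301, 302):
    new_url = resp.headers["Location"]
    print("page moved to", new_url)  # re-index under the new URL
```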
META Refresh Redirection
client-based, META tag within a file
<META http-equiv="Refresh" content="10; URL=target.html">
the first number in the content parameter is the delay before the transfer, in seconds
URL parameter is the path of the target page.
include a normal link to the target page as well, for robots and browsers that ignore META refresh.
Names with Extensions Other Than .html and .htm
no extension - HTML files served correctly by the server, just without any extension
.txt - plain text files without HTML markup
.ssi - HTML files containing text inserted by Server-Side Includes.
.shtml - HTML files containing text inserted by Server-Side Includes.
.pl - HTML files generated by Perl scripts on the server.
.cfm - HTML files generated by Allaire ColdFusion on the server.
.asp - HTML files generated by Microsoft Active Server Pages on the server.
.lasso - HTML files generated by the Lasso Web Application Server.
Punctuation in URLs
can include letters, numbers, . (period), ~ (tilde), - (dash), and _ (underscore).
Robots should also handle file paths with the delimiters used in standard URLs:
/ (Slash, delimiting directory hierarchy)
. (Period, delimiting host name elements)
: (Colon, delimiting port numbers)
# (Hash or Pound Sign, delimiting anchor links)
% (Percent Sign, for encoding spaces and other special characters; see the sketch after this list)
Some URLs, usually those involved with dynamic data, have other punctuation marks. These are:
? (Question Mark or Query): defined by the HTML standard for sending form data to the server with the HTTP GET method
$ (Dollar Sign), often used for the same purpose
= (Equals Sign): used in the same standard for control name/value pairs
& (Ampersand): used in the same standard for separating name/value pairs
; (Semicolon): also used for separating parameters
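Percent-encoding in practice, using Python's standard `urllib.parse` (the path is an invented example):

```python
from urllib.parse import quote, unquote

path = "/reports/annual report 2000.html"
encoded = quote(path)            # spaces and other unsafe characters become %XX
print(encoded)                   # /reports/annual%20report%202000.html
print(unquote(encoded) == path)  # True: robots must decode before comparing paths
```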

| Type | Originating file | Link | Points to |
|---|---|---|---|
| simple file name: start at this level and find another document | /test/two.html | `<A HREF="three.html">` | /test/three.html |
| name followed by a / (slash) and a file name: start at this level, go into the directory, and open the document | /level1/four.html | `<A HREF="level2/six.html">` | /level1/level2/six.html |
| / (slash): start at the top level of this host | /test/two.html | `<A HREF="/five.html">` | /five.html |
| ../ (dot, dot, slash): start at the next level up, in the parent directory | /level1/level2/six.html | `<A HREF="../four.html">` | /level1/four.html |
| ./ (dot, slash): start at the current directory (this may be a typo) | /test/two.html | `<A HREF="./three.html">` | /test/three.html |
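A robot resolves each of these forms with standard relative-URL rules; Python's `urllib.parse.urljoin` reproduces the table rows (the host name is illustrative):

```python
from urllib.parse import urljoin

base = "http://www.domain.com/test/two.html"
print(urljoin(base, "three.html"))    # http://www.domain.com/test/three.html
print(urljoin(base, "/five.html"))    # http://www.domain.com/five.html
print(urljoin(base, "./three.html"))  # http://www.domain.com/test/three.html

base = "http://www.domain.com/level1/level2/six.html"
print(urljoin(base, "../four.html"))  # http://www.domain.com/level1/four.html
```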
the part of a URL after the ? is a query, usually passed to a database
http://www.tfaw.com/Level4.html?SEC=HERE&SUBSEC=toys&ITEM=ARR1000000397
generate static text pages from the dynamic data
fake a standard URL format (rewrite query URLs to look like ordinary paths)
use a search engine that handles question marks (see the sketch below)
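A robot that does handle question marks splits the query into name/value pairs; here is a sketch with Python's standard `urllib.parse`, using the example URL above:

```python
from urllib.parse import urlparse, parse_qs

url = "http://www.tfaw.com/Level4.html?SEC=HERE&SUBSEC=toys&ITEM=ARR1000000397"
parts = urlparse(url)
print(parts.path)             # /Level4.html
print(parse_qs(parts.query))  # {'SEC': ['HERE'], 'SUBSEC': ['toys'], 'ITEM': ['ARR1000000397']}
```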