

Web Admin's Guide to Site Search Tools

How to Choose, Implement and Maintain your Web Site Search Tools.


There's a paradox: the more information your site has, the more useful it is -- and the harder it is to navigate! No matter how well you design your site navigation elements, visitors will need other ways to find what they're looking for. Site search tools are a powerful and familiar way to provide that access. Visitors just type their words into a form, press the Search button, and get a list of all the documents on your site that match those words.

Luckily, you don't have to write this yourself. There are many site search tools available, for almost every platform, web server and site you can imagine. They range from free to very expensive, from easy graphic interfaces to compile-it-yourself. The information here will give you a head start in choosing the right site search tool for your site.




Definitions: Parts of a Local Site Search Tool

Search Engine
The program (CGI, server module or separate server) that accepts the request from the form or URL, searches the index, and returns the results page to the server.
Search Index File
Created by the Search Indexer program, this file stores the data from your site in a special index or database, designed for very quick access. Depending on the indexing algorithm and the size of your site, this file can become very large. The file must be updated often, or it will fall out of sync with your pages and return obsolete results.
Search Forms
HTML interface to the site search tool, provided for visitors to enter their search terms and specify their preferences for the search. Some tools provide pre-built forms.
Search Results Listing
HTML page listing the pages which contain text matching the search term(s). These are sorted in some kind of relevance order, usually based on the number of times the search terms appear, and whether they're in a title or header. Most results listings include the title of the page and a summary (the Meta Description data, the first few lines of the page, or the most important text). Some also include the date modified, file size, and URL. The format of this is often defined by the site search tool, but may be modified in some ways.

As you can imagine, setting all this up takes some time and effort. But before you can do so, you must choose the best site search tool, and that requires some design choices.


Preparing A Site for Searching

Physical Requirements

Site search tools will require additional disk space and processing power. Search indexes never get smaller, so be sure there is space to spare.

In addition, you must plan to update the search index soon after you change the files, so that searches will locate the correct data. Happily, most site search tools provide an automatic update scheduler.

Preparing the Pages

When someone searches your site, the results listing is very different from the pages themselves. The list usually contains page titles and some kind of text, either the Meta Description data, the first few lines of the page, or a programmatically generated summary of the most important text. In addition, the listings are sorted by the search engine in order of relevance, according to its particular algorithm.

You can present your data well and help your visitors find what they're looking for by keeping search results in mind when you edit your pages. Note that these improvements work for both local and webwide search tools: the work you do will make your pages appear better in any search results.

Page Titles

The titles are the main element in a result listing, so always title your pages carefully. Give a little context as well as the specific topic of the page, and always make sure the spelling is correct. In addition, most search engines use the existence of a word in a title as a clue that the page is a good match for searches on that word, and will rank the page high up in the results list.

For example, if your site is about native plants, use "Native Plant Directory: California Live Oak" instead of just "Live Oaks" as your title. They're equally accurate, but the longer title tells your visitors what to expect on that page when they look at a results listing: it's not about Southern Live Oaks, and it's not about growing or protecting them.

Meta Descriptions

You should also use the Meta Description tag to summarize the contents of each page. Many local and webwide search engines will display this as part of their results, so it provides you an opportunity to present the page in its best light. This is easier than it looks -- you'll find that many of your pages can use very similar descriptions with just the specific topic words changed.

An example of a good Meta Description for the Live Oak page might be:

<META NAME="description" CONTENT="Description of the California Live Oak with pictures, 
range map and growth patterns.">

As you can tell, creating a description of other pages, such as the Coast Redwood or the Douglas Iris, would be extremely easy. Use the same text and change the plant name, adding or removing the other parts depending on the contents of the page.

Meta Keywords

Keywords are also an important part of your pages. They allow search engines to identify the most important elements of the page and to rank the results so that the most relevant pages are at the top. You can also include common misspellings or other words that may not appear anywhere on the page. A good set of keywords encapsulates the specific topics the page covers.

An example of Meta Keywords for the Live Oak page would be:

<META NAME="keywords" CONTENT="California Live Oaks, Coast Live Oak, liveoak, 
Quercus Agrifolia, oak woodlands, range map, native plants, native trees">

This describes the topics on the page and means that the page will be retrieved if someone searches on any of these terms, even if they do not appear in the text.

Meta Keywords help search engines define the relevancy of a match. If the word "white" is anywhere in the text, the search engine will retrieve the Live Oak page on a search for "California White Oak". But because the word is not in the Meta Keywords, it can rank this page lower than others which have "white" as a keyword.

Headings

Many search engines also use headings to rank a page in relevance for a particular search. They assume that words in headings are more important than words in the text, so the pages are more relevant to that search.

For example, if you search for "Oak" and "Range", a page with both of those words marked up in HTML header tags will be ranked higher than pages where those words appear only in the body text.

Consider vocabulary when you create pages, and think of your headers as small descriptions of those sections.


Choosing a Site Search Tool


Site Search Engine Issues

The search engine is the application which searches the data and returns the results to the client. This usually means creating an HTML page in the specified format.

Most search engines search within an index, created by an Indexer application. A few just search the files in real-time, but that can get very slow.

To send a search to the search engine, most systems include forms. The site visitor enters their search terms in a text field, and may select appropriate settings in the form. When they click the Submit button, the browser sends that data to the server, which passes it to the search engine application.

To understand how a search engine works, you may want to look at a very simple case, such as the Perl indexer and search engine described in the WebMonkey article Roll Your Own Search Engine.
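
As an even more minimal illustration (not the WebMonkey code, just a sketch in Python, assuming an index that maps each word to the set of pages containing it), the heart of a search engine can be as small as this:

def search(index, query):
    # "index" maps each word to the set of page URLs containing it.
    # Rank pages by how many of the query words they contain.
    scores = {}
    for word in query.lower().split():
        for page in index.get(word, set()):
            scores[page] = scores.get(page, 0) + 1
    return sorted(scores, key=scores.get, reverse=True)

A real engine adds phrase matching, relevance weighting for titles and headings, and result formatting, but the lookup-and-rank loop is the core.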

Types of Site Search Engines

CGI Programs

The Common Gateway Interface (CGI) standard allows a web server to communicate with external programs.

Most site search CGIs are invoked by a site visitor filling in data and clicking a Search or Submit button on an HTML form. They take the data from a form as parameters, search for the terms, limit the results according to any other settings, and return the result list as an HTML page.

CGI programs can be written in everything from C to Perl to AppleScript, depending on the web server and the platform. Many CGIs are portable from Unix to Windows and Macs, depending on the language and libraries they use. CGIs are compatible with many different web servers, but there is some overhead in sending the data back and forth, and some cases where the CGI programs can become overwhelmed. See also Plug-Ins.

For more information on CGI concepts, see the CGI overview at NCSA.
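
Here is a minimal Python sketch of that flow, assuming a GET form whose text field is named "query"; the search function here is just a placeholder standing in for a real index lookup:

#!/usr/bin/env python3
# Minimal CGI sketch: read the form data, run the search, return an HTML results page.
import os
from urllib.parse import parse_qs

def search(terms):
    # Placeholder: a real CGI would consult the search index here.
    return ["/plants/live-oak.html"] if "oak" in terms else []

form = parse_qs(os.environ.get("QUERY_STRING", ""))   # GET form data arrives here
terms = form.get("query", [""])[0].lower().split()

print("Content-Type: text/html")                      # HTTP header, then a blank line
print()
print("<html><body><h1>Search Results</h1><ul>")
for url in search(terms):
    print('<li><a href="%s">%s</a></li>' % (url, url))
print("</ul></body></html>")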

Perl Scripts

Perl is a scripting language, and is not compiled to object binary like C or Pascal. It has its own syntax and libraries, and communicates with web servers using the CGI standard. You can use Perl scripts on most platforms and with most web servers.

Several web site search tools are written in Perl: see the Perl listing for details.

For more information, see the Perl Institute.

Server Plug-Ins

For better data interchange, less overhead and more flexibility, web server companies have defined APIs (Application Programmer Interfaces) to their servers. This allows third-party developers to create modules for the servers which run inside the server process.

Several web site search tools are written to various server APIs. They are rarely portable and generally compiled to binary object code.

Java Applications

Applications written in the Java language, which run in the Java Virtual Machine. Applets are small Java applications which run inside the browser program.

Java Servlets

Applications written in Java using the Java Servlet API. Many web servers now exchange data with Java applications using this interface, much like the CGI system. Because Java is designed to be cross-platform, many of the Java Servlets can run almost anywhere.


Search Servers

Some search engines run as separate servers. The form data is passed as part of the URL, just as with a CGI request, but the search engine application runs as a separate HTTP server on a different machine. This reduces the load on the main web server substantially.
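
A minimal sketch of such a server, using Python's standard http.server module with an invented two-word index, might look like this:

#!/usr/bin/env python3
# Stand-alone search server sketch: it answers search requests itself,
# so the main web server never handles them. The index contents are invented.
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs

INDEX = {"oak": ["/plants/live-oak.html"], "redwood": ["/plants/coast-redwood.html"]}

class SearchHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        query = parse_qs(urlparse(self.path).query).get("query", [""])[0].lower()
        hits = sorted({page for word in query.split() for page in INDEX.get(word, [])})
        body = "<html><body><ul>%s</ul></body></html>" % "".join(
            '<li><a href="%s">%s</a></li>' % (p, p) for p in hits)
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(body.encode("utf-8"))

if __name__ == "__main__":
    HTTPServer(("", 8080), SearchHandler).serve_forever()

The search form on the main site would simply point its ACTION at this server's address, so the main web server only has to serve the form page itself.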

Compatibility

Search Options


Site Search Indexing

The Search Indexer is the application which reads the text of the documents to be searched and stores them in an efficient searchable form usually called the index (Microsoft calls it a "catalog").

Web site indexers must be able to save the index file in a directory where the search engine can locate it when a site visitor wants to search. Remote search engines store the index files on their own servers, where they are used by the search engine when the visitor starts searching.

Local File Indexers

Local File Indexers locate the files to index by following the directory structure of the hard drive, usually starting with the web server root directory. They will index files based on their location in the directory, rather than following links. Most local file indexers allow you to limit the indexing by file name, type, extension, and/or location.
Updating Local File Indexes
When updating, local file indexers can check the system update date for the file, and only index new or modified files. Some indexers which are tightly linked to their operating system will even be notified about file changes and creation in the specified folders, and will only update index entries for those files.
Duplicate Files
Local file indexers tend to be good at removing duplicate pages, so search results don't show several copies of the same page. To do this, they must properly resolve symbolic links, shortcuts, aliases or other local file system ways of storing a reference to a file in another location.
Dynamic Elements
Local indexers will get the page exactly as it appears on the local disk. They will not include dynamic data from CGIs, SSI (server-side includes), ASP (active server pages) and so on, which can be a large part of a site. This can be an advantage if the dynamic elements are repetitive, such as navigation bars, and should not be indexed, or a disadvantage if the dynamic elements contain the content of a page. In addition, these pages will not be marked as modified unless the content of the page has changed, so they will only be re-indexed when necessary.
Security
You must be very careful about which files are allowed to stay in the indexed site directory. It's easy to index private and obsolete files by accident, allowing site visitors access to these files via the search engine. Even if the pages themselves cannot be read because they are protected by a password, unauthorized people could deduce the contents of these files by searching.

Robot Spider Indexers

Robot Spider Indexers locate files to index by following links, just like webwide search engine spiders. You specify the starting page, and these indexers will request it from the server and receive it just like a browser does. The indexer will store every word on the page and then follow each link on that page, indexing the linked pages and following each link from those pages. Most robot spider indexers allow you to designate several starting points, so even pages which are not linked from your main page can be indexed.
Because they use HTTP, robot spider indexers can be slower than local file indexers, and can put more pressure on your web server, as they ask for each page. They will miss pages which have been accidentally unlinked from any of your starting points. And spiders may have problems with framed sites, just like webwide search engine robots.
Updating Robot Indexes
To update the index, some robot spiders will query the web server about the status of each linked page by asking for the HTTP header using a "HEAD" request (the usual request for an HTML page is a "GET"). The server may be able to fill the HEAD request from an internal cache, without opening and reading the entire file, so the interaction can be much more efficient. The indexer then compares the modified date from the header with the date it last updated the index. If the page has not changed, it doesn't have to update the index. If it has changed, or if it is new and has not yet been indexed, the robot spider will send a GET request for the entire page and store every word. (A minimal sketch of this date check appears at the end of this section.)
Dynamic Elements
Robot spider indexers will receive each page exactly as a browser will receive it, with all dynamic data from CGIs, SSI (server-side includes), ASP (active server pages) and so on. This is vital to some sites, but other sites may find that the presence of these dynamic elements triggers the re-indexing process, although none of the actual text of the page has been changed.
Duplicate Files
Robots must contain special code to check for duplicate pages, due to server mirroring, alternate default page names, mistakes in relative file naming (./ instead of ../, for example), and so on.
Controlling Robot Indexing
Robot spiders cannot index unlinked files, so they will ignore all the miscellaneous files you may have in your web server directory. Webmasters can control which directories the robots should index by editing the robots.txt file, and web page creators can control robot indexing behavior using the Robots META tag.
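
Here is a minimal sketch of the HEAD-based date check described under "Updating Robot Indexes" above, using Python's standard http.client module; the host, path and stored date are stand-ins for data a real indexer would keep for each URL:

import http.client
from email.utils import parsedate_to_datetime

def page_has_changed(host, path, last_indexed):
    # "last_indexed" must be a timezone-aware datetime recorded at the previous update.
    conn = http.client.HTTPConnection(host)
    conn.request("HEAD", path)              # headers only, no page body
    response = conn.getresponse()
    modified = response.getheader("Last-Modified")
    conn.close()
    if modified is None:                    # no date from the server: re-fetch to be safe
        return True
    return parsedate_to_datetime(modified) > last_indexed

If this returns True, the spider follows up with a GET request and re-indexes the page.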
 

Index Formats

Inverted Index

An inverted index stores a list of entries, one for each word in each page, pairing the term with a pointer to the page that contains it. There may be other information as well.

For local file indexers, this pointer is a local file path; for robot spider indexers, it is a URL. Some indexers store additional structured data in a database, so the inverted index points to the database entry, which points to the page.

The indexer sorts the entries by term so that the search engine can locate the matching terms extremely quickly. This is called "inverted" because the term is used to find the record, rather than the other way around.

An index may contain gaps to allow for new entries to be added in the correct sort order without always requiring the following entries to be shifted out of the way.
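
As a sketch of the idea, here is a toy local file indexer in Python, assuming plain HTML files under a single root directory:

import os
import re

def build_index(root):
    # Map each word to the set of files that contain it: a minimal inverted index.
    index = {}
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith((".html", ".htm")):
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="ignore") as f:
                text = re.sub(r"<[^>]+>", " ", f.read())    # crude tag stripping
            for word in re.findall(r"[a-z0-9]+", text.lower()):
                index.setdefault(word, set()).add(path)
    return index

A production indexer would also record word positions and counts, normalize words, and store the structure on disk in sorted order, but the word-to-pointer mapping is the same.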

Other Index Issues

 

Updating Indexes

Indexers must also update the index periodically, to stay synchronized with web pages as they change. Indexing can be extremely CPU-intensive, so you should schedule indexing for low-usage times. For sites which change quickly, the indexer must be very fast and efficient. Most indexers perform incremental updates, indexing the changes rather than starting from scratch every time. Some will not accept searches while the index is updating, others display the old index until the update is available, and still others attempt to provide new information as soon as it has been indexed.
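
A sketch of the date check behind a typical incremental update, for a local file indexer; last_index_time would be recorded when the previous indexing run finished:

import os

def files_needing_reindex(root, last_index_time):
    # Return the HTML files modified since the last indexing run.
    changed = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith((".html", ".htm")):
                path = os.path.join(dirpath, name)
                if os.path.getmtime(path) > last_index_time:
                    changed.append(path)
    return changed

Only these files are re-read and re-indexed; entries for deleted files must also be removed, which this sketch does not show.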


Legacy Publishing and Searching Other File Formats

Many local indexers can read and index non-HTML files such as PDF, word processing, spreadsheet, presentation, accounting and even database files. They use filters or viewers to translate the data, then index it normally.

Most search tools just provide links to the non-HTML files in the results list. When a searcher clicks on the link, their browser will probably just start downloading the file. Browsers and servers can only display files in formats they know about, such as text and HTML, or for which they have a Plug-In, such as Shockwave. So there is nothing to do but download the file. In most cases, that's not what a user is expecting, and it breaks the flow of attention, especially on a slow modem.

Some search tools, including Verity, Hummingbird and the Google Search Appliance, will automatically convert the data from its native format into HTML and serve it that way. While the formatting will probably be a bit awkward, displaying the data directly (even though it takes time to convert) keeps the user's attention and lets them stay in their browser instead of opening another application.


Search User Experience

The "User Experience" covers more than the interface -- it means the system's interactions with the user, both in the forms and results interface and the functionality behind the scenes. The basic principals of clarity and familiarity that apply to all interfaces, but we recommend substantial user testing to make sure that they are implemented properly.

See our page, Web Site Searching and the User Experience, which covers the requirements for search forms, results pages, results listings, search quality and integration with information architecture. It is based on a presentation by Avi Rappoport to the SF Bay Area Computer-Human Interface SIG of the ACM.

Articles and Sites on Search User Experience Issues


Maintenance and Updating


Search Log Analysis


SearchTools.com
Copyright © 1998-2001
Search Tools Consulting