There's a paradox: the more information your site has, the more useful it is -- and the harder to navigate! No matter how well you design your site navigation elements, visitors will need other ways to find what they're looking for. Site search tools provide a powerful and familiar means to provide that access. Visitors can just type the words and press the Search button in a form, and get a list of all the documents that match those words on your site.
Luckily, you don't have to write this yourself. There are many site search tools available, for almost every platform, web server and site you can imagine. They range from free to very expensive, from easy graphic interfaces to compile-it-yourself. The information here will give you a head start in choosing the right site search tool for your site.
- Search Engine
- The program (CGI, server module or separate server) that accepts the request from the form or URL, searches the index, and returns the results page to the server.
- Search Index File
- Created by the Search Indexer program, this file stores the data from your site in a special index or database designed for very quick access. Depending on the indexing algorithm and the size of your site, this file can become very large. It must be updated often, or it will fall out of sync with the pages and return obsolete results.
- Search Forms
- HTML interface to the site search tool, provided for visitors to enter their search terms and specify their preferences for the search. Some tools provide pre-built forms.
- Search Results Listing
- HTML page listing the pages which contain text matching the search term(s). These are sorted in some kind of relevance order, usually based on the number of times the search terms appear, and whether they're in a title or header. Most results listings include the title of the page and a summary (the Meta Description data, the first few lines of the page, or the most important text). Some also include the date modified, file size, and URL. The format of this is often defined by the site search tool, but may be modified in some ways.
As you can imagine, setting all this up takes some time and effort. But before you can do so, you must choose the best site search tool, and that requires some design choices.
Site search tools will require additional disk space and processing power. Search indexes never get smaller, so be sure there is space to spare.
In addition, you must plan to update the search index soon after you've changed the files, so that searches will locate the correct data. Happily, most site search tools provide an automatic update scheduler.
When someone searches your site, the results listing is very different from the pages themselves. The list usually contains page titles and some kind of text, either the Meta Description data, the first few lines of the page, or a programmatically generated summary of the most important text. In addition, the listings are sorted by the search engine in order of relevance, according to its particular algorithm.
You can present your data well and help your visitors find what they're looking for by keeping search results in mind when you edit your pages. Note that these improvements work for both local and webwide search tools: the work you do will make your pages appear better in any search results.
The titles are the main element in a result listing, so always title your pages carefully. Give a little context as well as the specific topic of the page, and always make sure the spelling is correct. In addition, most search engines use the existence of a word in a title as a clue that the page is a good match for searches on that word, and will rank the page high up in the results list.
For example, if your site is about native plants, use "Native Plant Directory: California Live Oak" instead of just "Live Oaks" as your title. They're equally accurate, but the longer title tells your visitors what to expect on that page when they look at a results listing: it's not about Southern Live Oaks, and it's not about growing or protecting them.
You should also use the Meta Description tag to summarize the contents of each page. Many local and webwide search engines will display this as part of their results, so it provides you an opportunity to present the page in its best light. This is easier than it looks -- you'll find that many of your pages can use very similar descriptions with just the specific topic words changed.
An example of a good Meta Description for the Live Oak page might be:
<META NAME="description" CONTENT="Description of the California Live Oak with pictures, range map and growth patterns.">
As you can tell, creating a description of other pages, such as the Coast Redwood or the Douglas Iris, would be extremely easy. Use the same text and change the plant name, adding or removing the other parts depending on the contents of the page.
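Because the descriptions follow a pattern, they can even be generated by a small build script. Here is a minimal sketch in Python; the template text and plant names are just illustrations, not part of any real tool:

```python
# Generate similar Meta Description tags by swapping in the specific
# topic words, as described above. Template and names are illustrative.
TEMPLATE = ('<META NAME="description" CONTENT="Description of the '
            '{plant} with pictures, range map and growth patterns.">')

def description_tag(plant):
    """Build a Meta Description tag for one plant page."""
    return TEMPLATE.format(plant=plant)

for plant in ["California Live Oak", "Coast Redwood", "Douglas Iris"]:
    print(description_tag(plant))
```

For pages that lack a range map or pictures, you would edit the generated text rather than force every page into the same template.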
Keywords are also an important part of your pages. They allow search engines to identify the most important elements of the page and to rank the results so that the most relevant pages are at the top. You can also include common misspellings or other words that may not appear anywhere on the page. A good set of keywords encapsulates the specific topics the page covers.
An example of Meta Keywords for the Live Oak page would be:
<META NAME="keywords" CONTENT="California Live Oaks, Coast Live Oak, liveoak, Quercus Agrifolia, oak woodlands, range map, native plants, native trees">
This describes the topics on the page and means that it would be retrieved if someone does a search on any of these terms, even if they are not in the text.
Meta Keywords help search engines define the relevancy of a match. If the word "white" is anywhere in the text, the search engine will retrieve the Live Oak page on a search for "California White Oak". But because the word is not in the Meta Keywords, it can rank this page lower than others which have "white" as a keyword.
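The weighting idea behind this can be sketched in a few lines of Python. The weights below are arbitrary assumptions for illustration; real search engines use far more sophisticated ranking algorithms:

```python
# Toy relevance scoring: words in the title or Meta Keywords count for
# more than words in the body text. The weights (5, 3, 1) are arbitrary
# assumptions; real engines use more sophisticated algorithms.
def score(query_words, page):
    total = 0
    for word in query_words:
        word = word.lower()
        if word in page["title"].lower():
            total += 5                     # title match: strongest clue
        if any(word in k.lower() for k in page["keywords"]):
            total += 3                     # keyword match: next strongest
        total += page["body"].lower().split().count(word)  # 1 per body hit
    return total

live_oak = {
    "title": "Native Plant Directory: California Live Oak",
    "keywords": ["Quercus Agrifolia", "oak woodlands", "range map"],
    "body": "The California Live Oak grows in groves near the coast",
}
print(score(["oak", "range"], live_oak))
```

A page with "white" only in its body text would score far lower on a "white oak" search than a page with "white" in its title or keywords, which is exactly the behavior described above.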
Many search engines also use headings to rank a page in relevance for a particular search. They assume that words in headings are more important than words in the text, so the pages are more relevant to that search.
For example, if you search for "Oak" and "Range" and a page has both those words marked with HTML header tags, the search engines will rank it higher than pages with those words only in the body text.
Consider vocabulary when you create pages, and think of your headers as small descriptions of those sections.
An excellent example of a site search requirements analysis, selection and installation process is available at the University of Pennsylvania's web team area.
They have kindly allowed others to view their information and notes, providing a model of the procedures they followed from late 1996 through the installation of AltaVista Search Intranet in 1997. They mention that products and features have changed since then, so the results might well be different if they were going through the same process today.
- Create a plan and schedule.
- Screen available search tools based on compatibility and requirements.
- Define preliminary end-user requirements.
- Check existing listings of site search tools.
- Choose the most appropriate options for additional research.
- Develop technical requirements document for end-user needs (boolean searching, results listings, etc.), administration, cost of ownership, vendor reliability, hardware and OS compatibility.
- Evaluate options based on requirements in a table.
- Install test versions of the final candidate products.
- Perform automated and manual user tests and evaluate results.
- Define and develop required local customization.
- Install and publicize the new search tools.
Disclaimer: The University of Pennsylvania makes no commercial endorsement of any consulting service or specific product.
Another good example covers the process of choosing and installing a site search tool for several Education Department sites. The group set up a requirements document and tested Netscape Catalog (later replaced by Compass Server), InQuery, Verity Search '97 and Ultraseek, which they ultimately chose.
Follow the instructions carefully and take copious notes as you go through the steps. You may not have to reinstall the software for months or even years, and it's often difficult to reconstruct your work. The notes will also help if you install an upgrade to the software.
The search engine is the application which searches the data and returns the results to the client. This usually means creating an HTML page in the specified format.
Most search engines search within an index, created by an Indexer application. A few just search the files in real-time, but that can get very slow.
To send a search to the search engine, most systems include forms. The site visitor enters their search terms in a text field, and may select appropriate settings in the form. When they click the Submit button, the server passes that data to the search engine application.
To understand how a search engine works, you may want to look at a very simple case, such as the Perl indexer and search engine described in the WebMonkey article Roll Your Own Search Engine.
- The Common Gateway Interface (CGI) standard allows a web server to communicate with external programs.
Most site search CGIs are invoked by a site visitor filling in data and clicking a Search or Submit button on an HTML form. They take the data from a form as parameters, search for the terms, limit the results according to any other settings, and return the result list as an HTML page.
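That request/response cycle can be sketched in a short script. This is a minimal illustration only, with the index lookup stubbed out as a hard-coded table; a real search CGI would consult its index file and support many more options:

```python
# Minimal sketch of a search CGI: read the form data from the
# QUERY_STRING environment variable, look up the terms (stubbed here
# with a hard-coded table), and emit an HTML results page.
import os
from urllib.parse import parse_qs
from html import escape

FAKE_INDEX = {"oak": ["/plants/live-oak.html", "/plants/white-oak.html"]}

def handle_request(environ):
    params = parse_qs(environ.get("QUERY_STRING", ""))
    terms = params.get("q", [""])[0].lower().split()
    hits = []
    for term in terms:
        hits.extend(FAKE_INDEX.get(term, []))   # real tool: search the index
    lines = ["Content-Type: text/html", "", "<HTML><BODY><UL>"]
    for url in hits:
        lines.append('<LI><A HREF="%s">%s</A>' % (escape(url), escape(url)))
    lines.append("</UL></BODY></HTML>")
    return "\n".join(lines)

print(handle_request(os.environ))
```

The essential shape is the same in any language: parse the form parameters, search, and write an HTTP header followed by an HTML results page to standard output.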
CGI programs can be written in everything from C to Perl to AppleScript, depending on the web server and the platform. Many CGIs are portable from Unix to Windows and Macs, depending on the language and libraries they use. CGIs are compatible with many different web servers, but there is some overhead in sending the data back and forth, and some cases where the CGI programs can become overwhelmed. See also Plug-Ins.
For more information on CGI concepts, see the CGI overview at NCSA.
- Perl is a scripting language, and is not compiled to object binary like C or Pascal. It has its own syntax and libraries, and communicates with web servers using the CGI standard. You can use Perl scripts on most platforms and with most web servers.
Several web site search tools are written in Perl: see the Perl listing for details.
For more information, see the Perl Institute.
- For better data interchange, less overhead and more flexibility, web server companies have defined APIs (Application Programmer Interfaces) to their servers. This allows third-party developers to create modules for the servers which run inside the server process.
Several web site search tools are written to various server APIs. They are rarely portable and generally compiled to binary object code.
- Applications written in the Java language, which run in the Java Virtual Machine. Applets are small Java applications which run inside the browser program.
- Applications written in Java using the Java Servlet API. Many web servers now exchange data with Java applications using this interface, much like the CGI system. Because Java is designed to be cross-platform, many of the Java Servlets can run almost anywhere.
For more information, see
- Some search engines run as separate servers. The form data is passed as part of the URL, just as with a CGI request, but the search engine application runs as a separate HTTP server, often on a different machine. This reduces the load on the main web server substantially.
Web site indexers must be able to save index files in a web server directory, so that the search engine can locate them when a site visitor wants to search. Remote search engines store the index files on their own server, where the search engine uses them when the user starts searching.
- An inverted index stores a list of entries made up of the search term (all the words in a page) and a pointer to the page that contains that search term. There may be other information as well.
For local file indexers, this pointer is a local file path; for robot spider indexers, it is a URL. Some indexers store additional structured data in a database, so the inverted index points to the database entry, which points to the page.
The application sorts the data on the term so that the search engine can locate the matching terms extremely quickly. This is called "inverted" because the term is used to find the record, rather than the other way around.
- An index may contain gaps to allow for new entries to be added in the correct sort order without always requiring the following entries to be shifted out of the way.
Indexers must also update the index periodically, to stay synchronized with web pages as they change. Indexing can be extremely CPU-intensive, so you should schedule indexing for low-usage times. For sites which change quickly, the indexer must be very fast and efficient. Most indexers perform incremental updates, indexing the changes rather than starting from scratch every time. Some will not accept searches while the index is updating, others display the old index until the update is available, and still others attempt to provide new information as soon as it has been indexed.
Many local indexers can read and index non-HTML files such as PDF, word processing, spreadsheet, presentation, accounting and even database files. They use filters or viewers to translate the data, then index it normally.
Most search tools just provide links to the non-HTML files in the results list. When a searcher clicks on the link, their browser will probably just start downloading the file. Browsers and servers can only display files in formats they know about, such as text and HTML, or for which they have a Plug-In, such as Shockwave. So there is nothing to do but download the file. In most cases, that's not what a user is expecting, and it breaks the flow of attention, especially on a slow modem.
Some search tools, including Verity, Hummingbird and Google Search Appliance, will automatically convert the data from its native format into HTML and will serve it like that. While the formatting will probably be a bit awkward, displaying the data directly (even though it requires time to convert), keeps the user's attention and lets them stay in their browser instead of opening another application.
The "User Experience" covers more than the interface -- it means the system's interactions with the user, both in the forms and results interface and in the functionality behind the scenes. The basic principles of clarity and familiarity apply to all interfaces, but we recommend substantial user testing to make sure they are implemented properly.
See our page, Web Site Searching and the User Experience, which covers the requirements for search forms, results pages, results listings, search quality and integration with information architecture. From a presentation by Avi Rappoport to the SF Bay Area Computer-Human Interface SIG of the ACM.
Articles and Sites on Search User Experience Issues
- UseIt.com, Jakob Nielsen's discussions of web site usability issues, includes information on site search, especially the Search Usability article.
- UsableWeb - Searching Section has links to many good articles on computer-human interface and search issues.
- Archives of the CHI-WEB mailing list - electronic discussion of computer-human interface issues on the Web including search.
- Looking for Something? Peter Bickford, Netscape View Source Magazine
Describes problems in searching with inflexible systems and suggests smart solutions that provide useful answers.
- Clarifying Search: A User-Interface Framework for Text Searches, Ben Shneiderman, Don Byrd and W. Bruce Croft, D-Lib Magazine, January 1997.
Covers the principles of designing helpful search interfaces, good results, and testing to make sure it all makes sense.
- User-Centered Iterative Design for Digital Libraries, Nancy A. Van House, et al., D-Lib Magazine, February 1996.
Describes how the team improved searching for an online database of color images by including user testing at all stages of the process.