As of January 2012, this site is no longer being updated, due to work and health issues.
See the News page for more recent information.
High-End Search Engine Review Coming Soon
We've been busy here at SearchTools working on a review of some of the higher-end search tools, and hope to publish our results in the new year. These search engines can index and search larger sites, over 100,000 pages, and require a significant investment for installation and ongoing maintenance. We're looking primarily at AltaVista Search, Ultraseek, Verity, FAST Search, Webinator and Excalibur Retrievalware, along with the free open-source engines ht://Dig, PLWeb and Isearch, and probably Microsoft Site Server. We'll be looking at compatibility, installation, administration, indexing, search results, customized search forms and results pages, and maintenance issues, and trying to identify each product's strengths and weaknesses.
If you have any comments on these or any other search engines, please send them to us. We hope to include some of the most useful information in our reviews, so please let us know whether your comments are for publication. Your name, company, and email address will be kept private, unless you provide explicit permission for us to release them.
Remote Search Services as ASPs
Outsourcing to Application Service Providers is the latest trend on the Web, and it can be the right solution for many sites' search needs. Inktomi, for example, has always pursued the service model, providing the back end search engine for HotBot, Yahoo!, Microsoft and other broad-based portals. Working with Aeneid's EoCenter, Inktomi search services also appear on more focussed technical, financial and medical sites. Now Google has announced that they are offering search services, aiming for very large sites and portal-targeted search engines. We have not been able to test the Google service yet, but will report on it when we can. As with all remote search services, the advantages of having someone else deal with the server may well overcome any concerns about control.
We've added reviews for two new remote site search services, IndexMySite from Tippecanoe, and MiniSearch from SiteSurfer. Both are free, and reasonably solid performers worth checking out if you like the default examples. IndexMySite allows you to format your results page using CSS, so if you've already set up style sheets on your site, this may be a good solution for you.
Linux Search and Non-Text File Formats
Linux is a wonderful server platform, and several of the search engines will run on various flavors of Linux. However, they have been unable to index office productivity files, because there were few translators for the file formats. Verity has recently announced Linux versions of their K2 search toolkit (code library) and their HTML Export library, which reads file formats from vendors including Microsoft, Lotus, Corel and Applix. Once a file is in HTML, a search engine can index and search the contents, so a programmer could add this feature to any search engine. Inso has also announced file conversion products for Linux.
(PDF file, 48K) A Method for Intranet Search Engine Evaluations by Dick Stenmark, in Proceedings of IRIS22; Käkölä, T. (Ed.), Keuruu, Finland, August 7-10, 1999.
A thoughtful and practical approach to evaluating server search engines, grouping functions and weighting according to importance. The methodology provides checklists of important issues and was tested in several corporate situations.
- Robots and Spiders and Crawlers Ultraseek White Paper, September 1999 by Avi Rappoport
- [Choosing a Search Engine for INRIA]: Rapport Final (in French) INRIA, July 8, 1999 by Francis Avnaim
Report of a committee charged with choosing a search engine for INRIA (French National Institute for Research in Computer Science and Control), covering over 200,000 pages. Mainly compares AltaVista Search with Verity, and finds several advantages in AltaVista.
- Ultraseek vs. AltaVista Comparison (Draft) Michigan State University, March 11, 1999 by Edward Glowacki
Report on the advantages of buying a backup AltaVista Search system or switching to Ultraseek. Main advantages for AltaVista are smaller index sizes and installed base: problems include cumbersome upgrades, config files, difficult customization, searching specific campus zones. Ultraseek's advantages are in the browser administration, index mirroring, zone (collection) system, and Content Classification Engine, but the index size is much larger.
New and Updated Search Tools
- Microsoft Site Server Search - a component of a larger content-management system, but the search engine in Site Server is quite capable, and the system can index using a robot spider, local file system, Exchange public folders, Access databases, etc.
Remote Search Service Review
Remote site search services remove many of the technical barriers to adding search to a site. They use a robot spider to follow links and index a site, and store the index on their server. When a site visitor enters search terms into a form, the remote server matches the words in the index and generates a results page with links back to the original web site. Many of them provide a free version, with a logo or advertising banner, and a paid business version without advertising.
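The index-and-match cycle described above can be sketched in a few lines of Python. This is only a minimal illustration, not how any particular service works: the page URLs and text are made up, and `tokenize`, `build_index` and `search` are hypothetical names.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(pages):
    """Map each word to the set of page URLs that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index

def search(index, query):
    """Return the URLs that contain every word in the query, sorted."""
    words = tokenize(query)
    if not words:
        return []
    matches = set(index.get(words[0], set()))
    for word in words[1:]:
        matches &= index.get(word, set())
    return sorted(matches)

# Hypothetical pages that a robot spider might have gathered.
pages = {
    "http://www.example.com/": "Welcome to our site search review",
    "http://www.example.com/tools.html": "Search tools for large sites",
}
index = build_index(pages)
for url in search(index, "search sites"):
    print(url)
```

A real service would also rank the matches and format them into a results page. This sketch only shows why the index, rather than the original pages, answers the query.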
Our new Remote Site Search Review covers nine of these services, describing their robot spider capabilities, index features, search features, results page and results match item customization options, maintenance and reports. We will update the review periodically to include new services and features.
Search Tools Survey
We have a second round of Survey Results from the SearchTools survey, with 261 entries as of September 9, 1999. The survey has been on the SearchTools site since December 1998. You can still take the survey, as we will update results periodically.
Our second, larger sample had the same general reasons for installing search as the first: they did so to improve navigation and make their site look professional, although more are reporting that they see search as a component of customer service. Of the sites that do not have search, the large majority (72%) say they don't have time, don't know how or the installation process is too complex.
Site size, audience and content affect how web site administrators approach search and what makes them more likely to implement it.
Again in this second round, we found that larger sites (over 1,000 pages) tend to have search: of the 55 sites with more than 5,000 pages, 39 had implemented search.
We found something of a correlation between content change and search implementations: sites which change frequently (hourly or daily) are much more likely to have search installed than those which change monthly. It's very difficult to add content navigation links constantly, so we believe this trend shows search engines providing vital access to this frequently-changing data.
Sites with certain audiences, such as medical professionals, tend to have search, perhaps to conform to the audience expectations. We were surprised at the number of sites which targeted information professionals but did not have search installed.
We are seeing many more languages in our survey, including Finnish, Pilipino, Yiddish and Chinese. There is a minor trend for sites with non-English text to have search installed. We found a significant correlation between the number of languages on a site and search installation: a majority of surveyed sites with more than one language have search installed, and those with three to seven languages almost all had search.
The other important factor in search installation is the location of the server. It's much easier to install a search server script or application on a local machine, so sites with in-house servers were much more likely to have search. Those which are co-located at an ISP and those hosted by a web presence provider tend not to have search installed.
The survey also asked site administrators to rate their search tools and comment on their strengths and weaknesses. There was a lot of variation, and the Survey Products Ratings Page contains the ratings and comments about what they did and didn't like about the products.
Search Indexing Robots & Spiders
Many search engines use programs called robots to gather web pages for indexing. These programs are not limited to a pre-defined list of web pages: they can follow links on pages they find, which makes them a form of intelligent agent. The process of following links is called spidering, wandering, or gathering. We have a summary of the problems these indexers encounter, information about communicating with robots (robots.txt and META ROBOTS tags), and links to additional information on robots.
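As a rough sketch of how a robot might follow links while respecting robots.txt, the Python below uses the standard library's `html.parser` and `urllib.robotparser` modules. The robots.txt rules, the page HTML, the robot name "ExampleBot" and the `LinkExtractor` class are all made up for illustration, and everything is kept inline so that nothing is fetched from the network.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

class LinkExtractor(HTMLParser):
    """Collect the absolute URLs of all <a href="..."> links on a page."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

# Hypothetical robots.txt content, parsed from a list of lines.
robots = RobotFileParser()
robots.parse([
    "User-agent: *",
    "Disallow: /private/",
])

# Hypothetical page that the robot has already retrieved.
page_url = "http://www.example.com/index.html"
page_html = '<a href="about.html">About</a> <a href="/private/notes.html">Notes</a>'

extractor = LinkExtractor(page_url)
extractor.feed(page_html)

# Keep only the links that robots.txt allows this robot to fetch.
allowed = [link for link in extractor.links
           if robots.can_fetch("ExampleBot", link)]
print(allowed)
```

A working spider would add the allowed links to a queue, fetch each one, and repeat. It would also need to honor META ROBOTS tags, which this sketch leaves out.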
With new entries for everything from Classification to Zones, the Glossary page is designed to be an introduction to important concepts in searching, web site analysis and information retrieval. If you have a question about a term, please let us know and we'll try to add it to the page.
New and Updated Search Tools
IndexMySite remote search service comes from Tippecanoe, who also make the Tecumseh Scout local search engine. If you use Cascading Style Sheets on your site, you can use them to customize your results page.
MiniSearch remote search service has a dramatic black background with yellow text.
Trident Search Site Server is a 100% Pure Java search engine; related products include an OEM server, Java class libraries and servlets, and custom versions.
Ultraseek version 3.1 released - indexes and searches Microsoft Exchange Public folders, handles Windows NT challenge-response authentication for specific documents, supports a customizable thesaurus, integrates easily with the ad servers NetGravity and Accipiter, adds enhanced XML DTD mapping for field searching, and more.
DocFather - Professional edition now in version 2.6.1 includes meta indexing, sorted results, faster indexing and additional customization options.
Natural Language Processing in Searching
To avoid forcing searchers to memorize Boolean or other query languages, some systems allow them to type in a question, and use that as the query: this is known as "Natural Language Processing" (NLP). The simplest processing just removes stopwords and uses a vector search or other statistical approach. Some sophisticated systems try to extract concepts using linguistic analysis, and match those against concepts extracted by the indexer. Others try to categorize the form of the question and use it to define the query, so "who is" questions are not treated the same as "how many" or "why". For more information, see the NLP section of the Information Retrieval page and the Glossary page.
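The simplest approach mentioned above, removing stopwords and comparing the remaining words statistically, can be sketched as follows. This is a toy illustration under stated assumptions: the stopword list is deliberately tiny, the vectors are plain word counts compared by cosine similarity, the question classifier only looks at the opening words, and all function names and sample documents are hypothetical.

```python
import math
import re
from collections import Counter

# A tiny illustrative stopword list; real systems use much longer ones.
STOPWORDS = {"a", "an", "the", "is", "are", "of", "in", "on", "to", "for",
             "what", "who", "how", "why", "many", "do", "does", "i", "and"}

def terms(text):
    """Lowercase, split into words, and drop stopwords."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower())
            if w not in STOPWORDS]

def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values())) *
            math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def question_type(question):
    """Classify a question by its opening words (a crude heuristic)."""
    q = question.lower().strip()
    for prefix in ("who is", "how many", "why"):
        if q.startswith(prefix):
            return prefix
    return "other"

def rank(question, documents):
    """Return (score, title) pairs, best match first."""
    q_vec = Counter(terms(question))
    scored = [(cosine(q_vec, Counter(terms(text))), title)
              for title, text in documents.items()]
    return sorted(scored, reverse=True)

# Hypothetical two-document collection.
documents = {
    "Robots": "Robots and spiders gather web pages for indexing",
    "Books": "Books on information retrieval theory",
}
best_score, best_title = rank("How do robots gather pages?", documents)[0]
print(best_title)
print(question_type("Who is Danny Sullivan?"))
```

The linguistic systems described above go well beyond this, but the sketch shows how a typed question can become a query without the searcher learning any query syntax.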
New & Updated Site Search Tools
atomz remote search service will now index Adobe Acrobat (PDF) documents.
Search Engine Standards Project
Danny Sullivan of SearchEngineWatch has started the Search Engine Standards Project to encourage all search engines to support some basic standard functions, which would make searching easier for researchers. Formal participants include representatives from the largest webwide search engines as well as academics and industry analysts. The first announcement provides helpful background on the problem and some early proposals, and the discussion forum allows others to provide ideas and feedback.
All web site, portal and Intranet search engines should implement standards and conventions when possible, so that searchers can leverage their experiences. Search Engine Standards Project provides an excellent forum for search engine developers to find common ground, rather than attempting to compete on the most basic functionality.
The current proposal is for domain restriction, allowing searchers to limit results to pages on a particular domain by using the search parameter "site:". For example, if you want to search for XML, but only on the ProjectCool site, you might use the command: XML +site:projectcool.com
While single-site search engines could not use this option, Intranet and portal search engines should implement the command as soon as possible. It does not preclude a popup menu of sites or zones, merely extends the syntax of a search in a simple way, and provides a familiar function for search experts. Many webwide search engines have already implemented the "site:" parameter, and others have indicated they will do so soon.
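For developers considering the "site:" parameter, here is one way the parsing might look in Python. It is a sketch, not part of the proposal: `parse_query` and `on_site` are hypothetical names, and treating subdomains such as www.projectcool.com as part of the restricted domain is our assumption.

```python
def parse_query(query):
    """Split a query into search terms and an optional site: restriction."""
    terms, site = [], None
    for token in query.split():
        word = token.lstrip("+")  # "+" marks a required item
        if word.lower().startswith("site:"):
            site = word[len("site:"):].lower() or None
        else:
            terms.append(word)
    return terms, site

def on_site(url_host, site):
    """True if the host is the given domain or one of its subdomains."""
    if site is None:
        return True
    host = url_host.lower()
    return host == site or host.endswith("." + site)

terms, site = parse_query("XML +site:projectcool.com")
print(terms, site)
print(on_site("www.projectcool.com", site))
```

The search engine would then run the remaining terms as usual and keep only the results whose host passes `on_site`.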
Future proposals may involve additional commands for searching and tags for controlling search indexing robots. I will continue to report on the proposals as they come out. I urge all search engine developers and interested search administrators to track the project pages and get involved in the forum as well.
Information Retrieval and Search Conferences
For an overview of current implementations of information retrieval theory, I have written up a Report from the 1999 Search Engine Meeting. The presentations and discussions there provided insight into the variety of options for information retrieval, and the opportunities to go beyond basic text matching to locate the truly relevant materials.
Coming soon is the Research and Development in Information Retrieval Conference of the ACM's Information Retrieval Special Interest Group, in Berkeley, California, my home town. This meeting promises papers by leading theoretical researchers on information visualization, distributed retrieval, latent semantic indexing, cross-language information retrieval, summarization, scalability, multimedia retrieval, and so on.
I've been speaking a great deal recently, on somewhat less esoteric topics, and have posted a set of slides for my talk on understanding and choosing site, Intranet, and portal search tools. I will be speaking at Thunder Lizard's Web Design World Denver in late September and OnlineWorld 99 Chicago in late October.
There are many more conferences coming up: please send us your suggestions for other conferences to include.
Information Retrieval Books
- Modern Information Retrieval Ricardo Baeza-Yates & Berthier Ribeiro-Neto, Addison-Wesley Longman, May 1999, ISBN 020139829X, $50
- Introduction to the current state of information retrieval, including changes brought by the Web to a field that was previously oriented towards academia, libraries and corporate networks. Most of the book is not online, but the chapter on User Interfaces and Visualization in Modern Information Retrieval, written by UC Berkeley professor Marti Hearst, provides a valuable academic study of interfaces for information retrieval and searching, including graphical overviews and visualization. Get the book from Amazon or FatBrain.com and we'll get an affiliate fee.
- Managing Gigabytes: Compressing and Indexing Documents and Images (2nd edition) by Ian H. Witten, Alistair Moffat, and Timothy C. Bell. Morgan-Kaufmann Publishers; April 1999; ISBN 1558605703, $54.95
- Covers the problems of very large document collections, including compression, indexing and querying options. Praised by Steve Kirsch of Infoseek, among others. See also the MG web site and the excellent reviews. Buy it from Amazon or FatBrain.com and support this site.
XML Search Issues
Text searching is still different from database searching, but most XML search engines don't recognize this: they look for simple matches in specified fields (missing the free-text synergies) and do not provide relevance rankings in results. Perhaps they've missed the forty years of research in Information Retrieval just described.
We've updated the XML and Search page with additional issues and articles, and found some interesting XML search systems:
- GoXML Search Engine does XML-specific search by providing a second step for the query with a popup menu of "context": the markup tag for the text. There does not seem to be a way to perform free-text searching as well, but it does seem to index by using a robot and spidering external sites.
- XSet is an XML database and high performance search engine library with a very simple tag-oriented query language, expressed in XML itself.
- UC Berkeley's Cheshire system indexes and searches structured data including XML, SGML and MARC records, as well as full-text data files.
- XML.com has just published a couple of articles on creating XML metasearch engines (searching other XML databases).
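To make the "simple matches in specified fields" point concrete, here is a small Python sketch of a tag-restricted search using the standard `xml.etree.ElementTree` module. The catalog document and the `search_in_tag` function are hypothetical, and this is not how any of the systems listed above is implemented.

```python
import xml.etree.ElementTree as ET

def search_in_tag(xml_text, tag, word):
    """Return the text of every <tag> element that contains the word."""
    root = ET.fromstring(xml_text)
    word = word.lower()
    return [el.text for el in root.iter(tag)
            if el.text and word in el.text.lower()]

# Hypothetical XML document with two fields per record.
catalog = """
<catalog>
  <book>
    <title>Managing Gigabytes</title>
    <publisher>Morgan-Kaufmann</publisher>
  </book>
  <book>
    <title>Modern Information Retrieval</title>
    <publisher>Addison-Wesley Longman</publisher>
  </book>
</catalog>
"""

print(search_in_tag(catalog, "title", "retrieval"))
print(search_in_tag(catalog, "publisher", "retrieval"))
```

Note what is missing: the second search finds nothing because the word is in a different field, and the matches come back unranked. Those are the free-text and relevance features that text search engines add.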
Search User Experience Thoughts
Jakob Nielsen of useit.com has made an important point that bears repeating: the Web has found some interface conventions. While it was brand-new, the Web was wide open: there were no guidelines, and each site invented its interface as it pleased. Now there are some de-facto standards that users expect, such as colored and underlined links, navigation on the top and left sides of the page, a "home" link for the site root page, and so on. When you break these conventions, you risk confusing and losing many users. The same applies to search interfaces: user expectations have been defined by the large webwide search engines, and provide you with a useful framework for presenting your search forms and results listings. Study the interfaces for Hotbot, Infoseek or Lycos, and you'll see that they've thought hard about how to present these options and data, and tested with millions of users. This is not to say that you should not be creative: just make sure that you understand what elements are different in your presentation, that the benefits are worth the risks, and test, test, test!
In the same vein, Jeffrey Veen of HotWired and Wired Digital has pointed out a new interface convention, which he calls "LSD": Logo, Search, Directory. Pioneered by Yahoo, this layout is used by the front page of almost every portal site around, because it works! The Logo identifies the site; the Search field allows experts and power users to jump directly to the item they want; and the Directory provides both an overview of the coverage and vocabulary of the site, and a simple click-through entrance for the confused. This interface provides a sense of familiarity to site visitors, and allows them to locate their desired content in comfort.
Even following layout conventions, however, does not guarantee a usable site. For that you need solid and scalable information architecture, thoughtful navigation design, a search engine that can index and retrieve your pages with useful relevance ranking, smart catalogers to put pages in the correct directory categories, graphics that support the content rather than overwhelm it, and a commitment to usability testing. Testing should be a part of the design process from the beginning, with several small-scale tests providing early feedback. With search, you have an additional advantage: your search logs should indicate whether the users understand the options in the search forms. If not, you can fix them quickly!
New & Updated Site Search Tools
atomz provides a free search service for up to 500 pages with no advertising beyond a small banner logo, and a paid service that adds more pages, scheduled updates more than once a week, and no logo. The main advantage of this service is the advanced layout control over the results pages, which includes an HTML-like set of tags for locating and displaying results.
Perlfect Search is a capable and functional Perl CGI for NT and Unix, distributed for free under an open source license. It will index and search sites with up to approximately 2,000 documents. It can handle extended and diacritical characters, and there's an older version that's compatible with BIG5-encoded Chinese text.
PLWeb Turbo has released a new version, 3.0 for Windows NT with improved performance, customization, web-crawling capability, and a browser-based interface. AOL (which bought PLS) has also released free source code for the Unix versions of this product, allowing additional development. They will not be porting the source to NT.
SurfMap Search is a set of Java applets creating a list-like searchable site map.
Super Site Searcher Perl CGI works with other modules to create a searchable site directory.
Virage is a high-end multimedia search engine that can index and search speech files and video in real time. It indexes using speech recognition, closed captions, text titles in videos, and other subtle cues for tracking information flow.
Site Search Tools Survey
First results of our Site Search Tools Survey are in! The survey has been on the SearchTools site since December 1998, and we have tabulated the results as of March, 1999 on the Survey Results page. You can still take the survey, as we will update results periodically throughout the year.
We wanted to know why Web site administrators may or may not install search on their sites. Those who have say they did so to improve navigation on their sites. Of those who haven't, a majority says that they haven't had time, or the search engine software is too complex. Only a few say that there is no need on their site, or that their site doesn't serve enough content.
Site size, audience and content affect how web site administrators approach search and what makes them more likely to implement it. As we expected, larger sites tend to have search, to provide an alternate navigation path. Sites with certain audiences, such as medical professionals, tend to have search, perhaps to conform to the audience expectations. We were surprised at the number of sites which targeted information professionals but did not have search installed. The sites in our survey with three or more languages, which also tended to be large sites, all have search. And a large number of sites are now serving PDF, word processor files and spreadsheets, all of which can be indexed by some search engines.
The other important factor in search installation is the location of the server. It's much easier to install a new server script or application on a local machine, so sites with in-house servers were much more likely to have search. Those which are co-located at an ISP, and especially those hosted by a web presence provider tend not to have search installed.
The survey also asked site administrators to rate their search tools and comment on their strengths and weaknesses. There was a lot of variation, but Ultraseek, ht://Dig, Phantom and Webinator were the top-ranked site search tools. The Survey Products Ratings Page contains the ratings and comments about what they did and didn't like about the products.
XML and Search
XML is very much in the news these days, and some people are saying it will solve all sorts of problems with web and site searching. Our article on XML and Search analyzes these claims, describes how web sites and search engines must change to take advantage of XML and highlights the differences between text searching and XML Query Languages. The XML Resources page provides listings of XML text search engines, Query Languages, general resources, and registries for sharing DTDs, Schema and structural information.
Search Engine Standards
The Search Engine Standards Project is a new group, headed by Danny Sullivan of SearchEngineWatch, designed to foster search interface standards among the major search services. The first proposals are to standardize the use of one term for limiting search to a single web site, and another one for searching within URL links. While Danny is talking to web-wide search developers, we'd like to urge site search tools developers to sign on as well.
New & Updated Site Search Tools
AltaVista (Windows NT and Unix search tool) has just introduced a free version of AltaVista Search Intranet, Entry Level, which will index up to 3,000 pages. Sites with existing licenses can download the Entry Level version, then add their license key to increase the number of pages, so a 2,000 page site can bump up to 5,000 pages. [this has been discontinued in 2001]
Boolean Search (a Mac site search tool) has released version 2.1.2. New features include better caching and memory improvements, flexible inclusion and exclusion of files and text in specified tags, and more options for results display.
Harvest (open source), one of the early web indexing robots, has been revived. An open-source project, it has been re-implemented in Perl and can summarize documents in SOIF (Summary Object Interchange Format). This version saves the data in a database file and does not include a Broker or search engine, but it is entirely extensible.
intraSearch (a remote search service) has recently added features including indexing on command, support for framed sites, displaying meta "abstract" tag data, and indexing accented characters.
MondoSearch (Windows NT search tool and remote search supplier to ISPs) is now up to version 3.3.1, which adds background graphics, more control over the robot indexer, duplicate detection, better META tag handling, improved results page layout, and extensions to the frame-indexing features.
SearchButton (a remote search service) is now in Beta 2, offering improved customization of search forms and results lists, and robot indexer improvements.
Ultraseek Adds Linux Compatibility
The Ultraseek search engine and the Content Classification Engine now run on Linux: Red Hat Linux 5.1 on a PC, Kernel 2.0.34 or better, or glibc 2.0.7-19 or better. Other Linux search engines include Alkaline, Excite for Web Servers, Extense Station Pro, FAST Search SW, freeWais-sf, ht://Dig, IB 2.0 WebCat, ICE, ISearch, MPS Information Server, WebGlimpse and Webinator.
SearchButton.com Announces Remote Site Search Hosting Service
SearchButton.com has announced their "zero-footprint" site search service, based on a proprietary robot spider and the Verity search engine technology. The service will provide both basic and advanced search forms, and extensive administration logs for analysis. Like all remote site search services, it is compatible with Web servers on all platforms. The service is now in beta test for free: there will be a free version supported by advertising, a basic version (cost not yet determined) and additional services available.
Maintaining Context in Search Results
Wouldn't it be great if there was a simple way to navigate search results matching pages? Peterme (Peter Merholz) discusses how he helped the greeting-card store Sparks.com design a search system where the results context is carried into the detailed pages. After searching and clicking on one of the thumbnail pictures that make up the results list, you see a large picture of the card but also a small area entitled "NOT QUITE RIGHT?" which shows thumbnails of the previous and next cards, and has a button to return you to the results list. This is an innovative and clever solution to a perennial problem! Peterme.com is a trove of interesting articles and links -- I've used some of them to update the Information Architecture page.
SearchTools News - February 3, 1999
Directories Complement Search
You may want to create a directory of pages, such as those found on Yahoo or Snap, for your local site, intranet, or section of the Web. Directories allow site users to understand the scope and focus of the available information, and "drill down" from the most general to the specific. At each level, the category names and organization provide instant feedback to users. Rather than a list without any context, such as the results of a search, a set of category headings shows how the page fits into the universe of information that you are presenting. Many sites should include both directories and search to accommodate many styles of information seeking.
There are a number of applications that can help you do this, although the amount of automation can vary. Some programs simply allow anyone to manually add a URL to a specific category by submitting a site. Others allow catalogers to create sophisticated rules to specify certain words and phrases which will place a page in a category. Still others attempt to automate the entire process, grouping pages into topics based on the contents.
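The middle option, cataloger-written rules, might look something like the Python sketch below. The categories, the phrases and the `categorize` function are all made up for illustration, and real tools offer much richer rule languages than plain substring matching.

```python
# Hypothetical cataloger rules: category name -> words or phrases that
# place a page in that category.
RULES = {
    "Search Tools": ["search engine", "indexing", "spider"],
    "XML": ["xml", "dtd", "schema"],
}

def categorize(text, rules=RULES):
    """Return the sorted categories whose words or phrases appear in the text."""
    lowered = text.lower()
    return sorted(category for category, phrases in rules.items()
                  if any(phrase in lowered for phrase in phrases))

print(categorize("A new search engine with XML field mapping"))
print(categorize("Greeting cards for every occasion"))
```

Substring matching is crude (for example, "schema" would also match "schematic"), so a production rule set would probably match whole words and let catalogers review borderline pages.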
For more information, see the new page on Directory Creation Tools.
Information Retrieval Research
For more information, links and other interesting projects, see the Information Retrieval page.
- Web IR and IE (Information Extraction)
A site with excellent links to conferences, books, research and other related topics in Information Retrieval research.
- Cha-Cha - a UC Berkeley research project, provides search results in outline form with the titles of the parent documents displayed, to provide a context. Rather than showing all the meta description / page start information in the results list, this system allows searchers to click on an icon and see that additional data in a frame on the right.
- Clever - a research project in IBM's Almaden Labs concentrating on providing the best and most authoritative information on a topic. It does this by tracking the number of links pointing to a site or page, and the number of links pointing to other popular pages.
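The link-counting idea in the Clever description can be illustrated with the published hubs-and-authorities iteration, in which good "hub" pages point to good "authority" pages and vice versa. The Python below is a small sketch of that general technique on a made-up link graph. Whether it matches the internals of the Clever project is an assumption on our part, and `hubs_and_authorities` is a hypothetical name.

```python
def hubs_and_authorities(links, iterations=20):
    """links maps each page to the list of pages it links to."""
    pages = set(links) | {p for targets in links.values() for p in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # A page's authority is the sum of the hub scores of pages linking to it.
        auth = {p: sum(hub[q] for q in pages if p in links.get(q, ()))
                for p in pages}
        # A page's hub score is the sum of the authority scores it links to.
        hub = {p: sum(auth[t] for t in links.get(p, ())) for p in pages}
        # Normalize so the scores stay bounded.
        a_norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        h_norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth

# Hypothetical link graph: two directory-style pages both point at page "c".
links = {"a": ["c", "d"], "b": ["c"], "c": [], "d": []}
hub, auth = hubs_and_authorities(links)
print("best authority:", max(auth, key=auth.get))
print("best hub:", max(hub, key=hub.get))
```

In this toy graph, page "c" ends up as the top authority because both linking pages point to it, and page "a" is the top hub because it links to more of the well-regarded pages.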
Web Site Searching and the User Experience
Avi Rappoport gave a presentation to BayCHI (the ACM's Computer-Human Interface group) in the SF Bay Area on January 12, 1999. She covered the issues of designing the search form, the results pages (including no-results issues), results listings, search quality, information architecture and general user experience with search. The presentation pages, with live links to example sites, are available, along with an excellent summary of the talk by Howard Tamler.
New & Updated Site Search Tools Reports
- AIAT (Sherlock) code library - provides indexing, search and summarization on files and data streams. Free from Apple.
- DocFather - update to this Java search applet now indexes PDF and other file formats.
- Extense - a powerful search engine developed in France which uses the syntactic declination of French words (masculine/feminine and singular/plural). This is used in the search engines at voila.com (English) and voila.fr (French).
- FAST Search - optimized for speed, this engine powers the Lycos FTP search and Lycos MP3 search sites.
- Inxight LinguistX code library - provides language identification, stemming and tokenization, among other features.
- JHLSearch - Java applet for web or local use, included by O'Reilly with some CDs.
- SiteSearch-GD - a very simple Perl script that searches only one directory.
- SiteSurfer - Java applet also creates a site map and text index.
- WAIS note: old versions of WAIS may not be Y2K-compliant: if you're running it, check your code.
- WebCat - provides an interface for library catalogs on the Web.
Remote Search Hosting Services
Remote search services will crawl your site and store your info in an index on their server. When someone enters a query in the search form on your site, the link points at the search engine host. The host receives the query, does the lookup in the index, formats the results, and sends them back as an HTML page with links directly to the pages on your site.
Some services provide simple, straightforward searches, while others offer powerful advanced search functions such as proximity operators (NEAR) and date-range searching. Some are free and supported by advertising, others charge by the page, by the search or monthly.
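As an illustration of what a proximity operator does, the Python sketch below checks whether two words occur within a given number of words of each other. The default distance of five words is our assumption, since each service defines NEAR in its own way, and `positions` and `near` are hypothetical names.

```python
import re

def positions(text):
    """Map each lowercase word to the list of positions where it occurs."""
    index = {}
    for pos, word in enumerate(re.findall(r"[a-z0-9]+", text.lower())):
        index.setdefault(word, []).append(pos)
    return index

def near(text, first, second, distance=5):
    """True if the two words occur within `distance` words of each other."""
    index = positions(text)
    return any(abs(p - q) <= distance
               for p in index.get(first.lower(), [])
               for q in index.get(second.lower(), []))

text = "Remote search services index your site and host the search form"
print(near(text, "remote", "index"))
print(near(text, "remote", "form", distance=3))
```

A search engine would store the word positions in its index at crawl time rather than recomputing them for each query, but the comparison is the same.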
Remote search services do not require any programming or local access to the web server. They act like standard robot spiders, following links on your site, rather than using the local file system. For hosted sites with limited server access, these services are an excellent option.
We expect to have a full comparative report on remote site hosting services in February: in the meantime, you can try them out on our search test page.
Source Libraries for Search Code
Many of you may be interested in adding searching and text retrieval to your own applications. We've started a listing page for search source code, beginning with the excellent Findex package from LexTech and the dtSearch source package. You can also look at the open source search applications (listed on the same page), but be sure to check the license terms before including the code in your own programs.
New & Updated Site Search Tools Reports
For more news, see the Current News Page and the 1998 News Archive.