As of January, 2012, this site is no longer being updated, due to work and health issues

Analysis & Review of the Webinator Search Engine

Review by Avi Rappoport, Search Tools Consulting, April 24, 2007

In this review, I cover every aspect of the Thunderstone Webinator search engine, looking at what's possible, what's special and what's missing. I've been much helped by the posts on the Webinator support mailing list and the frank answers from Thunderstone's representative, as well as several working indexes on one of their test appliances.

Basics and Capacity

Webinator has a great deal of power and flexibility. It can store multiple indexes for many kinds of sources, including web sites, file servers and databases, and combine results without losing relevance ranking. It can scale up to multiple replicated servers and hundreds of queries per minute, as has been proved in several extremely-high-traffic installations, including alibris.com, QVC and their own metasearch engine (since sold), Dogpile.com.

Thunderstone's Webinator search engine comes in many different flavors:

Note that Thunderstone software licenses are perpetual, major upgrades cost less than new software, and maintenance and support are available at 18% of purchase price per year.

Indexing

Webinator has extensive tools for locating documents and storing content for later searching.

Sources of Content

The Webinator crawler has been around for a long time, and can handle most complex issues such as relative link errors, meta refresh, forms, image-maps and redirects. The JavaScript decoding is quite effective, according to my tests of common JavaScript programmatic links. The "user agent" string which the robot sends to identify itself is editable -- it will appear in web logs and should include contact information so a webmaster knows how to ask for exclusion or report problems.

In addition to a very powerful standard web spider, Webinator can index using ftp and gopher (a very old pre-Web file sharing protocol), and can index accessible local file servers as long as they can be viewed via a browser using the "file" protocol.

It has options allowing very detailed definition of which files to include and exclude: by various lists of URLs, by domain and path including regular expressions, by extension (such as .html and .ssi), by MIME type and even by patterns of text in the keywords or the content of the page. It can keep or strip parameters after a question mark in the URL.
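As a rough illustration of this kind of include/exclude filtering, here is a minimal Python sketch. It is not Webinator's implementation or configuration syntax: the rule names, the patterns and the functions strip_query and should_index are all hypothetical, chosen only to show how domain, path, extension and parameter rules can combine.

```python
import posixpath
import re
from urllib.parse import urlsplit, urlunsplit

# Hypothetical rules for illustration; not Webinator settings.
EXCLUDE_PATH_PATTERNS = [re.compile(r"/private/"), re.compile(r"/cgi-bin/")]
ALLOWED_EXTENSIONS = {".html", ".ssi", ""}  # "" allows directory-style URLs


def strip_query(url: str) -> str:
    """Drop the parameters after the question mark (and any fragment)."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def should_index(url: str, allowed_domain: str) -> bool:
    """Return True if the URL passes the domain, path and extension rules."""
    parts = urlsplit(url)
    if parts.hostname != allowed_domain:
        return False
    if any(p.search(parts.path) for p in EXCLUDE_PATH_PATTERNS):
        return False
    extension = posixpath.splitext(parts.path)[1].lower()
    return extension in ALLOWED_EXTENSIONS
```

A real crawler would also apply the MIME-type and text-pattern rules after fetching the page, which a URL-only check like this cannot do.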

For any accessible database with a JDBC interface, the DBWalker will make the connection and generate a page for each item, using an XSLT stylesheet. It can handle Oracle, Microsoft SQL Server, Sybase, PostgreSQL, and Texis, and parameters set the table, columns, key field and filter (WHERE clause). For authentication, it can use either a standard user name and password, or prompt the user for their name and password at search time for better access control. This provides an indexable and therefore searchable interface to the data, rather than requiring a full database web front end.

However, there are no programmatic or automatic tools to connect external metadata (such as author, department or rating) with a content URL.

Scheduling and Speed of Indexing

Webinator uses an adaptive scheduler which checks frequently-changing pages more often than those which never change. The scheduler form provides options for refreshing the index at intervals of once a month to every minute, with many options in between.
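Thunderstone does not publish the scheduling algorithm, so the following Python sketch only shows one common way an adaptive scheduler can work: halve the recheck interval when a page has changed, double it when it has not, and clamp the result between one minute and one month. The function name next_interval and the halving/doubling rule are assumptions, not Webinator behavior.

```python
def next_interval(current_minutes: float, changed: bool,
                  min_minutes: float = 1.0,
                  max_minutes: float = 43200.0) -> float:
    """Return the next recheck interval in minutes.

    Assumed heuristic: check changed pages twice as often, unchanged pages
    half as often. The bounds mirror the "every minute" to "once a month"
    range of the scheduler form (43200 minutes = 30 days).
    """
    proposed = current_minutes / 2 if changed else current_minutes * 2
    return max(min_minutes, min(max_minutes, proposed))
```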

By default, the indexer will request new pages as fast as the server can handle it, but it can be set to delay based on the number of pages on the site, for those servers which can't handle the load. Admins can also increase the number of threads per site and the number of concurrent servers to index, to speed indexing. When set at full capacity, the Webinator Appliance can fetch and index up to 300 pages per minute.

A useful feature of Webinator is a "watch URL" -- a page which should contain new URLs added to the site. It's checked every fifteen minutes and new pages are indexed when seen.

Security, Access Control and Authentication

Like all modern search engines, Webinator defaults to honoring the robots.txt protocol and robots meta tag. However, given that some webmasters mistakenly set these features, search admins can override the exclusions for a particular profile.
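To make the robots.txt check and the per-profile override concrete, here is a small Python sketch using the standard library's urllib.robotparser. The allowed_by_robots function and its override flag are hypothetical stand-ins for the profile setting; only the RobotFileParser calls are real library API.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt: str, user_agent: str, url: str,
                      override: bool = False) -> bool:
    """Return True if the crawler may fetch the URL.

    `override` is a hypothetical stand-in for an admin setting that
    ignores the site's exclusions for one profile.
    """
    if override:
        return True
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

This covers only robots.txt; honoring the robots meta tag requires parsing each fetched page.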

Webinator can crawl sites and access data protected by SSL (Secure Sockets Layer) -- it has a certificate and can decrypt the contents once received. However, search admins should consider the implications: if indexing really sensitive data, they should protect the whole index and search server just as they do the content server.

For access control, Webinator can index and store data protected by HTTP Basic Authentication (user name and password), Windows NTLM, and FTP authentication, and can use a primer URL, which can be a form action with login credentials. It can store login cookies and provides an editable cookie source path field, to allow special cookies per site. These allow the indexer to access the information and store it for searching, along with information about its access status. Later on, the search engine can use this to control who sees what.

Webinator has additional authentication options for pages which include embedded objects (such as frames) which are more sensitive, such as a person's bank account.

Duplicate Detection

Web sites and data stores can have duplicate files for all sorts of reasons: editorial mistakes, symbolic links, accidental copying of whole directories, and so on. Webinator will check each page against all those in the index and reject any that are duplicates. By default, it checks the body text, but admins can specify any combination of the title, description, keywords, meta tags, and body.
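The review does not say how Webinator compares pages internally, so this Python sketch assumes the simplest approach: hash the selected fields after normalizing whitespace, and reject any page whose hash has been seen before. The class DuplicateFilter, the function fingerprint and the page dictionary layout are hypothetical.

```python
import hashlib


def fingerprint(page: dict, fields=("body",)) -> str:
    """Hash the chosen fields, with runs of whitespace collapsed."""
    normalized = "\n".join(" ".join(page.get(f, "").split()) for f in fields)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class DuplicateFilter:
    """Remember fingerprints of indexed pages and flag repeats."""

    def __init__(self, fields=("body",)):
        self.fields = fields
        self.seen = {}  # fingerprint -> URL of the first page with it

    def check(self, url: str, page: dict):
        """Return the URL of an earlier duplicate, or None if the page is new."""
        key = fingerprint(page, self.fields)
        if key in self.seen:
            return self.seen[key]
        self.seen[key] = url
        return None
```

Passing fields=("title", "body") makes two pages with the same body but different titles count as distinct, which corresponds to the admin choosing which parts of the page to compare.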

Content Indexing Process

File formats

Webinator will index text and HTML documents (with any extension), text in XML fields, RTF, TeX, PostScript, Microsoft Word, Excel and PowerPoint documents, WordPerfect (may have problems finding word breaks), and Adobe Acrobat (PDF).

For PostScript, PDF and other document formats, it can try to recognize page breaks or arbitrary file size breaks and present the document in smaller chunks, while still treating it as one document for retrieval purposes. I think this can be enormously valuable for sites with large documents.

HTML Indexing

Webinator has a wide range of options for defining HTML fields containing text to be indexed, from "All Meta" to very specific fields and some tag attributes, such as image ALT text.

There are several ways to avoid indexing navigation and other standard content which is useless for search. One is to specify a tag or comment pseudo-tag for content to be ignored, such as <!-- rightsidenav-->, or content to be indexed such as <main>. Another is to use the "Remove Common" option, which tracks repeated leading and trailing text and automatically removes it from the index.
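The idea behind an option like "Remove Common" can be sketched in a few lines of Python. This is an assumption about the general technique, not a description of Webinator's code: it compares pages line by line and drops the lines that every page shares at the top and at the bottom. The function names are hypothetical, and the comparison only makes sense with two or more pages.

```python
def common_prefix_len(pages):
    """Count how many leading lines are identical across all pages."""
    count = 0
    for lines in zip(*pages):
        if len(set(lines)) != 1:
            break
        count += 1
    return count


def remove_common(pages):
    """Strip shared leading and trailing lines from each page.

    `pages` is a list of pages, each page a list of text lines.
    """
    head = common_prefix_len(pages)
    # Measure the shared tail on what remains, so head and tail never overlap.
    tail = common_prefix_len([page[head:][::-1] for page in pages])
    return [page[head:len(page) - tail] for page in pages]
```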

Character sets

The Webinator index defaults to Unicode (UTF-8) and will attempt to convert content from its native character set if possible. But for sites with a specific character set, the index charset can be changed.

There's a "CJK" (Chinese/Japanese/Korean) mode which is not described in any detail in the documentation or on the site.

There is no language or character-set detection code.

Tokenization and Stopwords

The default definition is that a word is made up of alpha-numeric characters (including diacritics and extended characters such as é and ß) and is less than 70 characters long. There's also a special case for domain names, to avoid splitting the word on the periods. By making the word definition a regular expression and the "Ignore Characters" editable, each site can define how to deal with special cases, like queries such as i/o.
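A word definition along these lines can be approximated with a regular expression. The Python sketch below is an illustration, not Webinator's actual expression: the pattern, the tokenize function and the way ignore_chars is applied (deleting the characters before splitting) are all assumptions.

```python
import re

# Assumed pattern: runs of letters and digits (including accented and
# extended characters), optionally joined by single periods so that a
# domain name such as www.example.com stays one token.
WORD_RE = re.compile(r"[^\W_]+(?:\.[^\W_]+)*")
MAX_WORD_LENGTH = 70  # words must be shorter than this


def tokenize(text: str, ignore_chars: str = "") -> list:
    """Split text into words, first deleting any characters to be ignored."""
    for ch in ignore_chars:
        text = text.replace(ch, "")
    return [w for w in WORD_RE.findall(text) if len(w) < MAX_WORD_LENGTH]
```

With the defaults, "i/o" splits into two one-letter words; with ignore_chars="/" it becomes the single word "io", which shows why the setting matters for such queries.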

The exact form of each word is stored in the index, allowing both exact and pluralized queries later. The default is to ignore the stopwords "a an the who", but it's editable and I recommend deleting all stopwords so users can search on whatever they like, such as The Who.

Index Status and Reports

The system provides both interactive near-real-time index reporting, with information about the page size and modification dates, and emailed reports, with detailed status and errors based on the verbosity setting.

The URL Listing area can display all the file paths in the index by Depth (how far from the root), URL (alphabetic order), Newest/Oldest first (date of last update within the index), or Largest/Smallest first (page size, in bytes). Not only can the search admin view the listing and identify problem areas indexed or missing, they can specify a URL path pattern and create a category for those pages, or delete those pages from the index. In addition, by clicking on a URL link, the admin can see a detailed description of that particular page as indexed, including modification date, size, download time, parents and children (incoming & outgoing links), errors, metadata, and a stripped-down text version of the page contents. This is very helpful for debugging indexing and tracking down search anomalies.

For a quick way to clean up an index that has some inappropriate contents, the delete option will remove the listed URLs from the index. However, to keep them from being indexed in future walks, the search admin must add those paths to the Exclusions section of the Walk settings.

Search Features - Query, Retrieval, Relevance Ranking

Query Processing

The query processor accepts the user's typed text, tokenizes it (converts it to words), and checks for operators, limits, filters and other instructions for processing. It also expands the query if stemming or synonyms are enabled and spellchecks the terms.

Query Operators and Special Features

Webinator supports the Internet Query Operators: + (plus) for required words, - (minus) for excluded words, , (comma) for optional words, and "" quote marks for exact phrase matches. As in other search systems, terms can be enclosed in parentheses to control the order of processing; otherwise it just goes from left to right.
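For readers who have not seen these operators parsed, here is a minimal Python sketch of how a query string can be split into required, excluded and other terms, keeping quoted phrases together. It is not Webinator's parser: the regular expression and the parse_query function are hypothetical, and it deliberately leaves out the comma operator and parentheses.

```python
import re

# An optional + or - sign, then either a quoted phrase or a bare word.
TERM_RE = re.compile(r'([+-]?)(?:"([^"]+)"|([^\s,"]+))')


def parse_query(query: str) -> dict:
    """Sort query terms into required (+), excluded (-) and other buckets."""
    parsed = {"required": [], "excluded": [], "other": []}
    for sign, phrase, word in TERM_RE.findall(query):
        term = phrase or word
        if sign == "+":
            parsed["required"].append(term)
        elif sign == "-":
            parsed["excluded"].append(term)
        else:
            parsed["other"].append(term)
    return parsed
```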

It doesn't support Boolean queries, but uses the alternative "set logic" and the operators described above.

Webinator allows wildcard searching, even in the middle of a word, such as 456*a*def. This powerful feature will solve problems on many sites, such as searching for product codes and substrings.
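One simple way to implement mid-word wildcards is to translate the pattern into a regular expression and test it against the words in the index. The Python sketch below assumes that * stands for any run of letters, digits or underscores (Python's \w); which characters the wildcard actually spans in Webinator is not stated here, and both function names are hypothetical.

```python
import re


def wildcard_to_regex(pattern: str):
    """Turn a pattern such as 456*a*def into a compiled, anchored regex."""
    literal_parts = [re.escape(part) for part in pattern.split("*")]
    return re.compile("^" + r"\w*".join(literal_parts) + "$", re.IGNORECASE)


def match_terms(pattern: str, index_terms):
    """Return the indexed terms that match the wildcard pattern."""
    regex = wildcard_to_regex(pattern)
    return [term for term in index_terms if regex.match(term)]
```

A production engine would avoid scanning every term, but the matching rule is the same.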

Additional options include the "@N" (permute or facet) operator, which is like a special case of the + operator, requiring any two or more of the previous terms in a page for it to be considered a match. This is a powerful but subtle command that most end-users will never understand or care for. The "w/" operator can require the words in the query to be within a line, sentence, paragraph, or page, and is available on the Advanced Search form.

If a user types a natural-language question, Webinator will remove common terms such as "what is" and "how much", to increase the number of items matched.

Search Zones

Webinator offers a simple way of specifying a drop-down menu to limit searches to what they term "Categories": zones or sections of an index based on URL pattern. Each index can have zones, and the menu can be automatically included within the search form. Where the path and directory structure of a site aligns to the information architecture, this is a very simple way to allow users to limit searches to their preferred area.

Stemming

Each Webinator collection can use one of three settings for stemming. The "Exact match" option means no stemming at all, while "Any word forms" means that it will match any form of a verb or noun, so searching for camp would match camps, camper, camping, camped... By far the most useful is the "Plurals & possessives" setting -- a happy medium that users expect, at least subconsciously.

Synonyms

Webinator can expand a search from one term to multiple other terms as defined in their primary "Thesaurus" file and/or a "User Thesaurus". This can be automatic or invoked when a user types a "~" (tilde) before a word. The search expansion is quite aggressive, matching the term argument with discussion, reasons, case and so on. The default thesaurus is probably too broad, though the user thesaurus would be more valuable. There's no way to identify a preferred term to specify the direction of the expansion: to add ATM to a query for automatic teller, but not the other way around, so it's only a partial solution.
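The point about direction is easier to see in code. This Python sketch is not how Webinator's thesaurus works; it shows the one-way expansion the paragraph above finds missing, using a hypothetical expand_query function and a hypothetical thesaurus dictionary in which only the listed term triggers expansion.

```python
def expand_query(terms, thesaurus):
    """Add the thesaurus entries for each term, keeping the original term.

    Expansion is one-way: a term is expanded only if it appears as a key.
    """
    expanded = []
    for term in terms:
        expanded.append(term)
        expanded.extend(thesaurus.get(term, []))
    return expanded


# Hypothetical one-way entry: "automatic teller" adds ATM, not the reverse.
ONE_WAY = {"automatic teller": ["ATM"]}
```

A two-way thesaurus is the same structure with the reverse entry added as well, which is the only behavior the review describes as available.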

Spellchecking

Instead of using a general-purpose dictionary, which would suggest words not found on a site, Webinator adopts the tactic of using the index itself as the spelling dictionary, suggesting correct spellings for even the most obscure of words and names. It also shows how many items in the index each one will find, even for multiple misspelled terms. Webinator always suggests three options, even when two are quite different from the original term, but it does do an excellent job with the first suggestions.
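Using the index vocabulary as the dictionary is straightforward to sketch in Python with the standard library's difflib. Webinator's own matching method is not documented in this review, so the similarity measure here is an assumption, and suggest and doc_counts are hypothetical names; doc_counts maps each indexed word to the number of documents containing it.

```python
import difflib


def suggest(term: str, doc_counts: dict, n: int = 3):
    """Return up to n (word, document count) suggestions from the index.

    Candidates come only from words that occur in the index, so every
    suggestion is guaranteed to find something.
    """
    candidates = difflib.get_close_matches(term, doc_counts.keys(), n=n, cutoff=0.6)
    return [(word, doc_counts[word]) for word in candidates]
```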

Retrieval

By default, Webinator requires that a matching page contain a match for every word in the query, which is generally good for large informational sites. Small sites, and commerce servers, where it's important to give a positive response whenever possible, can change the setting to disable "Require All Words".

The query operators (discussed above) define which pages Webinator will consider matches for the query. If any of the pages are protected by access control beyond a simple password, Webinator can check whether the user is allowed to see them, and suppress any which are not allowed.

There is no upper limit on the number of results found, although for better performance, the search admin can limit the number of results available (after ranking). The default is 200.

Relevance Ranking

Webinator has its own algorithm for sorting by relevance -- it does not use the standard TF-IDF algorithm. By default, it scores documents on how they match the query terms: by proximity, by word order, by word forms (stemming), by frequency in the document and index, by position within the document and by distance from the home page. The search administrator can set each of these to off, low, medium, high, or max, allowing them to tune the search based on user behavior and content available, and users can set these in the advanced search interface.
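Since the algorithm is proprietary, the following Python sketch only illustrates the general idea of combining several ranking signals under off/low/medium/high/max settings. The numeric weights, the weighted-average formula and the names LEVELS and score are all assumptions made for the example.

```python
# Hypothetical numeric weights for the five settings.
LEVELS = {"off": 0.0, "low": 0.25, "medium": 0.5, "high": 0.75, "max": 1.0}


def score(signals: dict, settings: dict) -> float:
    """Combine per-signal scores (each 0..1) into one weighted average.

    `signals` maps a factor name (e.g. "proximity") to its score for a
    document; `settings` maps the same names to an admin-chosen level.
    """
    total_weight = sum(LEVELS[settings[name]] for name in signals)
    if total_weight == 0:
        return 0.0
    weighted = sum(LEVELS[settings[name]] * value
                   for name, value in signals.items())
    return weighted / total_weight
```

Setting a factor to "off" removes it from the result entirely, while raising one factor to "max" makes documents that do well on it rise relative to the rest.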

There is no way to indicate that some sections of the site are more valuable than others: search admins may see that product pages are more likely to be relevant than old press releases, but there's no way to adjust the ranking based on that knowledge.

For sites where the page modification dates are reliable, the sort by date option is quite valuable, though I don't recommend it be set as the default in most cases. Users can click on the sort link in the navigation bar if they want to sort by date.

Results Page User Interface

A site or enterprise search results page must convey two things: the search results themselves, and the context of the search. End-users have a tendency to think every site search is Google, so it's important to use the site look-and-feel, design elements and colors to remind them it's a more targeted search. In addition, there may be site-specific requirements for the search display.

Webinator has a number of different options for customizing results pages, from simple template fields and menu options, to a comprehensive XSL stylesheet which can have additional HTML elements and if-then logic, to simply returning XML to the server for specialized formatting. The XSL offers almost every element for localization, but for complete control of the labels and text, the XML is always available.

Page Layout

At its simplest, Webinator allows search administrators to paste the header and footer HTML from the site into fields in the admin interface, and provides settings for width of the page and default font and character set, so the page design and standard navigation and graphic elements appear around the search results. There's no place to customize the page title in the template settings, though the XSL stylesheet has a variable, so the title can be set to "Example.com Search Results for 'relevance'", describing the page for bookmarks and other lists.

Results Header & Footer

By default in both the template and XSL layout, the Webinator results header includes a search form including the query terms, which is very important for the UI. It has a link to advanced search and a checkbox to search within results, which I think is unnecessary and confusing. Navigation elements include the number of matches, the number on the page, links to results pages, link to sort by date, and next/previous links.

Spellchecking

A prompt of "Did you mean?" and three spelling suggestions will automatically appear under the search field if there are no matches, and matches are found using the spellchecker. Webinator doesn't use a generic dictionary, as that would suggest many words not found on the indexed site, but adopts the tactic of using the inverted index itself (which is in alphabetical order) as the dictionary. It does a good job of matching the misspelled term to the correct term, including names and made-up words.

Search Suggestions / Best Bets

For cases when the retrieval or relevance fails to display relevant documents for common queries, Webinator offers a manual recommendation system. Any URL can be linked to query terms, a title and a description: when users search on those terms, the suggestions appear at the top and/or right of the results listing, looking much like the text advertising in web search engines. The settings allow admins to specify the label, box color, border style and width, while the XSL stylesheet is completely flexible on the layout.

This implementation of Search Suggestions is fine, but they miss the opportunity to be more flexible about query matching: it would be nice to see wildcards or regular expressions, and a way to specify that query terms should not be matched if they're in phrases. In addition, there's no way to add a link to an offsite URL, so, for example, people searching an intranet for W-4 can't be directed to the IRS.gov forms page.

Results Items

Webinator allows admins to pick results designs from a list, to format results using their XSL processor, or to get raw XML results and format on their local server. The one setting that's common to all of these is the Results Per Page, usually 10.

One nice feature of Webinator's design is that clicking on a PDF file in search results will attempt to open the file in Acrobat and locate the search terms using the internal query. Another is the ability to break up long documents into chunks by page or section, and show them as part of the search result, so a user can locate the matching pages quickly.

Webinator has three special links (which will be overkill for most users)

Settings Results Design Options

The Webinator results templates offer eight design options for search results, from terse to extensive to a familiar Google-like result. Each offers at least the title as a link to the page. Some display the result score as a percent or a partially-filled bar, the last modified date, the file size, and the "depth" (distance from the root). As users rarely see, much less understand, the results scores and depth, I recommend choosing options which do not display these elements. Similarly, choosing options without the special links will make the search results less intimidating and easier to use.

There are also choices for the length and source of the results item information -- it can show the match terms in context from the page, text from the top of the page, text from the start of the content (body), or contents of a meta description tag. In UI tests, the match terms in context provide the best information for users looking at search results.

XSL Results stylesheet

The XSL stylesheet option starts with a complete version of the most extensive layout and allows search admins to display various elements in any position, suppress those which don't apply, and call external functions for elements such as icon photos. It's fairly easy to modify for anyone who understands markup languages or scripting.

My recommendations are to remove the scores, depth, "Find Similar" and "Show Parents", which I don't think provide enough value to users. I'd also put some more information after the URL: the size and the modification date (if the dates are accurate), and rename "Match Info" to "View as Text".

Uploading and testing XSL changes is slightly awkward, and it would be nice to have the XSL processor display any error messages right after upload, but it does work.

Raw XML Results

Webinator will send back all the elements of the search results: the query, authentication information, the number of items, sort order, next and previous links for results pages, and the results items, including number, title, url, size, time last modified, match terms in context from the page, and more. These elements allow a programmer to re-create every aspect of the Webinator results page, or, more likely, customize the results as appropriate for a particular site.

Search Logs and Reports

There are index metrics with Webinator: how many pages are indexed, errors and duplicates, details of exactly which pages had problems. There don't seem to be any search server metrics, no reports on uptime, traffic per day, averages and peak loads.

Query Reports can be interactively viewed at any time. They consist of the top 100 of each category, sorted by decreasing frequency.

The Query Log in Excel Format exports the current query log as a tab delimited text file, ordered by date, with all the information above including parameters for each query.
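An exported tab-delimited log is easy to post-process for reports the product doesn't offer. The Python sketch below counts queries by frequency, much as the Query Reports do. The column names "date" and "query" are assumptions made for illustration, not the actual export layout, and top_queries is a hypothetical helper.

```python
import csv
import io
from collections import Counter


def top_queries(log_text: str, limit: int = 100):
    """Return the most frequent queries, ignoring case, as (query, count).

    Assumes a tab-delimited export with a header row containing a
    "query" column; the real column names may differ.
    """
    reader = csv.DictReader(io.StringIO(log_text), delimiter="\t")
    counts = Counter(row["query"].strip().lower() for row in reader)
    return counts.most_common(limit)
```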

Reports can be emailed daily or weekly; the log is destroyed when doing a "New" walk on the index.

This suite of search logs and reports is adequate. The basics are here, but there's no way to track clicks in the results pages without a great deal of extra programming; there are no referer reports, so there's no way to know which pages users start their search on; the query expansion isn't exposed, so there's no way to see the synonyms or stemmed versions of words; and there's nothing on commonly misspelled words.

Administration

Administration is generally done via a web browser. User accounts can be given access to specified areas of the Administration settings and not to others, so a librarian would be allowed to change many elements in an index but not delete the whole thing. These pages give basic control over the search engine, in some cases, quite extensive and detailed, and there are links to short explanations.

However, the pages are very long, the labels are confusing, and the UI is awkward down to the basics, like using two radio buttons where there could be one checkbox. In addition, the settings are not properly grouped, so controls for exactly what servers will be indexed are scattered along the list, as are authentication features. Worst of all, in the Search Settings page, the command to apply the changes is just another checkbox in the middle of a long list of settings -- one has to click it and then click the Update button.

The Webinator software version also has editable configuration files and templates. These have more options and features than the browser administration interface, but are not commonly available on the remote search and appliance.

Conclusion

Webinator, in its many forms, is a powerful but somewhat clunky tool. Its crawler is extremely flexible and configurable, and it, almost uniquely, follows JavaScript links on the Web, and crawls file servers and databases (via JDBC) without additional middleware. It can include or exclude documents or parts of documents based on metadata or contents as well as page URL, and divide long documents into virtual pages for better retrieval and viewing. Access control is quite functional for password, NTLM and cookie-based authentication schemes.

Query processing is sophisticated and allows complex searches, although not quite the same as Boolean logic. The stemming, spellcheck and synonym features are nicely calibrated, offering both extended and limited expansions. Relevance ranking is good by default and easily adjusted for sites which have special situations, such as deep directories or very short documents.

The default result pages are not ideal -- clearly no UI work has been done there either -- and the options for results page customization are a bit awkward, but the XSL option is extremely flexible, allowing scripted conditions and insertions, and complete XML results can go anywhere.

The Webinator Appliance is more open and controllable than the Google Search Appliance -- administrators can kill runaway processes and deal with kernel panics as they do otherwise with Linux, without requiring technical support help. Whether software or hardware, indexes can be replicated and the whole system scales to hundreds of thousands of documents and 2,500 queries per minute at peak load (this is huge -- most sites never get more than about 100 queries per minute).

There's a lot to like about Webinator, but the administration interface is awful. It looks as if it was designed by programmers in 1994 who'd never seen a Macintosh, never tested on actual users, and only added to over the years. I have (to my horror) seen worse interfaces, but this is ridiculously bad and difficult to use.

All in all, I give Webinator a score of 7.5 out of 10. The price is good, the software is solid, most of the features anyone could need are there, even the documentation's pretty helpful. But there are a few missing features (integration of metadata repositories in indexing, LDAP and other access control systems, Best Bets links to outside URLs, adjustable relevance weighting, search zones based on content rather than URLs) and the administration interface is so disorganized that it is a hindrance rather than a help.

(See also: SearchTools Report on Webinator)


Page Created: 2007-04-24