Search Tools Analysis

Google Search Appliance Review

(Software 3.4, update September 30, 2002)


Basics and Capacity

The Google Search Appliance is not just an application; it's a hardware-software combination that comes in three forms.

The software is derived from the main public Google search engine, and the operating system is their own version of Linux, tuned for being a search server without the overhead of other applications. For debugging purposes, there's a serial port and a modem, so Google support staff can dial in if necessary, but this is controlled by the customer. While it's nicely self-contained, some system administrators may be distressed at their inability to check for security breaches, viruses, and disk problems, or to perform other maintenance and troubleshooting tasks. I also found the 1U unit to be extremely loud.

Administration

The Google Search Appliance makes it easy for a librarian or editor to administer the search engine; it does not require a programmer or system administrator. The browser interface lets admins log in from anywhere, and share control over the collections. If something goes wrong, the server will send email to notify the main admin.

Setting up the system is very easy: you just plug a laptop into a special Ethernet port on the server and create a mini-network -- it even works with Macs. Using a browser, set the server IP address, DNS, timeserver, administrator's email address, mail server, and similar support information, and then it is ready to go. Note that the email address will be sent as part of the crawler's HTTP requests and will end up in web server logs, so be sure it's not a personal address.

Robot Spider Crawling Features

To locate files for indexing, Google Search Appliance uses the same "robot" spidering system as the public search engine. This starts on a page and follows each link on the page to locate other pages or documents. The administration interface provides fields where you can enter multiple starting URLs and specify which web hosts and domain names the robot is allowed to access, which keeps it from crawling inappropriate hosts. Even with a complex setup, wildcards and regular expressions make it easy to control the crawler. In version 3.2, the Pattern Testing Utility shows which URLs match specified patterns, without having to perform a test crawl.
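The idea behind testing patterns without a crawl can be sketched in a few lines of Python. This is only an illustration: the pattern lists and the function names `is_crawlable` and `check_patterns` are hypothetical, and the regular expressions are ordinary Python ones, not the appliance's own pattern syntax.

```python
import re

# Hypothetical crawl rules, written as plain Python regular expressions.
FOLLOW_PATTERNS = [
    r"^https?://www\.example\.com/",
    r"^https?://docs\.example\.com/manuals/",
]
DO_NOT_FOLLOW_PATTERNS = [
    r"\.cgi(\?|$)",
    r"/private/",
]


def is_crawlable(url):
    """Return True if the URL matches a follow pattern and no exclusion."""
    allowed = any(re.search(pattern, url) for pattern in FOLLOW_PATTERNS)
    blocked = any(re.search(pattern, url) for pattern in DO_NOT_FOLLOW_PATTERNS)
    return allowed and not blocked


def check_patterns(urls):
    """Map each URL to its crawl decision, without fetching anything."""
    return {url: is_crawlable(url) for url in urls}
```

Running `check_patterns` over a list of sample URLs shows at a glance which ones the rules would let through.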

Based on Google's public web experience, the robot does a great job of following links and retrieving documents. It handles session IDs in URLs, and automatically does the right thing when it finds pages in Lotus web server multiview formats. It uses the If-Modified-Since request, so servers can send just changed pages, and provides an override for those servers which have incorrect dates. You can schedule the days of the week, time, and maximum time allowed to index each collection. In version 3.2, they have added an "Archives" attribute, allowing search admins to specify static servers which are crawled less often.
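For readers unfamiliar with If-Modified-Since, here is a minimal Python sketch of a conditional request. It shows the general HTTP mechanism, not the appliance's internals; the function names `conditional_request` and `needs_reindex` are made up for this example.

```python
from email.utils import formatdate
from urllib.request import Request


def conditional_request(url, last_crawled_epoch):
    """Build a GET request that asks for the page only if it has changed."""
    request = Request(url)
    # HTTP dates are expressed in GMT, e.g. "Thu, 01 Jan 1970 00:00:00 GMT".
    request.add_header(
        "If-Modified-Since", formatdate(last_crawled_epoch, usegmt=True)
    )
    return request


def needs_reindex(status_code):
    """A 304 Not Modified reply means the stored copy is still current."""
    return status_code != 304
```

A server with incorrect dates may answer 304 for pages that did change, which is why an override that skips the header is useful.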

Special Robot Crawler Settings

For each domain, host or specific pattern, you can set the highest number of concurrent connections (so you don't overload servers), as well as the proxy server to use and custom HTTP "request and response" headers. These are all ways in which the search engine might have to interact with the web servers, so it's nice to have them editable. The duplicate hosts field stores mirror server names, so the robot does not try to crawl each individual host when the content is the same. All these are important when using a robot to follow links and locate documents via HTTP. Version 3.2 optimizes for shorter crawl times, and gives priority to searching, for better performance while it's crawling.

In version 3.3, the indexing has some intelligent prioritizing, so it crawls the pages with high PageRank and outbound links more frequently than other pages.

Search engine robots need to identify themselves, so that web server administrators can recognize them in their log files and can specify what each robot is allowed to see, using the robots.txt protocol. By default, this search engine identifies itself in the HTTP header and server log files as "gsa-crawler" (and includes the admin email address). While this is editable, the only reason to change it is to accommodate a truly restrictive system that disallows all but one HTTP client (browser).

To the server admin, it will look something like this in the web log:

02/04/02 20:48:16 /index.html 200 GET "" gsa-crawler (Enterprise; GID-01083; search-admin@example.com) 128.0.0.1
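A web server administrator can use that name in robots.txt to treat the appliance differently from other robots. The sketch below uses Python's standard robots.txt parser to check such a file; the robots.txt content and the `allowed_for` helper are hypothetical examples, not taken from Google's documentation.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an intranet server: the appliance may crawl
# everything except /drafts/, and all other robots are shut out.
ROBOTS_TXT = """\
User-agent: gsa-crawler
Disallow: /drafts/

User-agent: *
Disallow: /
"""


def allowed_for(agent, url, robots_txt=ROBOTS_TXT):
    """Return True if the named robot may fetch the URL under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)
```

If the crawler name is changed in the administration interface, any robots.txt rules written for "gsa-crawler" stop applying, which is one more reason to leave the default alone.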

Security

The Google Search Appliance can handle several types of system security. Because it can be located in the server room, and it has a static IP address in your block, you can give it access to servers which are not open to the outside. It can also store and send user names and passwords for server realms using Basic Authentication. Version 3.3 adds support for servers running NTLM Challenge/Response authentication, and certificate passing. For sites where the user is authenticated before being allowed to search, it can avoid requiring a login for generally-accessible data. However, it does not integrate with other institutional security systems, access control lists, or file permissions.

When performing searches on restricted information, the Google Search Appliance checks that the current user can access the documents at results time. That means that each document is checked before being displayed, so there's no problem synchronizing access controls.
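As a rough sketch of checking access at results time, the Python below filters a result list just before display. Everything here is an assumption made for illustration: `can_access` stands in for whatever check the content server performs, and the demo permission tables are invented.

```python
def visible_results(results, user, can_access):
    """Keep only the results the current user is allowed to open."""
    return [result for result in results if can_access(user, result["url"])]


# Stand-in data for the demo; a real check would ask the content server,
# for example by requesting the URL with the user's credentials.
PUBLIC_URLS = {"http://intranet.example.com/handbook.html"}
DEMO_PERMISSIONS = {
    "alice": {"http://intranet.example.com/hr/salaries.html"},
}


def demo_can_access(user, url):
    """Hypothetical access check used only for this example."""
    return url in PUBLIC_URLS or url in DEMO_PERMISSIONS.get(user, set())
```

Because the check runs for every query, a permission change on the content server takes effect on the next search, with no copy of the access rules to keep in step.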

While the Google Search Appliance can use the HTTPS protocol to get pages sent encrypted with a certificate key, you should not index documents which are truly private data such as patient records, orders, or financial information without further security -- indexing it makes it much more widely accessible.

Non-HTML Documents

The Google Search Appliance can read and index 200 file formats, from text to PDF, Microsoft Word, Excel, WordPerfect and many more obscure formats. While it will index and search the text in the tags of XML documents, there is no way to search attributes or limit searches to specific XML tags. This version has no way to scan file systems, access database records or integrate directly with document management systems or CMSs. To index this information, you'll need to build HTTP interfaces, such as directory listings for file repositories, or automatically generate HTML pages from databases.
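Generating HTML pages from a database can be quite simple. The following Python sketch assumes a hypothetical SQLite table named `articles` with `id`, `title` and `body` columns; it builds one page per record plus an index page of links, so a robot that starts at the index can find every record. Writing the pages to a web server directory is left out.

```python
import html
import sqlite3


def render_record(title, body):
    """Return a minimal HTML page for one database record."""
    return (
        "<html><head><title>{t}</title></head>"
        "<body><h1>{t}</h1><p>{b}</p></body></html>"
    ).format(t=html.escape(title), b=html.escape(body))


def export_pages(connection):
    """Map file names to HTML for every row, plus an index of links."""
    pages = {}
    links = []
    query = "SELECT id, title, body FROM articles ORDER BY id"
    for row_id, title, body in connection.execute(query):
        name = "article-{}.html".format(row_id)
        pages[name] = render_record(title, body)
        links.append(
            '<li><a href="{}">{}</a></li>'.format(name, html.escape(title))
        )
    pages["index.html"] = "<html><body><ul>{}</ul></body></html>".format(
        "".join(links)
    )
    return pages
```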

Index Status and Reports

Search admins need to know what the robot found when it crawled links on a bunch of hosts, and this search engine has wonderful reporting features. It provides a status report with frequent updates and a really cute animated gif reminding you that it's still working. Both during and after indexing, it provides a very helpful interactive report showing what was indexed and what went wrong. It provides options to see one or many hosts, the URLs, the errors and the successes, so you can really tell what happened when indexing a site.

Google does not do incremental updates in an index, so there's no way to go back and adjust things without starting your entire indexing run again. However, it's possible to remove URLs from the searchable index without reindexing. In version 3.3, you can search on a combination of two collections: a main collection for documents which don't change often, and an incremental collection for those which are often updated, such as breaking news stories. The incremental collection can be indexed continuously, while the main collection index is updated daily or weekly.

Index Content Features

In the process of indexing, the Google Search Appliance converts documents in other formats, such as PDF or Microsoft Word, to HTML, and caches the HTML version for later use. It will index from the cached copies when the original hasn't changed, which is much faster than converting again. As with the public Google search engine, the Appliance tracks anchor text and links to and from pages. Unlike the public version, it also indexes all meta fields in HTML files, including author, description, keywords, generator (inserted by some HTML editors), and Dublin Core tags.

For date display and sorting, the default is to use the last-modified date sent by the web server in the HTTP response. You can override this to use a date in a meta tag, title tag or the first date in the body of the page -- great for documents with standard publication information, so their actual date appears.
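To make the meta tag option concrete, here is a small Python sketch that prefers a date found in a meta tag and falls back to the server's last-modified value. It is an illustration under assumptions: the class and function names are invented, the meta tag name is a parameter, and the date string is returned as found rather than parsed.

```python
from html.parser import HTMLParser


class MetaDateParser(HTMLParser):
    """Collect the content of the first meta tag with the given name."""

    def __init__(self, meta_name):
        super().__init__()
        self.meta_name = meta_name.lower()
        self.date = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.date is not None:
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == self.meta_name:
            self.date = attributes.get("content")


def document_date(page_html, last_modified_header, meta_name="date"):
    """Prefer the date in the page's meta tag, else the HTTP header value."""
    parser = MetaDateParser(meta_name)
    parser.feed(page_html)
    return parser.date or last_modified_header
```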

URLs can also be organized into zones or virtual "subcollections" for later searching: a corporation might want to allow employees to search by function, such as HR, marketing, and development, or by location. However, there is no way to say that some sections, such as product information, are more important than others, such as public discussions or archives.

Recommendations and Synonyms

To improve search results, admins can recommend documents they think are most likely to be useful for specific searches. Google calls this "KeyMatch" and provides a place to type or import a list of queries, URLs and names for the recommendations. Similarly, the synonym list shows a suggested alternate term, such as "physician" when a user types in "doctor", or the preferred "handheld computer" for "PDA". These are clickable query links, and can be combined with KeyMatch links.
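The logic of recommendations and synonyms amounts to two lookup tables consulted before the normal results are shown. The Python below is a sketch only: the entries, the data layout and the function names are hypothetical and do not reflect the appliance's import format.

```python
# Hypothetical KeyMatch entries: (query term, URL, title of recommendation).
KEYMATCHES = [
    ("benefits", "http://intranet.example.com/hr/benefits.html", "Benefits Guide"),
    ("expenses", "http://intranet.example.com/forms/expenses.html", "Expense Form"),
]

# Hypothetical synonym list: typed term -> preferred term to suggest.
SYNONYMS = {
    "doctor": "physician",
    "pda": "handheld computer",
}


def keymatches_for(query):
    """Return (URL, title) recommendations whose term appears in the query."""
    words = set(query.lower().split())
    return [(url, title) for term, url, title in KEYMATCHES if term in words]


def suggestions_for(query):
    """Return the preferred alternate terms for words in the query."""
    return [SYNONYMS[word] for word in query.lower().split() if word in SYNONYMS]
```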

Search Options

The Google Search Appliance uses standard Google rules, such as searching for all words and treating upper and lowercase letters the same. It recognizes the minus sign (-) to exclude unwanted words, but does not allow the word NOT. Searchers can use quote marks to search for phrases and the Boolean command OR to specify an alternate term, but not parentheses for faceted search. The Advanced interface allows searching in the title or URL, limiting results to specific domains and file types, sorting by date, and finding links to a specific page. Sophisticated searchers can have access to standard Google search features using operators such as inurl:, intitle:, site: and link:. In addition, users can search by document type using the filetype: operator, but that will not recognize some files which don't have a three or four-letter type suffix, including directory default pages.

This version automatically indexes and searches all metadata tags and URLs, so it will find everything in the header in the normal search. To search a specific metadata field, such as the Author or Dublin Core Publisher field, the search admin can set up a parameter in the search form and send that to the engine as part of the form action, though it would be nice if these were part of the Advanced Search page. This is quite similar to the way that search forms can offer limits by zone and language.
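A search form ultimately produces a URL with the query and any limits as parameters. The Python sketch below builds such a URL. The host name is invented, and the parameter names used in the usage note (`requiredfields`, `lr`) are assumptions for illustration that should be checked against the appliance's documentation for this version.

```python
from urllib.parse import urlencode


def search_url(base, query, **limits):
    """Build a search URL from the query plus optional limit parameters."""
    params = {"q": query}
    # Skip limits left empty, as when a form field is not filled in.
    params.update({name: value for name, value in limits.items() if value})
    return base + "?" + urlencode(params)
```

For example, `search_url("http://search.example.com/search", "budget report", requiredfields="author:Smith")` adds the metadata limit to the query string in the same way a hidden form field would.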

Google Search Appliance recognizes 27 languages: Arabic, Chinese (Traditional & Simplified), Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hebrew, Hungarian, Icelandic, Italian, Japanese, Korean, Latvian, Lithuanian, Norwegian, Polish, Portuguese, Romanian, Russian, Spanish, Swedish and Turkish. This version simply allows searchers to limit their search by language: it does not have special spelling or synonyms for these languages. Light testing shows that the language identification works quite well, although multilingual pages are a problem, as they are for every automated system.

Searching with Google Search Appliance is familiar, with all the features working very much like the web search interface. The company is still working out some kinks, such as what to do about metadata. While a site may not have to worry about search spam text in the keywords field, not all header meta fields should be searched by default. For example, the "generator" field is rarely interesting to searchers, who seldom want to get all the pages created in BBEdit. Relevance ranking is generally good but similar to other search engines -- it doesn't have the extra edge of the PageRank algorithm in the public web.

Results Pages

Google Search Appliance default search results look like the public search engine, clean and simple. The search results header includes the search field, search terms, and number of matches, and a suggested alternate spelling, based on the site dictionary, if appropriate. Each search result item has the title and URL, with the size and date if available, and a "snippet" from the document that shows the matched term in context whenever possible. Oddly enough, the size and date are often missing, notably with the non-text files such as PDF and Microsoft Word. It provides cached copies and text versions of binary file formats, just like the public Google system.

Documents from the same directory are grouped, to improve the variety of results within a page: there's a clickable link to see more from that directory. Likewise, although duplicate documents are indexed, they are hidden unless asked for.

Obviously, it's a good idea to customize the search results for a specific site or Intranet. I strongly recommend applying the standard coloring, design elements, and most importantly, site navigation, to search forms and results pages. The search results come in XML format, so an intermediate application can format them. But you don't have to do that: the results layout is completely customizable by editing the XSLT code, and the server will apply it to the XML to generate standard HTML. There is only one XSLT setting on the server, but the query can include a link to an XSLT file on any other server and the Google Search Appliance will use that for formatting search results for a specific host, a section of a server, a language, or other situations. Version 3.2 adds a Page Layout helper for format customization.
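For sites that choose the intermediate application route, the work is mostly parsing the XML and writing HTML. The Python sketch below shows the shape of that job. The element names (`results`, `result`, `url`, `title`, `snippet`) are a simplified, hypothetical schema invented for this example, not the appliance's actual XML format.

```python
import html
import xml.etree.ElementTree as ET

# Hypothetical, simplified results document.
SAMPLE_XML = """<results query="budget">
  <result>
    <url>http://intranet.example.com/budget.html</url>
    <title>Budget 2002</title>
    <snippet>The annual budget covers all departments.</snippet>
  </result>
</results>"""


def results_to_html(xml_text):
    """Turn the XML results into an HTML list, or a no-matches message."""
    root = ET.fromstring(xml_text)
    items = []
    for result in root.findall("result"):
        url = result.findtext("url", default="")
        title = result.findtext("title", default=url)
        snippet = result.findtext("snippet", default="")
        items.append(
            '<li><a href="{}">{}</a><br>{}</li>'.format(
                html.escape(url, quote=True),
                html.escape(title),
                html.escape(snippet),
            )
        )
    if not items:
        query = html.escape(root.get("query", ""))
        return "<p>No documents matched {}.</p>".format(query)
    return "<ol>\n" + "\n".join(items) + "\n</ol>"
```

The same function is the natural place to wrap the results in the site's own navigation and design.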

It's particularly important to deal with search failures, because that is when your users see the search engine as being a problem. The default display is not very helpful: I recommend that a no-matches page should explain very clearly what is indexed on the site or Intranet. Luckily, the "empty result set" section of the XSLT result display allows you to display helpful information and get your users on the right track.

Search Logs and Reports

By viewing the search activity, an administrator can learn a great deal about the needs of those using the site or Intranet. The Google Search Appliance provides both raw search logs, which can be processed using standard log analysis programs, and a search log report showing searches by day, hour and top 100 keywords and queries. By comparing time periods, you can even get a feeling for trends and changes in demand.
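Raw search logs are easy to summarize with a short script. The Python below counts the most frequent queries. The log layout is an assumption made for this sketch (one search per line, a timestamp and the query separated by a tab); the appliance's real log format may differ, so the parsing line would need to be adapted.

```python
from collections import Counter


def top_queries(log_lines, limit=100):
    """Return the most frequent queries as (query, count) pairs."""
    counts = Counter()
    for line in log_lines:
        # Assumed layout: "<timestamp>\t<query>"; other lines are skipped.
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 2 or not parts[1].strip():
            continue
        # Fold case so "Benefits" and "benefits" count as one query.
        counts[parts[1].strip().lower()] += 1
    return counts.most_common(limit)
```

Running it separately on the logs for two time periods and comparing the lists gives a quick view of changes in demand.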

Conclusion

The Google Search Appliance is an excellent search engine for HTTP-accessible content, with comprehensive administration tools, wonderful reports, familiar search features and powerful customization options. However, it doesn't have the significant advantages over the competition that the public Google search does: relevance ranking is similar to other high-quality search engines. If you want more control over which metadata to index, crawling schedules or relevance weighting, or you need to integrate with enterprise security systems, textual databases, content or document management systems, this version can't accommodate you. But it's effective, fast and particularly well-priced for Intranets or web sites with millions of documents.

 

Avi Rappoport
SearchTools.com

Note: I tested version 3.0, and added version 3.2 and 3.4 update information from company literature.

see also: Google Search Appliance Product Information

Page Modified 2002-09-30

SearchTools.com - Copyright © 2002-2007 Search Tools Consulting
This work is provided under a Creative Commons Sampling Plus 1.0 License.