As of January, 2012, this site is no longer being updated, due to work and health issues

Search Tools Analysis

Google Search Appliance and Mini Appliance Review

(Version 4.6.4 software, as of August, 2007)


Basic Capacity

The Google Search Appliance is not just an application; it's a hardware-software combination that comes in four forms:

Google Mini Appliance: 1U blue server box, can search up to 300,000 documents.
1 year with tech support: $2,000 for up to 50,000 documents; $3,000 for up to 100,000; $6,000 for up to 200,000; $9,000 for up to 300,000. Additional single year of tech support: $1,000.

GB-1001: A 2U server that can fit in any standard server rack, can search up to 3 million documents at 25 queries per second.
2007 pricing: $30,000 for up to 500,000 documents and two years of tech support, no additional pricing given.
 
GB-5005: A rack of five of the small servers, which can index up to 10 million documents. It comes with a power supply and automatic internal clustering and failover for built-in redundancy.
2007 pricing not given (old pricing: $230,000)

GB-8008: A multi-server standalone rack with its own power supply, load-balancing and distributed search functionality, which can index up to 30 million documents.
2007 pricing not given (old pricing: $450,000 for an 8U server rack with secure system crawling and additional load-balancing features)
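
The Mini's tiered pricing listed above amounts to a simple lookup. Here is a minimal sketch in Python, using only the 2007 figures quoted in this review; the function and variable names are made up for illustration.

```python
# Mini pricing tiers quoted above: (maximum documents, USD price with 1 year of support).
MINI_TIERS = [
    (50_000, 2_000),
    (100_000, 3_000),
    (200_000, 6_000),
    (300_000, 9_000),
]
EXTRA_SUPPORT_YEAR = 1_000  # each additional single year of tech support


def mini_price(documents, support_years=1):
    """Return the quoted price for a Mini licensed for `documents` documents."""
    if support_years < 1:
        raise ValueError("the quoted prices include the first year of support")
    for max_docs, price in MINI_TIERS:
        if documents <= max_docs:
            return price + EXTRA_SUPPORT_YEAR * (support_years - 1)
    raise ValueError("more than 300,000 documents needs a larger appliance")


# 120,000 documents falls in the 200,000-document tier; add one extra support year.
print(mini_price(120_000, support_years=2))
```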

The appliance operating system is Google's own version of Linux, tuned for supporting a search server without the overhead of other applications. The appliance is sealed, and breaking the seal breaks the license and forfeits future technical support. For debugging purposes, there's an Ethernet port, so Google support staff can SSL in if necessary, but this is controlled by the customer. While it's nicely self-contained, some system administrators may be distressed at their inability to check for security breaches, viruses and disk problems, or to perform other maintenance and troubleshooting tasks. I and other reviewers have found the 1U unit to be extremely loud.

Administration

The Google Search Appliance makes it easy for a librarian or editor to administer the search engine: it does not require a programmer or system administrator. The browser interface lets admins log in from anywhere, and share control over the collections. If something goes wrong, the server will send email to notify the main admin.

Setting up the system is very easy: you just plug a laptop into a special Ethernet port on the server and create a mini-network -- it even works with Macs. Using a browser, set the server IP address, DNS, timeserver, administrator's email address, mail server, and similar support information, and then it is ready to go. Note that the email address will be sent as part of the HTTP request and will end up in server logs, so be sure it's not a personal address.

Administration can be shared and delegated, allowing product managers or intranet webmasters access to certain collections and/or interface features.

Finding Searchable Content - Web, File System, and Databases Access

To locate web files for indexing, Google Search Appliances use the same "robot" spidering system as the public search engine, and most other enterprise search software. This starts on a page and follows each link on the page to locate other pages or documents.

To index local file shares and network volumes, they can use directory browsing (aka "web-enabled file system") using Microsoft IIS, or the file system protocols SMB and CIFS. Note that file system crawling does not include security and access controls.
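
For SMB crawling, a share has to be addressed with an "smb://" URL rather than a Windows UNC path. The following is a minimal Python sketch of that conversion; the function name and the sample server and share are hypothetical, and the exact URL form the appliance expects should be checked against its documentation.

```python
from urllib.parse import quote


def unc_to_smb_url(unc_path):
    r"""Convert a UNC path such as \\server\share\dir\file.doc to an smb:// URL."""
    if not unc_path.startswith("\\\\"):
        raise ValueError("not a UNC path: " + unc_path)
    # Split on backslashes and drop empty pieces (for example, a trailing backslash).
    parts = [piece for piece in unc_path[2:].split("\\") if piece]
    if len(parts) < 2:
        raise ValueError("a UNC path needs at least a server and a share")
    server, rest = parts[0], parts[1:]
    # Percent-encode spaces and other unsafe characters in each path segment.
    return "smb://" + server + "/" + "/".join(quote(piece) for piece in rest)


print(unc_to_smb_url(r"\\fileserver\public\Annual Report.doc"))
```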

To push content or associate metadata with web pages, the larger GSAs, but not the Mini, can accept feeds in a standard XML or HTML format.

To index other structured content, the larger GSAs have both a database crawler and a connector framework, allowing programmatic access if nothing else works. When a search user clicks on a result, it must send a URL to a server of some kind, which can then either respond with HTML or open a helper application to show the search result.

Robot Crawler Settings

To locate web files for indexing, Google Search Appliance uses the same "robot" spidering system as the public search engine (without having to worry about search spam pages). This starts on a page and follows each link on the page to locate other pages or documents, much like most other enterprise search software. The administration interface provides fields that take multiple URLs, to specify which web hosts and domain names the robot is allowed to access, and to exclude pages, directories or whole hosts from the indexing crawl. Even with a complex setup, wildcards and regular expressions make it easy to control the crawler. A useful interactive Pattern Testing Utility shows which URLs match specified patterns, without having to perform a test crawl.
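
The idea behind pattern testing can be sketched in a few lines of Python. This is only an illustration: the patterns are ordinary Python regular expressions standing in for the appliance's own pattern syntax, and the hosts and function name are hypothetical.

```python
import re

# Hypothetical follow and exclude patterns, written as Python regular expressions.
FOLLOW = [r"^https?://www\.example\.com/", r"^https?://docs\.example\.com/"]
EXCLUDE = [r"/private/", r"\.(?:zip|exe)$", r"[?&]sessionid="]


def would_crawl(url):
    """True if the URL matches a follow pattern and no exclude pattern."""
    followed = any(re.search(pattern, url) for pattern in FOLLOW)
    excluded = any(re.search(pattern, url) for pattern in EXCLUDE)
    return followed and not excluded


for url in [
    "http://www.example.com/products/index.html",
    "http://www.example.com/private/salaries.html",
    "http://other.example.org/",
]:
    print(url, would_crawl(url))
```

Checking a handful of representative URLs this way is much quicker than running a crawl and reading the report afterwards.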

This crawler does a great job of following links and retrieving documents. It handles redirects, odd relative links, odd characters in file names, frames, many kinds of session IDs in URLs, and automatically does the right thing in many cases, for example, when it finds pages in Lotus web server multiview formats. The crawler ("gsa-crawler") follows the instructions in the robots.txt and robots meta tag, honoring the site crawling preferences. However, it is unable to follow Java or JavaScript links, or documents generated dynamically with JavaScript.
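
To see how a robots.txt file will be read for the "gsa-crawler" user agent, the standard Python robots.txt parser is a reasonable stand-in. This is a sketch only: the sample file and host are made up, and the appliance's own parser may differ in edge cases.

```python
from urllib.robotparser import RobotFileParser

# A sample robots.txt, parsed from a string so no network access is needed.
ROBOTS_TXT = """\
User-agent: gsa-crawler
Disallow: /drafts/

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The appliance's crawler may fetch everything except /drafts/.
print(parser.can_fetch("gsa-crawler", "http://intranet.example.com/handbook.html"))
print(parser.can_fetch("gsa-crawler", "http://intranet.example.com/drafts/plan.html"))
# Every other robot is shut out by the wildcard entry.
print(parser.can_fetch("some-other-bot", "http://intranet.example.com/handbook.html"))
```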

The admin can schedule the days of the week, time, and maximum time allowed to crawl, or enable continuous crawling with intelligent prioritizing, so it crawls the pages with high PageRank and outbound links more frequently than other pages. The admin can set the highest number of concurrent connections (so the indexing process doesn't overload servers), as well as the proxy server to use and custom HTTP "request and response" headers. These are all ways in which the search engine might have to interact with the web servers, so it's nice to have them editable. The duplicate hosts field stores mirror server names, so the robot does not try to crawl each individual host when the content is the same. All these are important when using a robot to follow links and locate documents via HTTP.
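
The duplicate hosts idea is easy to picture as a host-name rewrite applied before a URL is queued. The sketch below is my own illustration in Python, not the appliance's implementation; the host names and the function name are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical mirror hosts, each mapped to the one host that should be crawled.
DUPLICATE_HOSTS = {
    "www2.example.com": "www.example.com",
    "mirror.example.com": "www.example.com",
}


def canonical_url(url):
    """Rewrite a mirror host to its canonical host; leave other URLs alone."""
    parts = urlsplit(url)
    if parts.hostname is None:
        return url  # relative URL or no host: nothing to rewrite
    host = DUPLICATE_HOSTS.get(parts.hostname, parts.hostname)
    # Keep an explicit port if there was one (any user:password part is dropped).
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


seen = {canonical_url(u) for u in [
    "http://www.example.com/a.html",
    "http://www2.example.com/a.html",
    "http://mirror.example.com/a.html",
]}
print(seen)  # three mirror URLs collapse to a single entry
```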

File System Crawling

File shares can be crawled through directory browsing in Microsoft IIS, or with URLs using the "smb://" protocol, because the appliance doesn't really recognize the UNC pathname format (though it may do that in the future). The shared network volume can be on Windows, Unix, Linux or Mac OS X. It can

Feeds

Database Access

Security

The Google Search Appliance can handle several types of system security. Because it can be located in the server room, and it has a static IP address in your block, you can give it access to servers which are not open to the outside. It can also store and send user names and passwords for server realms using Basic Authentication. Version 3.3 adds support for servers running NTLM Challenge/Response authentication, and certificate passing. For sites where the user is authenticated before being allowed to search, it can avoid requiring a login for generally-accessible data. However, it does not integrate with other institutional security systems, access control lists, or file permissions.
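
For reference, Basic Authentication simply sends the user name and password, joined by a colon and base64-encoded, in an HTTP request header. A minimal Python sketch follows; the function name is mine, and the credentials are the well-known example pair from the HTTP authentication RFCs.

```python
import base64


def basic_auth_header(username, password):
    """Build the Authorization header used by HTTP Basic Authentication."""
    raw = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return {"Authorization": "Basic " + token}


print(basic_auth_header("Aladdin", "open sesame"))
```

Because base64 is an encoding and not encryption, anyone who can see the request can recover the password, which is one more reason to give the crawler its own low-privilege account.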

When performing searches on restricted information, the Google Search Appliance checks that the current user can access the documents at results time. That means that each document is checked before being displayed, so there's no problem synchronizing access controls.
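
Checking access at results time can be sketched as a filter over the matched URLs. In the Python sketch below, the access table and the can_access function are hypothetical stand-ins for asking the content server whether the current user may fetch each document.

```python
# Hypothetical access table: which users may open which URL prefixes.
ACCESS = {
    "alice": ["http://intranet.example.com/hr/", "http://intranet.example.com/public/"],
    "bob": ["http://intranet.example.com/public/"],
}


def can_access(user, url):
    """Stand-in for asking the content server whether `user` may fetch `url`."""
    return any(url.startswith(prefix) for prefix in ACCESS.get(user, []))


def visible_results(results, user, check=can_access):
    """Keep only the results the user may open at the moment of the search."""
    return [url for url in results if check(user, url)]


hits = [
    "http://intranet.example.com/hr/salaries.html",
    "http://intranet.example.com/public/menu.html",
]
print(visible_results(hits, "bob"))
```

The cost of this design is one check per candidate result, so a query that matches many restricted documents takes longer to display.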

While the Google Search Appliance can use the HTTPS protocol to get pages sent encrypted with a certificate key, you should not index documents which are truly private data such as patient records, orders, or financial information without further security -- indexing it makes it much more widely accessible.

Indexing Document Contents

Non-HTML Documents

The Google Search Appliance can read and index 200 file formats, from text to PDF, Microsoft Word, Excel, WordPerfect and many more obscure formats. While it will index and search the text in the tags of XML documents, there is no way to search attributes or limit searches to specific XML tags. This version has no way to scan file systems, access database records or integrate directly with document management systems or CMSs. To index this information, you'll need to build HTTP interfaces, such as directory listings for file repositories, or automatically generate HTML pages from databases.
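
Generating crawlable HTML pages from database records might look like the following Python sketch. The table, column names and URL scheme are invented for illustration, and an in-memory SQLite database stands in for the real data source.

```python
import html
import sqlite3


def record_to_html(title, author, body):
    """Render one database record as a small HTML page with an author meta tag."""
    return (
        "<html><head>"
        f"<title>{html.escape(title)}</title>"
        f'<meta name="author" content="{html.escape(author)}">'
        "</head><body>"
        f"<p>{html.escape(body)}</p>"
        "</body></html>"
    )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT, author TEXT, body TEXT)")
conn.execute(
    "INSERT INTO articles (title, author, body) VALUES (?, ?, ?)",
    ("Travel & Expense Policy", "HR", "Submit receipts within 30 days."),
)

# One page per record, keyed by the path a web server would publish it under.
pages = {
    f"/articles/{row_id}.html": record_to_html(title, author, body)
    for row_id, title, author, body in conn.execute("SELECT id, title, author, body FROM articles")
}
print(pages["/articles/1.html"])
```

A plain index page linking to every generated page gives the robot a starting point from which it can reach all the records.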

Index Status and Reports

Search admins need to know what the robot found when it crawled links on a bunch of hosts, and this search engine has wonderful reporting features. It provides a status report with frequent updates and a really cute animated gif reminding you that it's still working. Both during and after indexing, it provides a very helpful interactive report showing what was indexed and what went wrong. It provides options to see one or many hosts, the URLs, the errors and the successes, so you can really tell what happened when indexing a site.

Google does not do incremental updates in an index, so there's no way to go back and adjust things without starting your entire indexing run again. However, it's possible to remove URLs from the searchable index without reindexing. In version 3.3, you can search on a combination of two collections: a main collection for documents which don't change often, and an incremental collection for those which are often updated, such as breaking news stories. The incremental collection can be indexed continuously, while the main collection index is updated daily or weekly.

Index Content Features

In the process of indexing, the Google Search Appliance converts from any other format, such as PDF or Microsoft Word, to HTML, and caches the HTML version for later use. It will index from the cached copies when the original hasn't changed, which is much faster than converting again. As with the public Google search engine, the Appliance tracks anchor text and links to and from pages. Unlike the public version, it also indexes all meta fields in HTML files, including author, description, keywords, generator (inserted by some HTML editors), and Dublin Core tags.

For date display and sorting, the default is to use the last-modified date sent by the web server in the HTTP response. You can override this to use a date in a meta tag, title tag or the first date in the body of the page -- great for documents with standard publication information, so their actual date appears.
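
The date override logic can be sketched as "prefer the document's own date, fall back to the server's". In this Python sketch the meta tag name ("date") and its YYYY-MM-DD format are assumptions for illustration; the Last-Modified parsing uses the standard HTTP date format.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def document_date(headers, meta):
    """Prefer a date meta tag; fall back to the Last-Modified response header."""
    if "date" in meta:
        # Assumed meta format: YYYY-MM-DD, treated as UTC.
        return datetime.strptime(meta["date"], "%Y-%m-%d").replace(tzinfo=timezone.utc)
    if "Last-Modified" in headers:
        return parsedate_to_datetime(headers["Last-Modified"])
    return None  # no usable date, so the result has nothing to sort on


headers = {"Last-Modified": "Mon, 30 Sep 2002 12:00:00 GMT"}
print(document_date(headers, {}))                      # server date
print(document_date(headers, {"date": "2002-08-15"}))  # publication date wins
```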

URLs can also be organized into zones or virtual "subcollections" for later searching: a corporation might want to allow employees to search by function, such as HR, marketing, and development, or by location. However, there is no way to say that some sections, such as product information, are more important than others, such as public discussions or archives.

Recommendations and Synonyms

To improve search results, admins can recommend documents they think are most likely to be useful for specific searches. Google calls this "KeyMatch" and provides a place to type or import a list of queries, URLs and names for the recommendations. Similarly, the synonym list shows a suggested alternate term, such as "physician" when a user types in "doctor", or the preferred "handheld computer" for "PDA". These are clickable query links, and can be combined with KeyMatch links.
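
Conceptually, recommendations and synonyms are two small lookup tables consulted alongside the normal search. The Python sketch below shows the idea with simple whole-word matching; the table contents, names and matching rule are my own assumptions, not the appliance's KeyMatch file format or matching behavior.

```python
# Hypothetical recommendation table: query word -> (link name, URL).
KEYMATCHES = {
    "vacation": ("Vacation Policy", "http://intranet.example.com/hr/vacation.html"),
}
# Hypothetical synonym table: typed word -> preferred term.
SYNONYMS = {"doctor": "physician", "pda": "handheld computer"}


def suggestions(query):
    """Return (recommended links, alternate queries) for a search query."""
    words = query.lower().split()
    recommended = [KEYMATCHES[word] for word in words if word in KEYMATCHES]
    alternates = []
    if any(word in SYNONYMS for word in words):
        # Offer the query again with each known word swapped for its preferred term.
        alternates.append(" ".join(SYNONYMS.get(word, word) for word in words))
    return recommended, alternates


print(suggestions("doctor vacation"))
```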

Search Options

The Google Search Appliance uses standard Google rules, such as searching for all words and treating upper and lowercase letters the same. It recognizes the minus sign (-) to exclude unwanted words, but does not allow the word NOT. Searchers can use quote marks to search for phrases and the Boolean command OR to specify an alternate term, but not parentheses for faceted search. The Advanced interface allows search in title or URL, limits to specific domains and file types, and sorting by date, as well as finding links to a specific page. Sophisticated searchers can have access to standard Google search features using operators such as inurl:, intitle:, site: and link:. In addition, users can search by document type using the filetype: operator, but that will not recognize some files which don't have a three or four-letter type suffix, including directory default pages.

This version automatically indexes and searches all metadata tags and URLs, so it will find everything in the header in the normal search. To search a specific metadata field, such as the Author or Dublin Core Publisher field, the search admin can set up a parameter in the search form and send that to the engine as part of the form action, though it would be nice if these were part of the Advanced Search page. This is quite similar to the way that search forms can offer limits by zone and language.
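
A search form ultimately produces a URL, so the operators and the metadata limit can be sketched as query-string building. In the Python sketch below the host is hypothetical, and the "/search" path, the "requiredfields" parameter name and its "field:value" syntax are assumptions to be checked against the appliance's documentation for your version.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def search_url(host, terms, exclude=(), site=None, author=None):
    """Build a search URL from plain terms plus Google-style operators."""
    query = list(terms) + ["-" + word for word in exclude]
    if site:
        query.append("site:" + site)
    params = {"q": " ".join(query)}
    if author:
        # Assumed parameter name and syntax for limiting to a metadata field.
        params["requiredfields"] = "author:" + author
    return "http://" + host + "/search?" + urlencode(params)


url = search_url(
    "search.example.com",
    ["expense", "report"],
    exclude=["draft"],
    site="hr.example.com",
    author="Smith",
)
print(url)
print(parse_qs(urlsplit(url).query))  # decoded view of the same parameters
```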

Google Search Appliance recognizes 27 languages: Arabic, Chinese (Traditional & Simplified), Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hebrew, Hungarian, Icelandic, Italian, Japanese, Korean, Latvian, Lithuanian, Norwegian, Polish, Portuguese, Romanian, Russian, Spanish, Swedish and Turkish. This version simply allows searchers to limit their search by language: it does not have special spelling or synonyms for these languages. Light testing shows that the language identification works quite well, although multilingual pages are a problem, as they are for every automated system.

Searching with Google Search Appliance is familiar, with all the features working very much like the web search interface. The company is still working out some kinks, such as what to do about metadata. While a site may not have to worry about search spam text in the keywords field, not all header meta fields should be searched by default. For example, the "generator" field is rarely interesting to searchers, who rarely want to get all the pages created in BBEdit. Relevance ranking is generally good but similar to other search engines -- it doesn't have the extra edge of the PageRank algorithm in the public web.

Results Pages

Google Search Appliance default search results look like the public search engine, clean and simple. The search results header includes the search field, search terms, and number of matches, and a suggested alternate spelling, based on the site dictionary, if appropriate. Each search result item has the title, URL, with the size and date if available, and a "snippet" from the document shows the matched term in context whenever possible. Oddly enough, the size and date are often missing, notably with the non-text files such as PDF and Microsoft Word. It provides cached copies and text versions of binary file formats, just like the public Google system.

Documents from the same directory are grouped, to improve the variety of results within a page: there's a clickable link to see more from that directory. Likewise, duplicate documents are indexed, but they are hidden unless asked for.

Obviously, it's a good idea to customize the search results for a specific site or Intranet. I strongly recommend applying the standard coloring, design elements, and most importantly, site navigation, to search forms and results pages. The search results come in XML format, so an intermediate application can format them. But you don't have to do that: the results layout is completely customizable by editing the XSLT code, and the server will apply it to the XML to generate standard HTML. There is only one XSLT setting on the server, but the query can include a link to an XSLT file on any other server and the Google Search Appliance will use that for formatting search results for a specific host, a section of a server, a language, or other situations. Version 3.2 adds a Page Layout helper for format customization.

It's particularly important to deal with search failures, because that is when your users see the search engine as being a problem. The default display is not very helpful: I recommend that a no-matches page should explain very clearly what is indexed on the site or Intranet. Luckily, the "empty result set" section of the XSLT result display allows you to display helpful information and get your users on the right track.
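
For sites that prefer an intermediate application to XSLT, the same two cases (matches and no matches) can be handled in a few lines. This Python sketch uses a heavily simplified sample of the XML results; the element names (GSP, RES, R, U, T, S) are assumptions from memory of the results format and should be checked against real output, and the no-matches wording is only an example.

```python
import html
import xml.etree.ElementTree as ET

# Simplified sample response; real output carries many more elements.
SAMPLE_XML = """\
<GSP>
  <Q>expense report</Q>
  <RES>
    <R><U>http://intranet.example.com/hr/expenses.html</U><T>Expense Reports</T><S>How to file an expense report</S></R>
  </RES>
</GSP>
"""


def results_to_html(xml_text):
    """Turn XML search results into an HTML list, or a helpful no-matches note."""
    root = ET.fromstring(xml_text)
    items = []
    for result in root.iter("R"):
        url = result.findtext("U", default="")
        title = result.findtext("T", default=url)
        snippet = result.findtext("S", default="")
        items.append(
            f'<li><a href="{html.escape(url)}">{html.escape(title)}</a>'
            f"<br>{html.escape(snippet)}</li>"
        )
    if not items:
        return "<p>No matches. This search covers the HR and product pages only.</p>"
    return "<ol>" + "".join(items) + "</ol>"


print(results_to_html(SAMPLE_XML))
print(results_to_html("<GSP><Q>zzzz</Q></GSP>"))
```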

Search Logs and Reports

25 queries per second

By viewing the search activity, an administrator can learn a great deal about the needs of those using the site or Intranet. The Google Search Appliance provides both raw search logs, which can be processed using standard log analysis programs, and a search log report showing searches by day, hour and top 100 keywords and queries. By comparing time periods, you can even get a feeling for trends and changes in demand.
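
If you process the raw logs yourself, the same kind of report takes only a few lines. The Python sketch below assumes a made-up log format (a timestamp, a tab, then the query); the appliance's real log layout is different and would need its own parsing.

```python
from collections import Counter

# Hypothetical log format: "YYYY-MM-DD HH:MM:SS<TAB>query".
SAMPLE_LOG = """\
2007-08-14 09:15:02\tvacation policy
2007-08-14 09:47:30\texpense report
2007-08-14 14:03:11\tVacation Policy
"""


def summarize(log_text, top=100):
    """Return (top queries with counts, searches per hour of the day)."""
    queries, hours = Counter(), Counter()
    for line in log_text.splitlines():
        if not line.strip():
            continue
        timestamp, query = line.split("\t", 1)
        queries[query.strip().lower()] += 1  # fold case so variants count together
        hours[timestamp[11:13]] += 1         # the HH part of the timestamp
    return queries.most_common(top), sorted(hours.items())


top_queries, by_hour = summarize(SAMPLE_LOG)
print(top_queries)
print(by_hour)
```

Running the same summary over two different date ranges is a simple way to compare time periods and spot changes in demand.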

Conclusion

The Google Search Appliance is an excellent search engine for HTTP-accessible content, with comprehensive administration tools, wonderful reports, familiar search features and powerful customization options. However, it doesn't have the significant advantages over the competition that the public Google search does: relevance ranking is similar to other high-quality search engines. If you want more control over which metadata to index, crawling schedules or relevance weighting, or you need to integrate with enterprise security systems, textual databases, content or document management systems, this version can't accommodate you. But it's effective, fast and particularly well-priced for Intranets or web sites with millions of documents.

 

Avi Rappoport
SearchTools.com

Note: I tested version 3.0, added version 3.2 and 3.4 update information from company literature

see also: Google Search Appliance Product Information

Page Modified 2002-09-30

SearchTools.com - Copyright © 2002-2008 Search Tools Consulting
This work is provided under a Creative Commons Sampling Plus 1.0 License.