Home Guide Tools Listing News Background Search About Us

Guide to Search Tools

Report on Distributed Search Systems

Centralized search engines with giant indexes have significant problems with scaling, resource use and timeliness. Some people propose to replace them with P2P search engines, but these do not handle full-text queries very well (see the SearchTools Peer Search Report). Another approach is to use a meta-search engine, which sends queries to many servers and collects the results by posing as a person using a web browser (see the SearchTools Meta Search Report).

For situations in which a cooperative group controls several servers, I propose that distributed search protocols provide the best solution for spreading search loads and keeping indexes timely. These would be open standards, agreed on by many search engines, allowing remote querying and results distribution. The disadvantages of such a standard include a perceived loss of proprietary advantages, design by committee resulting in delays in making changes, and a layer of inefficiency as the search engine translates between the standard and its internal formats.

Basic Requirements for Distributed Search

History of Distributed Search

In the early history of the Web, when bandwidth and disk space were much more expensive, the SOIF and RDM distributed indexing technologies were designed so that local servers could gather and index data and then pass it on to search servers. This allowed indexes to work together and update as needed, rather than forcing each search indexer to crawl each site separately.

Z39.50

Z39.50 is a standard developed for library and other bibliographic databases. It provides a common interface for federated search across a multitude of database formats. It has its own messaging and wire protocol, which predates HTTP and is much more complex, with an elaborate session interaction system. Unlike the stateless HTTP protocol, Z39.50 doesn't have tools to deal with unavailable servers, and a search will not return until the slowest server replies. It assumes a "shared content semantic knowledge" oriented around library collections. The basic functionality is available (send a query, get a result set, get a record from the result set), but implementers found some important elements undefined and proceeded to use their own interpretations, breaking interoperability. In addition, the results are not necessarily readable and often require significant post-processing.
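The slowest-server problem described above can be contrasted with a fan-out that enforces a deadline. The Python sketch below is not Z39.50 code: the servers are simulated functions, and the names `make_server`, `federated_search`, the delays and the sample documents are all hypothetical. It only illustrates one way a federated search client might return whatever arrived in time and report the servers it gave up on.

```python
import concurrent.futures
import time


def make_server(name, delay, documents):
    """Build a hypothetical search server that answers after `delay` seconds."""
    def search(query):
        time.sleep(delay)  # simulated network and processing latency
        return [f"{name}: {doc}" for doc in documents if query in doc]
    return search


def federated_search(servers, query, timeout):
    """Send `query` to every server in parallel and stop waiting after `timeout` seconds.

    Returns (hits, skipped): hits from servers that answered in time, and the
    names of servers that were too slow or raised an error.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=len(servers))
    futures = {pool.submit(search, query): name for name, search in servers.items()}
    done, not_done = concurrent.futures.wait(futures, timeout=timeout)

    hits = []
    skipped = [futures[f] for f in not_done]
    for future in done:
        try:
            hits.extend(future.result())
        except Exception:
            skipped.append(futures[future])  # treat a failed server like a missing one

    pool.shutdown(wait=False)  # do not block on the stragglers
    return sorted(hits), sorted(skipped)


# Hypothetical servers: one answers immediately, one takes a full second.
servers = {
    "fast": make_server("fast", 0.0, ["distributed search", "library catalog"]),
    "slow": make_server("slow", 1.0, ["distributed search protocols"]),
}
hits, skipped = federated_search(servers, "search", timeout=0.3)
print(hits)
print(skipped)
```

With a 0.3 second deadline, the caller gets the fast server's hits and a note that the slow server was skipped, instead of waiting for every server to reply. A real client would also need to decide whether to merge late results afterwards, which this sketch leaves out.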

Z39.50 References

Z39.50 standard

For an in-depth explanation, see Z39.50: An Overview of Development and the Future.

United States Library of Congress listing of Z39.50 software

ZNG Initiative: Z39.50 Next Generation - a new web service based on Z39.50 and the web technologies HTTP, XML, URI and SOAP/RPC.

Z39.50 Made Simple

Z39.50 Examples

Z39.50 Search Tools

WAIS

HARVEST

STARTS

1998 multilingual distributed search article

JXTA - Sun's Distributed Search and P2P Protocols

JXTA is Sun's definition for peer-to-peer communications. It includes standard communications protocols for queries and for distributing queries based on source coverage. The sources advertise their contents with metadata, and peers can choose where to send the search query. Search indexes can be centralized, decentralized, or mixed. The search engines use their own internal algorithms for retrieval and relevance, but the search response

See the SearchTools Coverage of JXTA Search for current information.
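The idea of routing a query by source coverage can be sketched in a few lines of Python. This is not the JXTA Search protocol or its message format. The `Advertisement` class, the `route_query` function and the term-overlap scoring are hypothetical, chosen only to show how a peer might pick destinations from the metadata that sources advertise.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Advertisement:
    """Hypothetical metadata a source publishes about what it covers."""
    source: str
    topics: frozenset


def route_query(query, advertisements):
    """Return source names whose advertised topics overlap the query terms.

    Sources are ordered by how many query terms they cover (most first),
    with ties broken alphabetically. Sources with no overlap are not queried.
    """
    terms = set(query.lower().split())
    scored = []
    for ad in advertisements:
        overlap = len(terms & ad.topics)
        if overlap:
            scored.append((overlap, ad.source))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [source for _, source in scored]


# Hypothetical sources and the topics they advertise.
ads = [
    Advertisement("library", frozenset({"books", "catalog"})),
    Advertisement("news", frozenset({"news", "headlines"})),
    Advertisement("tech", frozenset({"search", "protocols", "books"})),
]
print(route_query("books search", ads))
```

Here the query goes to "tech" first and "library" second, and "news" is never contacted. Each chosen source would then run the query with its own retrieval and relevance algorithms, as described above.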

XQuery Version 1

 

"Best Sources" Problem

Open Archives Metadata Harvesting Protocol

New Google APIs


Definition of an Ideal Distributed Text Search Protocol

Page Created 2002-02-23

SearchTools.com - Copyright © 2001 - 2007 Search Tools Consulting
This work is provided under a Creative Commons Sampling Plus 1.0 License.