Centralized search engines with giant indexes have some significant problems with scaling, resource use and timeliness. Some people propose to replace them with P2P search engines, but these do not handle full-text queries very well (see the SearchTools Peer Search Report). Another approach is to use a meta-search engine, which can send queries to many servers and receive results by pretending to be a human browser (see the SearchTools Meta Search Report).
For situations in which a cooperative group controls several servers, I propose that distributed search protocols provide the best solution to spreading search loads and keeping indexes timely. These would be open standards, agreed on by many search engines, allowing remote querying and results distribution. The disadvantages of such a standard include a perceived loss of proprietary advantages, design by committee resulting in delays in making changes, and a layer of inefficiency as the search engine translates from the standard to its internal formats.
Basic Requirements for Distributed Search
- Standard transport mechanism (HTTP or SOAP)
- Basic interaction: send a query, get a result list
- Basic Query syntax
- Single relevance scoring range (something like 1 to 100)
- Results with meaningful XML tags, so it's easy to merge and present them
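As a rough illustration of why a single relevance range matters, the sketch below rescales each engine's raw scores onto a shared 1 to 100 range before merging. The engine names, raw score ranges, and the `Hit`, `rescale`, and `merge` names are assumptions made up for this example, and linear rescaling is only one possible approach.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    url: str
    score: float   # raw score in the engine's own range
    source: str    # hypothetical engine name


def rescale(score, low, high):
    """Map a raw score from [low, high] onto the shared 1-100 range."""
    if high == low:
        return 100.0
    return 1.0 + 99.0 * (score - low) / (high - low)


def merge(result_lists, ranges):
    """Merge per-engine hit lists into one list sorted by rescaled score.

    result_lists: {engine_name: [Hit, ...]}
    ranges:       {engine_name: (low, high)} raw score range per engine
    """
    best = {}
    for engine, hits in result_lists.items():
        low, high = ranges[engine]
        for hit in hits:
            scaled = Hit(hit.url, rescale(hit.score, low, high), engine)
            # Keep only the best-scoring copy of a duplicate URL.
            if hit.url not in best or scaled.score > best[hit.url].score:
                best[hit.url] = scaled
    return sorted(best.values(), key=lambda h: h.score, reverse=True)


# Made-up sample data: engine_a scores 0-1, engine_b scores 0-1000.
results = {
    "engine_a": [Hit("http://example.org/a", 0.9, "engine_a"),
                 Hit("http://example.org/b", 0.5, "engine_a")],
    "engine_b": [Hit("http://example.org/b", 800, "engine_b"),
                 Hit("http://example.org/c", 200, "engine_b")],
}
ranges = {"engine_a": (0.0, 1.0), "engine_b": (0, 1000)}
merged = merge(results, ranges)
for hit in merged:
    print(f"{hit.score:6.1f}  {hit.url}  ({hit.source})")
```

Without an agreed range, the merging service has to guess each engine's scale, which is the step the `ranges` table stands in for here.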
History of Distributed Search
In the early history of the Web, when bandwidth and disk space were much more expensive, the SOIF and RDM distributed indexing technologies were designed so that local servers could gather and index data and then pass it on to search servers. This allowed indexes to work together and update as needed, rather than forcing each search indexer to crawl each site separately.
Z39.50 is a standard developed for library and other bibliographic databases. It provides a common interface for federated search over a multitude of database formats. It has a standard messaging and wire protocol, which predates HTTP and is much more complex, with a complex session interaction system. Unlike the stateless HTTP protocol, Z39.50 doesn't have tools to deal with unavailable servers, and the system will not return until the slowest server replies. It assumes a "shared content semantic knowledge" oriented around library collections. Although the basic functionality is available (send a query, get a result set, get a record from the result set), implementers found some important elements undefined and proceeded to use their own interpretations, breaking interoperability. In addition to all this, the results were not necessarily readable and often required significant post-processing.
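The slowest-server problem can be avoided in a federated client by querying servers in parallel and returning whatever has arrived by a deadline. The sketch below is a minimal illustration using Python's standard library, not part of Z39.50 or any other protocol discussed here. The `fake_server` stand-ins, the server names, and the 0.4 second deadline are all assumptions for the example.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait


def fake_server(name, delay):
    """Stand-in for a remote search server that answers after `delay` seconds."""
    def search(query):
        time.sleep(delay)
        return [f"{name}: result for {query!r}"]
    return search


def federated_search(query, servers, deadline):
    """Query every server in parallel; return whatever arrived by the deadline.

    Returns (results, late), where `late` names the servers that were skipped.
    """
    pool = ThreadPoolExecutor(max_workers=len(servers))
    futures = {pool.submit(search, query): name
               for name, search in servers.items()}
    done, not_done = wait(futures, timeout=deadline)
    # Do not block on stragglers; their answers are simply dropped.
    pool.shutdown(wait=False)

    results = []
    for future in done:
        if future.exception() is None:   # a failed server is skipped too
            results.extend(future.result())
    late = sorted(futures[f] for f in not_done)
    return sorted(results), late


# Hypothetical servers: two answer quickly, one is slower than the deadline.
servers = {
    "fast-1": fake_server("fast-1", 0.0),
    "fast-2": fake_server("fast-2", 0.05),
    "slow": fake_server("slow", 1.0),
}
results, late = federated_search("z39.50", servers, deadline=0.4)
print(results)
print("skipped:", late)
```

The trade-off is that results from a slow but relevant source are lost, so a real service might report the skipped sources to the user.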
Z39.50 References
For an in-depth explanation, see Z39.50: An Overview of Development and the Future.
United States Library of Congress listing of Z39.50 software
ZNG Initiative: Z39.50 Next Generation - new web service based on Z39.50 and the web technologies HTTP, XML, URI and SOAP/RPC.
Z39.50 Examples
- Library of Congress gateways to many library catalogs
- Global Information Locator Service (GILS)
- self-description of content
- mandated for government sites
- 60 out of tens of thousands complied
- SILO evaluation is mixed
- MUSE Global: both Z39.50 and metasearch
Z39.50 Search Tools
- Amberfish (Z39.50)
- IB 2.0 WebCat (Z39.50)
- Isearch (Z39.50)
- MetaStar from Blue Angel Technologies (Z39.50 etc.)
- MPS Information Server (Z39.50)
- Webcat (Z39.50)
- Zebra (Z39.50)
WAIS
- Accesses multiple database indexes, in WAIS format
- Based on Z39.50
- Automatic or user-defined search targets
- Includes a relevancy ranking in the results it produces.
- Major problems with speed
HARVEST
- proof of concept
- brokers talk to multiple gatherers
- vision of topical servers
- trees: brokers talk to other brokers, which talk to gatherers
- gatherers expect good metadata
- not terribly robust
- designed to scale
- distributing the gatherers (use grid computing), avoid centralized servers
- not portable across OSs: unix executables & perl
- summaries not full text
- desktop gatherers
- browser admin
- most sites used it for centralized search index
- programmers went to Netscape & @Home
1998 multilingual distributed search article
JXTA - Sun's Distributed Search and P2P Protocols
JXTA is Sun's definition for peer-to-peer communications. It includes standard communications protocols for queries and for distributing queries based on source coverage. The sources advertise their contents with metadata, and peers can choose where to send the search query. Search indexes can be centralized, decentralized, or mixed. The search engines use their own internal algorithms for retrieval and relevance, but the search response
See the SearchTools Coverage of JXTA Search for current information.
XQuery Version 1
- W3C recommendation
- Field-oriented query language
- Has a full-text search example
- Pretty close to SQL with for and where clauses
- Sorts by various fields or expressions, but not relevance
- Included in SOAP
"Best Sources" Problem
- Hundreds or thousands of sources
- Can't query them all, performance problems
- Try to get users to choose domains
- Ray Larson, UC Berkeley
- Perform scanning of databases to list contents
- Choose based on the best matches
- Intelliseek
- Analyze query for context
- Find best sources for query type
- Choose best-performing sources based on responsiveness & quality
- Jon Kleinberg - social networks
- Finding experts
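One simple way to picture source selection is to keep a word-frequency summary for each source and send the query only to the sources whose summaries match it best. The sketch below is an illustration under that assumption, not a description of the Berkeley or Intelliseek methods. The source names, sample documents, and the `summarize` and `rank_sources` helpers are hypothetical.

```python
from collections import Counter


def summarize(documents):
    """Build a word-frequency summary of a source's documents."""
    counts = Counter()
    for text in documents:
        counts.update(text.lower().split())
    return counts


def rank_sources(query, summaries, limit=2):
    """Rank sources by the share of their words that match the query terms."""
    terms = query.lower().split()
    scored = []
    for name, counts in summaries.items():
        total = sum(counts.values())
        matches = sum(counts[t] for t in terms)
        if matches:   # sources with no matching words are never queried
            scored.append((matches / total, name))
    scored.sort(reverse=True)
    return [name for _, name in scored[:limit]]


# Made-up sources and contents, for illustration only.
summaries = {
    "medline": summarize(["heart disease treatment", "disease risk factors"]),
    "eprints": summarize(["quantum field theory", "string theory notes"]),
    "news": summarize(["local election results", "heart of the city"]),
}
print(rank_sources("heart disease", summaries))
```

Here only "medline" and "news" would be queried, and "eprints" is skipped because none of its words match.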
Open Archives Metadata Harvesting Protocol
- Standards for efficient dissemination of content.
- Not just for archives!
- Simpler than Z39.50 - based on Harvest protocols and experience with union catalogs
Federated search services
- servers gather metadata via HTTP, normalize, remove dupes, etc.
- user queries service via HTTP
Dublin Core is the common metadata format
- others must be transportable in XML
- example: e-print archives include author affiliation, journal name, etc.
The protocol does not cover authentication, trust, security, use policies, etc. It does not do full-text search automatically (see: Metadata Harvesting and the Open Archives Initiative).
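To make the harvesting step concrete, the sketch below builds a `ListRecords` request and extracts Dublin Core fields from a response. It assumes the OAI-PMH 2.0 request parameters and XML namespaces, which may be newer than the protocol version this page describes. The base URL, the sample record, and the helper names are invented for the example, and no network request is made.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespaces as defined by OAI-PMH 2.0 and Dublin Core (assumed version).
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc", since=None):
    """Build a ListRecords request; `since` limits it to newer records."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if since:
        params["from"] = since
    return base_url + "?" + urlencode(params)


# Hand-written sample response, trimmed to the parts parsed below.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Distributed Search</dc:title>
          <dc:creator>A. Author</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def parse_records(xml_text):
    """Return one dict of Dublin Core fields per harvested record."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.iterfind(".//oai:record", NS):
        records.append({
            "id": record.findtext("oai:header/oai:identifier", namespaces=NS),
            "title": record.findtext(".//dc:title", namespaces=NS),
            "creator": record.findtext(".//dc:creator", namespaces=NS),
        })
    return records


url = list_records_url("http://example.org/oai", since="2002-01-01")
records = parse_records(SAMPLE)
print(url)
print(records)
```

A federated service would run this kind of harvest periodically, then normalize and de-duplicate the records before indexing them.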
New Google APIs
- SOAP, WSDL format
- Java, .NET, Perl, etc.
- Pass Google-style queries including search terms, filters, languages
- Get result as binary containing Google content, sorted by relevance
[
(from Rael Dornfest article)
URL = "http://www.oreillynet.com/~rael/"
Title = "raelity bytes"
Snippet = " ... that's not actually me. "They say
Vorilhon, who calls himself the prophet Rael and
testified before Congress last year in a futuristic
white jumpsuit ..."
Directory Category = {SE="", FVN=""}
Directory Title = ""
Summary = ""
Cached Size = "35k"
Related information present = true
Host Name = ""
]
- Defining a new standard?
Definition of an Ideal Distributed Text Search Protocol
- Agreement on transport mechanism (HTTP is easiest)
- Quick handshake and setup
- Include a lightweight session identification scheme
- Coverage
- Optionally provide a summary of contents: most pertinent keywords
- If a taxonomy or classification scheme is in use, a reference to it
- On request, provide a list of words and frequency (much smaller than a full inverted index)
- Query syntax
- Support Boolean query syntax
- Recognize a phrase qualifier such as ADJ or quote marks
- Support Unicode for multilingual queries
- Reference a standard set of field names, such as Dublin Core
- Allow end-user security identification
- Search Interactions
- Return the total number of matching results, very quickly
- Return the hits for each query term (could be an estimate)
- Option to use a standard relevance algorithm, such as TF-IDF, for consistency
- Results
- Return in XML and Unicode
- Reference a standard DTD for search results
- Return results in relevance order, with a score
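The search and results items above can be sketched together: score documents with a simple TF-IDF variant, then return the hits as XML in relevance order with a score and a total count. This is only one of many TF-IDF formulations, and the element and attribute names (`results`, `hit`, `url`, `score`, `total`) are hypothetical and not taken from any existing DTD.

```python
import math
import xml.etree.ElementTree as ET


def tf_idf_scores(query, documents):
    """Score each document as the sum of tf * idf over the query terms."""
    tokenized = {url: text.lower().split() for url, text in documents.items()}
    n_docs = len(tokenized)
    scores = {}
    for url, words in tokenized.items():
        score = 0.0
        for term in query.lower().split():
            # Document frequency: how many documents contain the term.
            df = sum(1 for w in tokenized.values() if term in w)
            if df == 0 or not words:
                continue
            tf = words.count(term) / len(words)
            idf = math.log(n_docs / df) + 1.0   # +1 keeps common terms nonzero
            score += tf * idf
        if score > 0:
            scores[url] = score
    return scores


def results_xml(query, scores):
    """Serialize hits in relevance order, with the total count up front."""
    root = ET.Element("results", {"query": query, "total": str(len(scores))})
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    for url, score in ranked:
        hit = ET.SubElement(root, "hit", {"score": f"{score:.3f}"})
        ET.SubElement(hit, "url").text = url
    return ET.tostring(root, encoding="unicode")


# Made-up documents for illustration.
documents = {
    "http://example.org/1": "distributed search protocols",
    "http://example.org/2": "search engines and search indexes",
    "http://example.org/3": "library catalogs",
}
scores = tf_idf_scores("distributed search", documents)
xml_text = results_xml("distributed search", scores)
print(xml_text)
```

If every engine applied the same scoring formula, a merging service could compare scores across engines directly rather than rescaling them.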