see also: the Federated Search report
In the early history of the Web, when bandwidth and disk space were much more expensive, the SOIF and RDM federated indexing technologies were designed so that local servers could gather and index data and then pass it on to search servers. This allowed indexes to work together and update as needed, rather than forcing each search indexer to crawl each site separately.
Z39.50
Z39.50 is a standard developed for library and other bibliographic databases. It provides a common interface for federated search across a multitude of database formats. It has a standard messaging and wire protocol, which predates HTTP and is much more complex, with an elaborate session interaction system. Unlike the stateless HTTP protocol, Z39.50 has no tools for dealing with unavailable servers, and a search does not return until the slowest server replies. It assumes "shared content semantic knowledge" oriented around library collections. Although the basic functionality is available (send a query, get a result set, get a record from the result set), implementers found some important elements undefined and proceeded to use their own interpretations, breaking interoperability. In addition, the results are not necessarily readable and often require significant post-processing.
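The slowest-server problem can be illustrated with a small sketch. The Python below is not Z39.50 code. It simulates two hypothetical catalogs (`fast_catalog` and `slow_catalog`) with artificial delays, and shows one way a federated client might cap the wait with a deadline instead of blocking on the slowest reply. All names, delays, and the deadline value are assumptions made for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait


def make_source(name, delay, hits):
    """Build a stand-in search source that answers after `delay` seconds."""
    def search(query):
        time.sleep(delay)  # simulated network and server latency
        return [f"{name}: {hit}" for hit in hits if query in hit]
    return search


def federated_search(sources, query, deadline):
    """Query every source in parallel; keep only answers that arrive in time."""
    pool = ThreadPoolExecutor(max_workers=len(sources))
    futures = {pool.submit(search, query): name
               for name, search in sources.items()}
    done, not_done = wait(futures, timeout=deadline)
    pool.shutdown(wait=False)  # do not block on the stragglers
    results = []
    for future in done:
        results.extend(future.result())
    timed_out = sorted(futures[f] for f in not_done)
    return sorted(results), timed_out


# Hypothetical sources: one answers quickly, one is slower than the deadline.
SOURCES = {
    "fast_catalog": make_source("fast_catalog", 0.01,
                                ["library science", "science fiction"]),
    "slow_catalog": make_source("slow_catalog", 0.5, ["science history"]),
}

fed_results, fed_timed_out = federated_search(SOURCES, "science", deadline=0.2)
print(fed_results)
print(fed_timed_out)
```

With these assumed timings, the slow catalog is reported as timed out and the user still gets the fast catalog's hits. The trade-off is that a late source's results are lost unless the client merges them in afterwards.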
Z39.50 References
For an in-depth explanation, see Z39.50: An Overview of Development and the Future.
United States Library of Congress listing of Z39.50 software
ZNG Initiative: Z39.50 Next Generation - new web service based on Z39.50 and the web technologies HTTP, XML, URI and SOAP/RPC.
Z39.50 Examples
- Library of Congress gateways to many library catalogs
- Global Information Locator Service (GILS)
- self-description of content
- mandated for government sites
- 60 out of tens of thousands complied
- SILO evaluation is mixed
- MUSE Global: both Z39.50 & metasearch
Z39.50 Search Tools
- Amberfish (Z39.50)
- IB 2.0 WebCat (Z39.50)
- Isearch (Z39.50)
- MetaStar from Blue Angel Technologies (Z39.50 etc.)
- MPS Information Server (Z39.50)
- Webcat (Z39.50)
- Zebra (Z39.50)
WAIS
- Accesses multiple database indexes, in WAIS format
- Based on Z39.50
- Automatic or user-defined search targets
- Includes a relevancy ranking in the results it produces.
- Major problems with speed
HARVEST
- proof of concept
- brokers talk to multiple gatherers
- vision of topical servers
- trees: brokers talk to other brokers, which talk to gatherers
- gatherers expect good metadata
- not terribly robust
- designed to scale
- distributing the gatherers (use grid computing), avoid centralized servers
- not portable across OSs: unix executables & perl
- summaries not full text
- desktop gatherers
- browser admin
- most sites used it for centralized search index
- programmers went on to Netscape & @Home
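The broker and gatherer tree described in these notes can be sketched roughly as follows. This is not Harvest code or its actual protocol. The `Gatherer` and `Broker` classes, the five-word summaries, and the example URLs are all hypothetical, chosen only to show gatherers producing summaries rather than full text, and brokers pulling from both gatherers and other brokers.

```python
from dataclasses import dataclass, field


@dataclass
class Gatherer:
    """Runs next to the content and emits short summaries, not full text."""
    site: str
    documents: dict  # url -> full text

    def summaries(self, max_words=5):
        # Keep only the first few words of each document as its summary.
        return {url: " ".join(text.split()[:max_words])
                for url, text in self.documents.items()}


@dataclass
class Broker:
    """Indexes summaries pulled from gatherers and from other brokers."""
    name: str
    gatherers: list = field(default_factory=list)
    brokers: list = field(default_factory=list)

    def collect(self):
        index = {}
        for gatherer in self.gatherers:
            index.update(gatherer.summaries())
        for broker in self.brokers:  # brokers can talk to other brokers
            index.update(broker.collect())
        return index

    def search(self, term):
        term = term.lower()
        return sorted(url for url, summary in self.collect().items()
                      if term in summary.lower())


physics = Gatherer("physics.example", {
    "http://physics.example/a":
        "Quantum search algorithms explained for beginners in detail"})
library = Gatherer("library.example", {
    "http://library.example/b":
        "Search tools for library catalogs and more"})

topical_broker = Broker("physics-broker", gatherers=[physics])
root_broker = Broker("root-broker", gatherers=[library],
                     brokers=[topical_broker])

print(root_broker.search("search"))
print(root_broker.search("beginners"))
```

The second query returns nothing because "beginners" falls outside the five-word summary, which is the cost of indexing summaries instead of full text.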
1998 multilingual federated search article
JXTA - Sun's Federated Search and P2P Protocols
JXTA is Sun's definition for peer-to-peer communications. It includes standard communications protocols for queries and for distributing queries based on source coverage. The sources advertise their contents with metadata, and peers can choose where to send the search query. Search indexes can be centralized, decentralized, or mixed. The search engines use their own internal algorithms for retrieval and relevance, but the search response
See the SearchTools Coverage of JXTA Search for current information.
XQuery Version 1
- W3C recommendation
- Field-oriented query language
- Has a full-text search example
- Pretty close to SQL with for and where clauses
- Sorts by various fields or expressions, but not relevance
- Included in SOAP
"Best Sources" Problem
- Hundreds or thousands of sources
- Can't query them all, performance problems
- Try to get users to choose domains
- Ray Larson, UC Berkeley
- Perform scanning of databases to list contents
- Choose based on the best matches
- Intelliseek
- Analyze query for context
- Find best sources for query type
- Choose best-performing sources based on responsiveness & quality
- Jon Kleinberg - social networks
- Finding experts
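One simple reading of "scan the databases, then choose based on the best matches" is sketched below. It is not the method of any researcher or vendor named above. The term-count summaries, the scoring rule, and the source names (`medline`, `arxiv`, `cookbook`) are assumptions for illustration only.

```python
from collections import Counter


def summarize(documents):
    """Scan a source's documents and count terms (a crude content listing)."""
    counts = Counter()
    for text in documents:
        counts.update(text.lower().split())
    return counts


def choose_sources(summaries, query, k=2):
    """Rank sources by how often the query terms occur in their summaries."""
    terms = query.lower().split()
    scores = {name: sum(counts[t] for t in terms)
              for name, counts in summaries.items()}
    # Drop sources with no matching terms; break ties alphabetically.
    ranked = sorted((name for name, score in scores.items() if score > 0),
                    key=lambda name: (-scores[name], name))
    return ranked[:k]


source_summaries = {
    "medline": summarize(["heart disease treatment",
                          "heart surgery outcomes"]),
    "arxiv": summarize(["quantum computing", "graph search algorithms"]),
    "cookbook": summarize(["heart healthy recipes"]),
}

chosen = choose_sources(source_summaries, "heart disease")
print(chosen)
```

Only the chosen sources would then receive the live query, which avoids the cost of querying hundreds of sources. A fuller system might also weigh responsiveness and result quality, as the notes mention.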
Open Archives Metadata Harvesting Protocol
- Standards for efficient dissemination of content.
- Not just for archives!
- Simpler than Z39.50 - based on Harvest protocols and experience with union catalogs
Federated search services
- servers gather metadata via HTTP, normalize, remove dupes, etc.
- user queries service via HTTP
Dublin Core is the common metadata format
- others must be transportable in XML
- example: e-print archives include author affiliation, journal name, etc.
Does not cover authentication, trust, security, use policies, etc. Does not do fulltext search automatically (see: Metadata Harvesting and the Open Archives Initiative)
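The gather, normalize, and de-duplicate steps can be sketched as below. To stay runnable offline, the example parses a hand-written sample response instead of fetching one over HTTP. The sample records and archive names are invented, the namespaces assume version 2.0 of the protocol, and the duplicate rule (same title and creator, ignoring case and extra whitespace) is an assumption rather than anything the protocol prescribes.

```python
import xml.etree.ElementTree as ET

# Namespaces as used by OAI-PMH 2.0 and unqualified Dublin Core (assumed).
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

DC_OPEN = ('<oai_dc:dc '
           'xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" '
           'xmlns:dc="http://purl.org/dc/elements/1.1/">')


def sample_record(identifier, title, creator):
    """Build one invented record for the sample response."""
    return (f"<record><header><identifier>{identifier}</identifier></header>"
            f"<metadata>{DC_OPEN}<dc:title>{title}</dc:title>"
            f"<dc:creator>{creator}</dc:creator></oai_dc:dc></metadata>"
            f"</record>")


SAMPLE_RESPONSE = (
    '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"><ListRecords>'
    + sample_record("oai:archive-a.example:1",
                    "Federated  Search Basics", "A. Author")
    + sample_record("oai:archive-b.example:9",
                    "federated search basics", "A. Author")
    + sample_record("oai:archive-a.example:2",
                    "Metadata Harvesting", "B. Writer")
    + "</ListRecords></OAI-PMH>"
)


def parse_records(xml_text):
    """Extract identifier, title, and creator from each harvested record."""
    root = ET.fromstring(xml_text)
    return [{
        "id": rec.findtext("oai:header/oai:identifier", default="",
                           namespaces=NS),
        "title": rec.findtext(".//dc:title", default="", namespaces=NS),
        "creator": rec.findtext(".//dc:creator", default="", namespaces=NS),
    } for rec in root.iterfind(".//oai:record", NS)]


def normalize(record):
    """Collapse runs of whitespace in the title."""
    clean = dict(record)
    clean["title"] = " ".join(record["title"].split())
    return clean


def dedupe(records):
    """Keep the first record seen for each (title, creator) pair."""
    seen, unique = set(), []
    for rec in records:
        key = (rec["title"].lower(), rec["creator"].lower())
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


harvested = dedupe([normalize(r) for r in parse_records(SAMPLE_RESPONSE)])
print([r["id"] for r in harvested])
```

A real harvester would fetch the response over HTTP and handle paging and errors, none of which is shown here.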
New Google APIs
- SOAP, WSDL format
- Java, .NET, Perl, etc.
- Pass Google-style queries including search terms, filters, languages
- Get result as binary containing Google content, sorted by relevance
[
(from Rael Dornfest article)
URL = "http://www.oreillynet.com/~rael/"
Title = "raelity bytes"
Snippet = " ... that's not actually me. "They say
Vorilhon, who calls himself the prophet Rael and
testified before Congress last year in a futuristic
white jumpsuit ..."
Directory Category = {SE="", FVN=""}
Directory Title = ""
Summary = ""
Cached Size = "35k"
Related information present = true
Host Name = ""
]
- Defining a new standard?
- Semantic Caching of Web Queries for distributed search
- Distributing Query Processing
- Query Routing for Web Search Engines: Architecture and Experiments, 9th International World Wide Web Conference, Atsushi Sugiura and Oren Etzioni
Attempting to solve the 'best source' problem, this paper describes an automated query-routing system to locate the best specialized search engine for any particular query.