Federated search provides a user interface that sends a defined query to several separate search servers, then accepts and displays the structured results. It is often the only way to get information from external sources that can't be crawled, such as a government patent office, LexisNexis, or corporate databases. It also reduces the need to crawl and index rarely used sources, which keeps the index smaller. Because the query is sent to the content server at search time, security is always current: users can only see the content that they have rights to access.
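The fan-out described above can be sketched in a few lines. This is only an illustration: the backend functions and their names (`patents_backend`, `news_backend`) are hypothetical stand-ins for real remote sources, and a failing source is simply treated as returning no hits.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, List

# A backend is any callable that takes a query string and returns a list of hits.
Backend = Callable[[str], List[dict]]


def federated_search(query: str, backends: Dict[str, Backend]) -> Dict[str, List[dict]]:
    """Send the same query to every backend in parallel and collect hits per source."""
    results: Dict[str, List[dict]] = {}
    with ThreadPoolExecutor(max_workers=max(1, len(backends))) as pool:
        futures = {pool.submit(fn, query): name for name, fn in backends.items()}
        for future in as_completed(futures):
            name = futures[future]
            try:
                results[name] = future.result()
            except Exception:
                # One failing source should not break the whole search.
                results[name] = []
    return results


# Hypothetical stand-in backends; real ones would call a remote search API.
def patents_backend(query: str) -> List[dict]:
    return [{"title": f"Patent about {query}", "url": "https://example.org/p/1"}]


def news_backend(query: str) -> List[dict]:
    raise TimeoutError("source too slow")


hits = federated_search("solar cells", {"patents": patents_backend, "news": news_backend})
```

Keeping the results grouped by source leaves the decision about how to present or combine them to the user interface.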
However, federated search administrators need to create and maintain query translators from the user-oriented search to each source's query language, including passing credentials. They depend on the response speed of the servers they are querying, and they have to decide whether to interleave results from multiple servers, which may have wildly varying relevance scoring systems. Some target databases index overlapping content, in which case removing the duplicate results is another task.
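A minimal sketch of three of these chores follows: translating a query, putting scores on a common scale, and removing duplicates. The two query styles, the min-max rescaling, and the use of the URL as the duplicate key are all assumptions chosen for illustration; real sources may need quite different handling.

```python
from typing import Dict, List


def to_source_syntax(terms: List[str], style: str) -> str:
    """Translate user terms into one of two hypothetical source query styles."""
    if style == "plus":
        return " ".join(f"+{t}" for t in terms)
    return " AND ".join(terms)  # default: plain Boolean syntax


def normalize(hits: List[dict]) -> List[dict]:
    """Min-max rescale one source's raw scores to the range 0..1."""
    if not hits:
        return []
    scores = [h["score"] for h in hits]
    low, high = min(scores), max(scores)
    if high == low:
        return [{**h, "score": 1.0} for h in hits]
    return [{**h, "score": (h["score"] - low) / (high - low)} for h in hits]


def merge(per_source: Dict[str, List[dict]]) -> List[dict]:
    """Interleave normalized hits, keeping the best-scoring copy of each URL."""
    best: Dict[str, dict] = {}
    for source, hits in per_source.items():
        for hit in normalize(hits):
            hit = {**hit, "source": source}
            current = best.get(hit["url"])
            if current is None or hit["score"] > current["score"]:
                best[hit["url"]] = hit
    return sorted(best.values(), key=lambda h: h["score"], reverse=True)


# Source "a" scores on a 0..1000 scale, source "b" on 0..1; "u2" appears in both.
merged = merge({
    "a": [{"url": "u1", "score": 900.0}, {"url": "u2", "score": 100.0}],
    "b": [{"url": "u2", "score": 0.9}, {"url": "u3", "score": 0.5}, {"url": "u4", "score": 0.1}],
})
```

Rescaling per source makes the lists comparable in form only. It does not make a top hit from a weak source as good as a top hit from a strong one, which is why interleaving remains a judgment call.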
Alternatives to Federated Search:
Metasearch: uses connector code to send a query as a web browser client and screen-scrapes the results. This is necessary when legacy systems only have an HTTP interface or are too expensive to adjust, or external content providers decline to add an API (see the SearchTools Meta Search Report).
Aggregated Indexing: attempts to index every possible piece of text, using HTTP crawlers for web or intranet pages, file server indexers, and connectors to content management systems, databases, legacy systems and other data silos.
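Of the two alternatives above, the screen-scraping step of metasearch is the easier one to show in miniature. The sketch below uses Python's standard `html.parser` module; the sample page and the convention that result links carry `class="result"` are invented for illustration, since every real results page has its own markup.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple


class ResultLinkParser(HTMLParser):
    """Collect (href, text) pairs from <a class="result"> elements."""

    def __init__(self) -> None:
        super().__init__()
        self.results: List[Tuple[str, str]] = []
        self._href: Optional[str] = None
        self._text: List[str] = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "a" and attr.get("class") == "result":
            self._href = attr.get("href") or ""
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside a result link.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.results.append((self._href, "".join(self._text).strip()))
            self._href = None


# Hypothetical results page, as a legacy system might return it.
page = """<html><body>
<a class="result" href="/doc/1">First <b>hit</b></a>
<a href="/about">About</a>
<a class="result" href="/doc/2">Second hit</a>
</body></html>"""

parser = ResultLinkParser()
parser.feed(page)
parser.close()
scraped = parser.results
```

A scraper like this breaks whenever the target site changes its markup, which is part of the maintenance cost of metasearch.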
Many enterprise search installations will use several approaches at once, especially at ever-larger scale. In all cases, security and access control are a big problem, and there are no easy answers.
The Z39.50 standard was developed for library catalogs and adapted for some databases. It was awkward and stateful, so the system would not return until the slowest server replied.
The SRW/U standard (Search/Retrieve Web Service and Search/Retrieve via URL) was created around 2005 to replace Z39.50 as a library-oriented query protocol.
OpenSearch is a protocol developed by A9, an Amazon subsidiary. It is mostly used for choosing which search engine to use in browsers.
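Both SRU and OpenSearch express a search as an ordinary HTTP GET URL, which a short sketch can make concrete. The SRU parameter names (`operation`, `version`, `query`, `startRecord`, `maximumRecords`) and the OpenSearch `{searchTerms}` and `{startPage?}` template placeholders come from the respective specifications. The host names and the helper function names are made up for this example.

```python
import re
from urllib.parse import quote, urlencode


def sru_search_url(base_url: str, cql_query: str, start_record: int = 1,
                   maximum_records: int = 10, version: str = "1.1") -> str:
    """Build an SRU searchRetrieve URL carrying a CQL query."""
    params = {
        "operation": "searchRetrieve",
        "version": version,
        "query": cql_query,
        "startRecord": start_record,
        "maximumRecords": maximum_records,
    }
    return base_url + "?" + urlencode(params)


def fill_opensearch_template(template: str, values: dict) -> str:
    """Substitute {name} and optional {name?} placeholders in a URL template."""
    def replace(match: "re.Match") -> str:
        name, optional = match.group(1), match.group(2) == "?"
        if name in values:
            return quote(str(values[name]), safe="")
        if optional:
            return ""  # optional parameters may be left empty
        raise KeyError(f"missing required parameter: {name}")

    return re.sub(r"\{([A-Za-z:]+)(\??)\}", replace, template)


# Hypothetical endpoints, for illustration only.
sru_url = sru_search_url("https://sru.example.org/catalog",
                         'dc.title = "federated search"')
opensearch_url = fill_opensearch_template(
    "https://search.example.org/?q={searchTerms}&page={startPage?}",
    {"searchTerms": "federated search"},
)
```

In practice the OpenSearch template would be read from a site's OpenSearch description document rather than hard-coded.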
Avi's Definition of an Ideal Federated Text Search Protocol
- Agreement on transport mechanism (HTTP is easiest)
- Quick handshake and setup
- Include a lightweight session identification scheme
- Coverage
  - Optionally provide a summary of contents and most pertinent keywords
  - If a taxonomy or classification scheme is in use, a reference to it
  - On request, provide a list of words and frequency (much smaller than a full inverted index)
- Query syntax
  - Support Boolean query syntax
  - Recognize a phrase qualifier such as ADJ and quote marks for phrases
  - Support Unicode
  - Reference a standard set of field names, such as Dublin Core
  - Recognize end-user access credentials
  - Option to use a standard relevance algorithm, such as TF-IDF, for consistency
- Search Interactions
  - Return the total number of matching results for the whole query
  - Return the hits for each query term (could be an estimate)
- Results
  - Return in XML and Unicode
  - Reference a standard DTD for search results
  - Return results in relevance order, with a score from 1 to 100
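To make the results requirements concrete, here is one possible shape for such a response, built with Python's standard `xml.etree.ElementTree`. No standard DTD is assumed: the element names `searchResponse`, `totalResults`, `termHits`, and `result` are invented for this sketch, while `title` and `identifier` are borrowed from Dublin Core. The linear mapping of raw scores onto 1 to 100 is likewise just one simple choice.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List


def scale_scores(raw_scores: List[float]) -> List[int]:
    """Map raw relevance scores linearly onto integers from 1 to 100."""
    low, high = min(raw_scores), max(raw_scores)
    if high == low:
        return [100 for _ in raw_scores]
    return [round(1 + 99 * (s - low) / (high - low)) for s in raw_scores]


def build_response(total: int, term_hits: Dict[str, int], hits: List[dict]) -> str:
    """Serialize a result set as XML, in relevance order with scaled scores."""
    root = ET.Element("searchResponse")  # hypothetical element names throughout
    ET.SubElement(root, "totalResults").text = str(total)

    # Per-term hit counts, flagged as estimates.
    terms = ET.SubElement(root, "termHits")
    for term, count in term_hits.items():
        ET.SubElement(terms, "term", name=term, estimate="true").text = str(count)

    results = ET.SubElement(root, "results")
    ranked = sorted(hits, key=lambda h: h["raw_score"], reverse=True)
    scores = scale_scores([h["raw_score"] for h in ranked]) if ranked else []
    for hit, score in zip(ranked, scores):
        item = ET.SubElement(results, "result", score=str(score))
        ET.SubElement(item, "title").text = hit["title"]            # Dublin Core name
        ET.SubElement(item, "identifier").text = hit["identifier"]  # Dublin Core name
    return ET.tostring(root, encoding="unicode")


response_xml = build_response(
    total=3,
    term_hits={"federated": 120, "search": 4500},
    hits=[
        {"title": "Intro", "identifier": "doc-2", "raw_score": 2.0},
        {"title": "Protocols", "identifier": "doc-1", "raw_score": 8.0},
        {"title": "Überblick", "identifier": "doc-3", "raw_score": 4.0},
    ],
)
```

Scaling within a single result set keeps scores readable, but it does not by itself make scores from different servers comparable. That is where the optional shared relevance algorithm in the list above would help.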