As of January, 2012, this site is no longer being updated, due to work and health issues

Search Terms Glossary


See also the Markup and Formatting Languages Glossary. Many terms added in September, 1998. For additional terms, we recommend checking the Glossary for Information Retrieval, the Modern Information Retrieval (book) Glossary, and the Free Online Dictionary of Computing.


Adjacent Searching: see Proximity

Begins-With Partial Word Matching

Some search engines will match indexed words that contain a search term at the beginning -- this is a form of partial-word matching. For example, if you're searching for "rose", the following rules apply:
rose is an exact match 
roses is a begins-with partial word match 
roseola is a begins-with partial word match
arose is a partial word match but not a begins-with match

Bibliometric Analysis: see Link Tracking

Boolean Search

A form of logical comparison first described by George Boole (hence the name).

Boolean operators let you define whether multiple search terms are matched within a text block (usually a web page). A Boolean expression is constructed by joining terms together with the three special operators: AND, OR, and NOT.

Some search forms allow searchers to indicate that all the terms in a query must be matched (AND) or that a match on any one of them is enough (OR). Others match pages on any of the terms (OR) and require the searcher to add a plus sign (+) to indicate that a term is required. Still others allow the Boolean terms themselves in the query text.

The AND operator is the most confusing, because it does not match people's expectations. If you say "I like to eat sushi and chocolate", you are saying that you like both foods, but not necessarily at the same time. This is more like the OR operator. The AND operator includes an implied "together". 
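The difference between AND, OR, and NOT can be sketched with Python sets. This is only an illustration: the page names, the word sets, and the helper functions are all made up for the example, and a real search engine would consult its index rather than a small dictionary.

```python
# Hypothetical mini-index: each page name maps to the set of words it contains.
pages = {
    "menu.html": {"sushi", "chocolate", "tea"},
    "fish.html": {"sushi", "rice"},
    "dessert.html": {"chocolate", "cake"},
}

def search_and(terms):
    """Pages containing every term (the implied "together")."""
    return {name for name, words in pages.items() if set(terms) <= words}

def search_or(terms):
    """Pages containing at least one of the terms."""
    return {name for name, words in pages.items() if set(terms) & words}

def search_not(matches, term):
    """Drop pages containing the excluded term from an existing match set."""
    return {name for name in matches if term not in pages[name]}

both = search_and(["sushi", "chocolate"])    # only menu.html has both words
either = search_or(["sushi", "chocolate"])   # all three pages have at least one
no_rice = search_not(either, "rice")         # fish.html is removed
```

Here AND returns the single page that mentions both foods together, while OR returns every page that mentions either one.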

Categorization: see Classification

Click Tracking

Tracking user clicks on results lists allows search engines to identify the most attractive items and move them up in the rankings. It is only appropriate for very large search engines, since it relies on aggregates of user behavior.
 
DirectHit provides this service, tracking clicks in two ways:

Classification

The process of organizing pieces of information into topical categories, like the Yahoo listings. Usually, these are hierarchical trees, with the most general topics at the top and the most specific at the bottom. A department store might have "Products - Shoes - Women - Cross-Trainers", while a gardening site might have a category "Plants - Flowers - California Natives - Poppies". In either case, a searcher can understand more about the content of the page when they know the category. Some classification products will attempt to classify data automatically, while others assist human catalogers.

Date Range

Some search engines provide an option to search for documents modified on a specific date, before a date, after a date, or between two dates. While there are many interfaces to this feature, the most successful ones have included a popup menu with easy-to-understand date options.

Directories

Lists of pages classified into useful categories (like Yahoo or Looksmart).

Exact Match

Some search engines will only match query terms to document words exactly: they will not allow partial-word matches, fuzzy matching or stemming.

Faceted Search

Some search engines allow searchers to combine terms with Boolean operators and parentheses to make extremely powerful faceted searches, for example:

( (red OR vermilion OR scarlet OR garnet) AND (rose OR rosa OR rosebush) AND (bareroot OR bare root) ) NOT hybrid

should return only information about non-hybrid bareroot red-colored roses.
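One way to picture the faceted query above: each parenthesized group is a list of alternatives, every group must be satisfied, and any excluded word rejects the page. The sketch below assumes simple substring matching on lowercased text, which is a simplification (for example, "red" would also match inside "hundred"); the function and variable names are hypothetical.

```python
# Each facet is a list of alternatives (OR); all facets must match (AND);
# any excluded word rejects the page (NOT).
facets = [
    ["red", "vermilion", "scarlet", "garnet"],
    ["rose", "rosa", "rosebush"],
    ["bareroot", "bare root"],
]
excluded = ["hybrid"]

def matches(text):
    """True if the text satisfies every facet and contains no excluded word."""
    text = text.lower()
    in_every_facet = all(any(alt in text for alt in facet) for facet in facets)
    return in_every_facet and not any(word in text for word in excluded)

hit = matches("Scarlet bare root rosebush for sale")
rejected = matches("Red hybrid tea rose, bareroot")
missing_facet = matches("Red rose in a pot")
```

The first text satisfies all three facets, the second is rejected by the NOT term, and the third fails because nothing in it matches the bareroot facet.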

False Drop

A page returned as a matching result that isn't actually relevant to the search. In most cases, this is because words can be used in very different ways. For example, the word argument means "heated debate" in general usage, but in programming, it means "value passed to a subroutine or other application".

Occasionally, a site index or database gets corrupted and loses track of the search terms and their associated pages. In those cases, you should remove the index files and re-index the site from scratch.

The story behind this term: many years ago, librarians were trying to index book records without computers. One scheme involved cards with holes punched or filled on the edges of the cards, indicating the appropriate keywords. To search, a librarian ran a set of thin rods, corresponding to the desired search terms, gently through the holes in a stack of cards. Cards which were not punched in a particular position did not let the rods go through. Then the searcher picked up the stack of cards by the rods. The cards with holes in those positions rose up, leaving the others below. The items which technically matched but were not actually relevant became known as false drops.

Fuzzy Matching

Exact matching is very strict: either a word matches or it doesn't. Fuzzy matching is an attempt to improve search recall by matching more than the exact word: fuzzy matching techniques try to reduce words to their core and then match all forms of the word.

Some algorithms for fuzzy matching use the understanding that the beginning and end of English words are more likely to change than the center, so they count matching letters and give more weight to words with the matching letters in the center than at the edges. Unfortunately, this can sometimes bring up results that make very little sense in meaning, but simply might match some of the letters (a search for tivoli might bring up laxative).
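As a rough illustration of letter-based fuzzy matching, the sketch below uses `difflib.SequenceMatcher` from the Python standard library. This is a swapped-in technique, not the center-weighted algorithm described above: it scores words by the runs of letters they share, and the 0.8 threshold is an arbitrary assumption for the example.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Return a score from 0 to 1 based on shared runs of letters."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def fuzzy_matches(term, words, threshold=0.8):
    """Indexed words whose similarity to the term meets the threshold."""
    return [w for w in words if similarity(term, w) >= threshold]

words = ["rose", "roses", "rosy", "laxative"]
close = fuzzy_matches("rose", words)

# Words with unrelated meanings can still share letters ("tiv" here),
# so their score is above zero even though the match makes no sense.
odd_pair = similarity("tivoli", "laxative")
```

Lowering the threshold improves recall but lets in more pairs like tivoli and laxative, which is the tradeoff the paragraph above describes.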

Index File

Created by the Search Indexer program, this file stores the data from your site in a special index or database, designed for very quick access. Depending on the indexing algorithm and size of your site, this file can become very large. The file must be updated often, or it will become unsynchronized with the pages and provide obsolete results.

Intranet

An internal institutional computer network using standard protocols, rather than proprietary or special applications. This allows employees to use standard Web browsers to locate information and HTML editors to publish their data. Many Intranet portals implement search engines which index multiple internal servers to provide access to all available information.

Link Tracking

A form of indexing which tracks the links (aka "citations") to a document. So you could get all the pages which link to a particular page, or sort the results by the number of links. There have been printed citation indexes around for years: the Science, Social Science, and Arts and Humanities Citation Indexes are produced by the Institute for Scientific Information and are in many large libraries, but are not freely searchable online. Lycos, Google and other search engines do this already.

Legacy Data

Information in older file formats is known as legacy data. This can refer to very old files which are accessible only through special reader programs, or any non-web-native formats such as word processing, spreadsheets or graphics.

Match

When an indexed page contains the same text as a search term, it matches that term. Some search engines allow partial-word and begins-with matches as well as exact matches, while others extend the search terms by using fuzzy matching and stemming.

When a page fits all the requirements of a query, it matches that query. So a page could match one or more of the search terms, but still not match the query as a whole.

More Like This

Some search engines will allow searchers to refine searches and locate additional documents that apply to their question by letting them identify documents and then locating similar documents. This allows the searcher to control the direction of the search and focus on the most fruitful lines of inquiry. Using a whole document as a query allows the engine to use sophisticated matching techniques which are inappropriate for very short queries.

Multi-Site Search

Some search engines will index and search pages on more than one server. This may be on a different host machine (www.example.com vs. products.example.com), or a different domain altogether (www.domain.edu). There may be one index for all sites, or separate indexes which are searched at the same time and the results collated.

Natural Language Processing (NLP)

To avoid forcing searchers to memorize Boolean or other query languages, some systems allow them to type in a question, and use that as the query. The simplest processing just removes stopwords and uses a vector search or other statistical approach. Some sophisticated systems try to extract concepts using linguistic analysis, and match those against concepts extracted by the indexer. Others try to categorize the form of the question and use it to define the query, so "who is" questions are not treated the same as "how many" or "why": a good example of this approach is the AskJeeves system. For more information, see the Information Retrieval page.

Near: see Proximity Searching

Ontology

Used in Information Retrieval and Artificial Intelligence, an ontology defines concepts, providing a way to move towards consistency in vocabulary. It provides a working model of the entities and interactions in a particular topic, such as dentistry or anthropology.

"In philosophy, an ontology is a theory about the nature of existence, of what types of things exist; ontology as a discipline studies such theories. Artificial-intelligence and Web researchers have co-opted the term for their own jargon, and for them an ontology is a document or file that formally defines the relations among terms. The most typical kind of ontology for the Web has a taxonomy and a set of inference rules." - Tim Berners-Lee, The Semantic Web, Scientific American May 2001.

See also Classification and Taxonomy and the whatis.com entry.

Portal

A Web site that provides multiple services with the goal of becoming the main site for a wide variety of users. Most of the large public portals (Excite, Lycos, Yahoo) grew out of large public search engines, and provide multi-site search services as well as organized information links, email, news, and other features.

Intranet portals, sometimes called Enterprise Information Portals, provide similar services for institutions.

Vertical portals, or vortals, are more specialized, covering a specific topic such as an industry or hobby.

Query

The query is the combination of the word or words used for searching, and any options allowed by the search engine. For example:
term="rose AND red NOT hybrid" limit="whole words" date="since -30 days"
is a complex query telling a search engine the words to look for, their relationship, that they cannot be partial words, and that the page must be dated within the last 30 days.
 
Most queries on Web search engines are very simple: one to three words. Only about 3% of the searches ever use any of the advanced features such as Boolean searching or date ranges. Before the Web, information retrieval was oriented towards librarians and other professionals who used these features extensively. Therefore, many older search engines provide sophisticated control and excellent recall, while newer systems tend to improve results rankings by adding features such as link tracking and more like this.

Parentheses

Some search engines let searchers group their words, especially in Boolean searches, using parentheses "()". See Faceted Search for examples.

Partial Word Matching

Some search engines will only match exact text, others will match the beginnings of words, while still others will match the search terms anywhere within an indexed word: these are partial words. For example, if you're searching for "rose":
rose is an exact match 
roses is a begins-with partial word match 
roseola is a begins-with partial word match
arose is a partial word match but not a begins-with match
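The rules above can be expressed as a short function. This is a minimal sketch: the function name and the labels it returns are hypothetical, and it assumes case-insensitive comparison.

```python
def classify_match(term, word):
    """Label how an indexed word matches a search term."""
    term, word = term.lower(), word.lower()
    if word == term:
        return "exact"
    if word.startswith(term):
        return "begins-with"
    if term in word:
        return "partial"
    return "none"

labels = {w: classify_match("rose", w)
          for w in ["rose", "roses", "roseola", "arose", "tulip"]}
```

An engine that supports only exact matching would keep just the "exact" label, while one with begins-with matching would also accept "begins-with".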
 

Phrase Searching

Some search engines provide an option to search a set of words as a phrase, either by typing in quote marks (""), by using a command or clicking a button. When they receive this kind of search, the engines will generally locate all words that match the search terms, then discard those which are not next to each other in the correct order. To perform this task, the index must store the position of the word in the document, so the search engine can tell where the words are located.
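To show why word positions matter, here is a minimal sketch of a positional index and a phrase lookup. It assumes whitespace tokenization and lowercase comparison; the function names and sample documents are invented for the example and do not describe any particular engine.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map each word to {doc_id: [positions where the word occurs]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for position, word in enumerate(text.lower().split()):
            index[word][doc_id].append(position)
    return index

def phrase_search(index, phrase):
    """Return ids of documents containing the words adjacent and in order."""
    words = phrase.lower().split()
    if not words:
        return set()
    results = set()
    # Start from the documents that contain the first word.
    for doc_id, starts in index.get(words[0], {}).items():
        for start in starts:
            # Every following word must sit at the next position.
            if all(start + offset in index.get(word, {}).get(doc_id, [])
                   for offset, word in enumerate(words[1:], start=1)):
                results.add(doc_id)
                break
    return results

docs = {
    "a": "the red rose garden",
    "b": "rose red is a color",
    "c": "a red and white rose",
}
index = build_positional_index(docs)
found = phrase_search(index, "red rose")
```

All three documents contain both words, but only document "a" has them next to each other in the right order.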

Proximity Searching

An extension to Boolean searching, this technique checks the position of terms and only matches those within the specified distance. It's a good way to cut down the irrelevant matches and get better results. Some search engines let you define the order, in addition to the distance. For example: rose w10 heritage might mean "heritage within the 10 words following the word rose". Search engines may also allow phrase searching, where the words must match the query term in the exact order.
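A proximity check like the rose w10 heritage example might be sketched as follows. The interpretation of the operator (second word within N words after the first) follows the reading suggested above; the function name and the whitespace tokenization are assumptions.

```python
def within(text, first, second, distance):
    """True if `second` appears within `distance` words after `first`."""
    words = text.lower().split()
    first_positions = [i for i, w in enumerate(words) if w == first]
    second_positions = [i for i, w in enumerate(words) if w == second]
    return any(0 < s - f <= distance
               for f in first_positions for s in second_positions)

text = "this rose is a classic heritage variety"
near = within(text, "rose", "heritage", 10)        # heritage is 4 words after rose
too_far = within(text, "rose", "heritage", 3)      # 4 words exceeds a limit of 3
wrong_order = within("heritage rose", "rose", "heritage", 10)
```

Because this version enforces order, a page that says "heritage rose" does not match; an engine that only checks distance would accept it.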

Results List

The web pages which match a query. The search engine creates the list by checking its index and saving the matching entries. Then it sorts them, usually according to its relevance algorithm, and generates an HTML page to display the results.

The simplest results listing is just the page title and a URL link to the page. More helpful results include the meta description data (if available) or the first few lines on the page, the date modified, file size, and a relevance ranking.

Recall and Precision (as used in Information Retrieval)

Recall describes the idea of all items which are relevant (useful) to a query. In the real world, only a subset of the relevant items are found.
Precision describes the idea of only those items which are relevant to a query. Again, in the real world, many items which are matched by a query are not really relevant to the question, although they might match the vocabulary.

In information retrieval, there's a classic tension between recall and precision. If you aim for more recall (trying to find all the relevant items), you often get a lot of junk. If you limit your search, trying to find only precisely relevant items, you can miss important items because they don't use quite the same vocabulary.

For example, a person might search for web site search. They are looking for web pages about web site search engines, such as this site. Even if the webwide search engine finds only the pages with all of those words, it may well find many pages for site promotion, lists of search engines, academic IS departments, and the African National Congress Home Page (tested with HotBot, May 28, 1998).

If you think of the exact match for a query as a fraction, the perfect result of a query would be a list of 1.0 matches: all items would be relevant and all relevant items would be found. But that's not how it really works. For example, if you have a site about your home town with 100 pages, and 10 of them are about land use, a search for city planning might retrieve 4 pages about zoning (which are relevant) and 2 about upcoming events (which are not relevant). The recall fraction is 0.40 (4 out of 10 truly relevant pages are found) and the precision is 0.67 (4 out of 6 retrieved pages are relevant).
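The arithmetic in the home town example can be checked with a few lines of Python. The function and parameter names are hypothetical; the numbers come from the example above, and the precision of 4/6 is about 0.67 when rounded to two places.

```python
def recall_precision(relevant_total, retrieved_relevant, retrieved_total):
    """Return (recall, precision) for a single query."""
    recall = retrieved_relevant / relevant_total      # share of relevant pages found
    precision = retrieved_relevant / retrieved_total  # share of results that are relevant
    return recall, precision

# 10 relevant pages exist; the search returned 6 pages, 4 of them relevant.
recall, precision = recall_precision(10, 4, 6)
print(f"recall={recall:.2f} precision={precision:.2f}")  # recall=0.40 precision=0.67
```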

Relevance and Relevance Ranking

Relevance is the measure of how well the indexed page answers the question. Only the searcher can actually define this: there is no way to automate it.
When there are a large number of matches for a query, the search engines must rank the results by relevance score, sorting the results listing so that the pages most likely to be useful will appear first. They have widely varying algorithms to define relevance, some more productive than others.
 
 

Unfortunately, most Web searches include just one to three search terms, and only about 3% of them use any of the advanced searching features such as parentheses or date ranges. Search engines must concentrate on finding all the relevant documents and then ranking them so that the most useful pages appear first. Techniques such as thoughtful weighting, link tracking and click tracking provide a great deal of help in this area.

Score

Many search engines use a number to track the relevance of the document to the query, using TF-IDF, fuzzy matching, vectors, or other algorithms. This is usually a number between 0 and 100 (or a percentage), between 1 and 10, or between 0 and 1. Some search engines always give their best matches the top score and rate the others compared to that, while other engines score documents compared to a theoretical perfect document. In all cases, this works better with a long and sophisticated query than a short one.
 
Many search engines treat retrieval and scoring as the same thing: documents which do not match the query are simply ranked at 0.

Search Engine

The program (CGI, Perl script, Java application or servlet, server module or separate server) that accepts the request from the form or URL, searches the index, and returns the results list to the server.

Search Forms

HTML interface to the site search tool, provided for visitors to enter their search terms and specify their preferences for the search. Some tools provide pre-built forms.

Search Terms

The search terms are the words entered by the searcher, which are part of the query, along with other instructions. The search engine will look for these words in the index, and return the matching results, usually sorted by relevance. Some search engines also support Boolean operators, adjacency, phrase matching, partial words, and other options.

Sorting Results

Most search engines display the results list sorted by relevance. However, some offer additional sorting options, such as sorting by the date that the files were modified, by the author's name (if known), by location, and so on.

Stemming

Using linguistic analysis to reduce a word to its root form, and then matching all forms of a word in a search query to all forms of the same word in documents. For example, someone searching for the word active might also want documents containing actively, activate, proactive, and activity.

Substring Matching: see Partial Word Matching

Taxonomy

In Biology, taxonomy is the classification of plants and animals by class, order, genus and species. In Information Retrieval, it's a set of agreed-upon terminologies and principles of classification (and it sounds more scientific).

Term Frequency / Inverse Document Frequency (TF/IDF)

The classic way of sorting matched documents by relevance is to combine Term Frequency (how often the term is found in the document) and Inverse Document Frequency (how rare the term is in the collection). Pages with many instances of search term(s) which are rare in the rest of the collection are considered most relevant. This is more useful for matching long queries with many words or comparing documents (as in more like this) than for most common Web searches.
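A minimal sketch of one common TF-IDF formulation follows. Many variants exist; this one divides the term count by the document length and uses a natural logarithm for the inverse document frequency, both of which are assumptions for the example, as are the function name and the sample documents.

```python
import math

def tf_idf(term, doc_words, all_docs):
    """Score one document for one term: term frequency times inverse document frequency."""
    if not doc_words:
        return 0.0
    tf = doc_words.count(term) / len(doc_words)
    docs_with_term = sum(1 for words in all_docs if term in words)
    if docs_with_term == 0:
        return 0.0
    idf = math.log(len(all_docs) / docs_with_term)
    return tf * idf

docs = [
    "rose rose garden".split(),
    "garden tools".split(),
    "garden rose care guide".split(),
]
rose_scores = [tf_idf("rose", words, docs) for words in docs]
garden_scores = [tf_idf("garden", words, docs) for words in docs]
```

The first document scores highest for rose because the word appears twice in a short text. The word garden appears in every document, so its inverse document frequency is zero and it contributes nothing to the ranking.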

Thesaurus

A thesaurus stores synonyms and related words. This allows a search engine to map city planning to land use, for example, and show the relevant pages even if the vocabulary of the text did not match.

Vector Search

Vector searching takes the query and compares it to the text indexed to find the best match on the most words using complex mathematical formulae. This approach works best for multi-word natural-language queries such as "What is the normal weather in Berkeley?" and in finding documents similar to an existing document. Even on very large indexes, searching is very quick. However, the algorithms can use relevance ranking that many people find disconcerting: some pages can be ranked high even if they do not contain all the search terms. See also Boolean searching.
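One simple form of vector search treats the query and each document as word-count vectors and ranks documents by cosine similarity. The sketch below assumes that approach; real engines use more elaborate weighting, and the function names and sample documents are invented for the example.

```python
import math
from collections import Counter

def to_vector(text):
    """Turn text into a word-count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two word-count vectors (0 when nothing is shared)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank(query, docs):
    """Return (score, name) pairs, best match first."""
    q = to_vector(query)
    scored = [(cosine(q, to_vector(text)), name) for name, text in docs.items()]
    return sorted(scored, reverse=True)

docs = {
    "climate": "normal weather in berkeley is mild",
    "events": "berkeley events this weekend",
    "recipes": "chocolate cake recipe",
}
ranked = rank("what is the normal weather in berkeley", docs)
```

The events page receives a score above zero even though it contains only one of the query words, which is the behavior some searchers find disconcerting.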

Vertical Portals

Sites which provide information on a specific topic. These may provide many of the services of large public portals, including multi-site search, organized information links, newsfeeds, chat, and so on, but they are directed to smaller groups such as Java programmers, investors, medical professionals, medievalists, even horse breeders.

Weighting

Weighting is a heuristic technique designed to improve the relevance ranking algorithms. The most common relevance weighting is based on term frequency in the document (see TF-IDF). Many search engines also decide that pages with matching terms in the title, keywords, description or headings are more likely to be relevant than other pages, so they add a number to the score to reflect that in the relevance ranking. Other weighting schemes recognize whether all query terms are in a page, the location in the web site, the date, and so on.
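Field weighting might look something like the sketch below, which counts each match and multiplies it by a boost for the field where it was found. The boost values, field names, and function name are all hypothetical; real engines choose and tune their own, and some add a fixed bonus rather than multiplying.

```python
# Hypothetical boost values for illustration only.
FIELD_WEIGHTS = {"title": 3.0, "headings": 2.0, "body": 1.0}

def weighted_score(page, terms):
    """Sum term matches across fields, boosting matches in more important fields."""
    score = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        words = page.get(field, "").lower().split()
        for term in terms:
            score += weight * words.count(term)
    return score

page_a = {"title": "Rose care", "body": "how to water plants"}
page_b = {"title": "Garden tips", "body": "a rose needs water"}
score_a = weighted_score(page_a, ["rose"])   # one match in the title
score_b = weighted_score(page_b, ["rose"])   # one match in the body
```

Both pages contain the term once, but the page with the match in its title ranks higher.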

Zones

Some search engines allow searchers to limit the scope of the search by selecting a specific area or topic. For example, on a corporate information site, searchers might prefer to select "technical support" or ignore all information in "company history".