As of January, 2012, this site is no longer being updated, due to work and health issues
Search Terms Glossary
See also the Markup and Formatting Languages
Glossary. Many terms added in September, 1998. For additional terms, we
recommend checking the Glossary
for Information Retrieval, the Modern
Information Retrieval (book) Glossary, and the Free
Online Dictionary of Computing.
Adjacent Searching: see Proximity
Begins-With Partial Word Matching
- Some search engines will match
indexed words that contain a search term at the
beginning -- this is a form of partial-word matching.
For example, if you're searching for "rose", the following rules
apply:
rose is an exact match
roses is a begins-with partial word match
roseola is a begins-with partial word match
arose is a partial word match but not a begins-with match
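The rules above can be sketched in a few lines of Python. The function name `classify_match` and its labels are illustrative assumptions, not taken from any particular search engine.

```python
def classify_match(term: str, word: str) -> str:
    """Classify how an indexed word matches a search term."""
    term, word = term.lower(), word.lower()
    if word == term:
        return "exact"
    if word.startswith(term):
        return "begins-with"   # term appears at the start of the word
    if term in word:
        return "partial"       # term appears somewhere else in the word
    return "none"

for word in ["rose", "roses", "roseola", "arose", "lily"]:
    print(word, "->", classify_match("rose", word))
```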
Bibliometric Analysis: see Link Tracking
Boolean Search
- A form of logical comparison first described by George Boole (hence the
name).
Boolean operators let you define whether multiple search terms are matched
within a text block (usually a web page). A Boolean expression is constructed
by joining terms together with the three special operators: AND,
OR, and NOT.
Some search forms allow searchers to indicate that all the terms in a query
must be matched (AND) or that a match on any one of
them is enough (OR). Others match pages on any of the terms (OR) and require
the searcher to add a plus sign (+) to indicate that a term is required.
Still others allow the Boolean terms themselves in the query text.
- AND (+): requires that the terms on both sides of the operator be matched.
For example, for the text "My love is like a red, red rose" the
following matches are either true or not true.
- red AND rose is true
- red AND love is true (order doesn't
matter in these comparisons)
- red AND blue is false (only one term
matches)
- The AND operator is the most confusing, because it does not match people's
expectations. If you say "I like to eat sushi and chocolate", you
are saying that you like both foods, but not necessarily at the same time.
This is more like the OR operator. The AND operator includes an implied "together".
- OR: requires that at least one of the terms on either side of the operator be matched.
For the text above:
- red OR rose is true (both terms match)
- blue OR red is true (second term matches)
- blue OR green is false
- NOT (AND NOT, -): requires that the first term match and the second term
not match: order matters here. For the text above:
- love NOT lily is true (lily is not
found)
- red NOT rose is false (rose is found)
- lily NOT rose is false (lily is not
found, so rose doesn't matter)
- lily NOT blue is false (lily is not
found)
Traditionally in Information Retrieval, a Boolean search simply returns
all results that match the search conditions, without any sorting by relevance.
Most search engines apply some kind of relevance ranking
once they have used a Boolean algorithm to determine if the pages match
the query.
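As a minimal sketch, the three operators can be tested against the example text by treating the text as a set of words. The helper names (`words_in`, `match_and`, `match_or`, `match_not`) are assumptions made for this illustration.

```python
import re

def words_in(text: str) -> set:
    """Lower-case the text and split it into a set of words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def match_and(words: set, a: str, b: str) -> bool:
    return a in words and b in words

def match_or(words: set, a: str, b: str) -> bool:
    return a in words or b in words

def match_not(words: set, a: str, b: str) -> bool:
    # "a NOT b": a must be present and b must be absent, so order matters.
    return a in words and b not in words

words = words_in("My love is like a red, red rose")
print("red AND rose:", match_and(words, "red", "rose"))
print("blue OR green:", match_or(words, "blue", "green"))
print("love NOT lily:", match_not(words, "love", "lily"))
```

Note that this only decides whether a page matches; it says nothing about ranking.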
Click Tracking
- Tracking user clicks on results lists allows search engines to identify
the most attractive items and move them up in the rankings. It is only appropriate
for very large search engines, since it relies on aggregates of user behavior.
- DirectHit provides this service,
tracking clicks in two ways:
- when a searcher clicks on a high-ranking item and returns immediately,
DirectHit assumes that the document was not very relevant to that search
- when a searcher clicks on a low-ranking item and does not return immediately,
DirectHit assumes the document was very relevant to the search
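The two heuristics can be sketched as a simple scoring rule. This is not DirectHit's actual algorithm: the cutoffs `HIGH_RANK_CUTOFF` and `QUICK_RETURN_SECS`, the function names, and the +1/-1 values are all assumptions chosen for illustration.

```python
from collections import defaultdict

HIGH_RANK_CUTOFF = 3     # assumption: ranks 1-3 count as "high-ranking"
QUICK_RETURN_SECS = 10   # assumption: back within 10 seconds counts as "immediately"

def click_signal(rank: int, seconds_before_return: float) -> int:
    """Return -1, 0 or +1 for a single click on a results list."""
    quick = seconds_before_return < QUICK_RETURN_SECS
    if rank <= HIGH_RANK_CUTOFF and quick:
        return -1   # high-ranking item, searcher came straight back
    if rank > HIGH_RANK_CUTOFF and not quick:
        return 1    # low-ranking item, searcher stayed away
    return 0        # no clear signal either way

def aggregate(clicks):
    """Sum the signals per URL; clicks are (url, rank, seconds) tuples."""
    totals = defaultdict(int)
    for url, rank, seconds in clicks:
        totals[url] += click_signal(rank, seconds)
    return dict(totals)

print(aggregate([("a.html", 1, 2), ("a.html", 1, 3), ("b.html", 9, 60)]))
```

Because a single click tells you very little, the totals only become meaningful when aggregated over many searchers.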
Classification
The process of organizing pieces of information into topical categories,
like the Yahoo listings. Usually, these are hierarchical trees, with the most
general topics at the top and the most specific at the bottom. A department
store might have "Products - Shoes - Women - Cross-Trainers", while
a gardening site might have a category "Plants - Flowers - California
Natives - Poppies". In either case, a searcher can understand more about
the content of the page when they know the category. Some classification
products will attempt to classify data automatically, while others assist
human catalogers.
Date Range
- Some search engines provide an option to search for documents modified on
a specific date, before a date, after a date, or between two dates. While
there are many interfaces to this feature, the most successful ones have included
a popup menu with easy-to-understand date options.
Directories
Lists of pages classified into useful categories
(like Yahoo or Looksmart).
Exact Match
- Some search engines will only match query terms to document words exactly:
they will not allow partial-word matches, fuzzy
matching or stemming.
Faceted Search
Some search engines allow searchers to combine terms with Boolean
operators and parentheses to make extremely
powerful faceted searches, for example:
( (red OR vermilion OR scarlet OR garnet) AND (rose OR rosa OR
rosebush) AND (bareroot OR bare root) ) NOT hybrid
should return only information about non-hybrid bareroot red-colored roses.
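One way to picture the example is as three OR groups joined by AND, followed by a NOT. The sketch below checks that expression against a few made-up page texts; the helper names and the sample texts are assumptions, and a real engine would evaluate this against its index rather than raw text.

```python
import re

def normalize(text: str) -> str:
    """Lower-case, drop punctuation, and pad with spaces for whole-word tests."""
    return " " + " ".join(re.findall(r"[a-z0-9]+", text.lower())) + " "

def any_of(text: str, phrases) -> bool:
    """An OR group: true if at least one word or phrase occurs in the text."""
    padded = normalize(text)
    return any(f" {phrase} " in padded for phrase in phrases)

def matches(text: str) -> bool:
    return (
        any_of(text, ["red", "vermilion", "scarlet", "garnet"])
        and any_of(text, ["rose", "rosa", "rosebush"])
        and any_of(text, ["bareroot", "bare root"])
        and not any_of(text, ["hybrid"])
    )

pages = [
    "Scarlet rosa rugosa, shipped bare root in winter",
    "Red hybrid tea rose, bareroot",
    "Yellow bareroot rose",
]
for page in pages:
    print(matches(page), "-", page)
```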
False Drop
- A page returned as a matching result that isn't actually
relevant to the search. In most cases, this is because words can be used in
very different ways. For example, the word argument means "heated
debate" in general usage, but in programming, it means "value passed
to a subroutine or other application".
Occasionally, a site index or database
gets corrupted and loses track of the search terms and their associated
pages. In those cases, you should remove the index files and re-index the
site from scratch.
The story behind this term: many years ago, librarians were trying to index
book records without computers. One scheme involved cards with holes punched
or filled on the edges of the cards, indicating the appropriate keywords.
To search, a librarian ran a set of thin rods, corresponding to the desired
search terms, gently through the holes in a stack of cards. Cards which
were not punched in a particular position did not let the rods go through.
Then the searcher picked up the stack of cards by the rods. The cards with
holes in those positions rose up, leaving the others below. The items which
technically matched but were not actually relevant became known as false
drops.
Fuzzy Matching
Exact matching is very strict: either a word matches or it doesn't. Fuzzy
matching is an attempt to improve search recall by matching more than the exact
word: fuzzy matching techniques try to reduce words to their core and
then match all forms of the word.
Some algorithms for fuzzy matching use the understanding that the beginning
and end of English words are more likely to change than the center, so they
count matching letters and give more weight to words with the matching letters
in the center than at the edges. Unfortunately, this can sometimes bring up
results that make very little sense in meaning, but simply might match some
of the letters (a search for tivoli might bring up laxative).
Index File
- Created by the Search Indexer program, this file stores the data from your
site in a special index or database, designed for very quick access. Depending
on the indexing algorithm and size of your site, this file can become very
large. This file must be updated often, or it will become unsynchronized with
the pages and provide obsolete results.
Intranet
An internal institutional computer network using standard protocols, rather
than proprietary or special applications. This allows employees to use standard
Web browsers to locate information and HTML editors to publish their data.
Many Intranet portals implement search engines which
index multiple internal servers to provide access to all available information.
Link Tracking
- A form of indexing which tracks the links (aka "citations") to
a document. So you could get all the pages which link to a particular page,
or sort the results by the number of links. There have been printed citation
indexes around for years: the Science, Social Science, and Arts and Humanities
Citation Indexes are produced by the Institute
for Scientific Information and are in many large libraries, but are not
freely searchable online. Lycos, Google
and other search engines already do this.
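A minimal sketch of the idea: given a table of which pages link where, count the inbound links and list the pages that cite a given page. The `links` data and the function names are hypothetical.

```python
from collections import Counter

# Hypothetical link table: page -> pages it links to.
links = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html"],
    "d.html": ["c.html", "b.html"],
}

def inbound_counts(links) -> Counter:
    """Count how many distinct pages link to each page."""
    counts = Counter()
    for source, targets in links.items():
        for target in set(targets):      # count each linking page once
            if target != source:         # ignore self-links
                counts[target] += 1
    return counts

def linking_pages(links, page):
    """All pages which link to the given page."""
    return sorted(src for src, targets in links.items()
                  if page in targets and src != page)

counts = inbound_counts(links)
results = ["a.html", "b.html", "c.html"]
print(sorted(results, key=lambda p: counts[p], reverse=True))
print(linking_pages(links, "b.html"))
```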
Legacy Data
Information in older file formats is known as legacy data. This can
refer to very old files which are accessible only through special reader programs,
or any non-web-native formats such as word processing, spreadsheets or graphics.
Match
- When an indexed page contains the same text as a search
term, it matches that term. Some search engines allow partial-word
and begins-with matches as well as exact
matches, while others extend the search terms by using fuzzy
matching and stemming.
When a page fits all the requirements of a query, it matches that query. So a page could match one or more of the search terms, but still not match the query as a whole.
More Like This
- Some search engines will allow searchers to refine searches and locate additional
documents that apply to their question by allowing them to identify documents
and then locating similar documents. This allows the searcher to control the
direction of the search and focus on the most fruitful lines of inquiry. Using
a whole document as a query allows the engine to use sophisticated matching
techniques which are inappropriate for very short queries.
Multi-Site Search
Some search engines will index and search pages on more than one server.
This may be on a different host machine (www.example.com
vs. products.example.com), or a different domain altogether
(www.domain.edu). There may be one index for all sites, or separate
indexes which are searched at the same time and the results collated.
Natural Language Processing (NLP)
To avoid forcing searchers to memorize Boolean or
other query languages, some systems allow them to type in a question, and
use that as the query. The simplest processing just removes stopwords and
uses a vector search or other statistical approach.
Some sophisticated systems try to extract concepts using linguistic analysis,
and match those against concepts extracted by the indexer. Others try to categorize
the form of the question and use it to define the query, so "who is"
questions are not treated the same as "how many" or "why":
a good example of this approach is the AskJeeves
system. For more information, see the Information
Retrieval page.
Ontology
Used in Information Retrieval and Artificial Intelligence, an ontology
defines concepts, providing a way to move towards consistency in vocabulary.
It provides a working model of the entities and interactions in a particular
topic, such as dentistry or anthropology.
"In philosophy, an ontology is a theory about the nature of existence,
of what types of things exist; ontology as a discipline studies such theories.
Artificial-intelligence and Web researchers have co-opted the term for their
own jargon, and for them an ontology is a document or file that formally defines
the relations among terms. The most typical kind of ontology for the Web has
a taxonomy and a set of inference rules." - Tim Berners-Lee, The
Semantic Web, Scientific American May 2001.
See also Classification and Taxonomy
and the whatis.com
entry.
Portal
A Web site that provides multiple services with the goal of becoming the
main site for a wide variety of users. Most of the large public portals (Excite,
Lycos, Yahoo) grew out of large public search engines, and provide multi-site
search services as well as organized information
links, email, news, and other features.
Intranet portals, sometimes called
Enterprise Information Portals, provide similar services for institutions.
Vertical portals, or vortals, are more specialized, covering
a specific topic such as an industry or hobby.
Query
- The query is the combination of the word or words used for searching,
and any options allowed by the search engine.
For example:
term="rose AND red NOT hybrid" limit="whole words" date="since -30 days"
- is a complex query telling a search engine the words to look for, their
relationship, that they cannot be partial words, and that the page must be
dated within the last 30 days.
- Most queries on Web search engines are very simple: one to three words.
Only about 3% of the searches ever use any of the advanced features such as
Boolean searching or date ranges.
Before the Web, information retrieval was oriented towards librarians and
other professionals who used these features extensively. Therefore, many older
search engines provide sophisticated control and excellent recall,
while newer systems tend to improve results rankings by adding features such as
link tracking and more like this.
Parentheses
- Some search engines let searchers group their words, especially in Boolean
searches, using parentheses "()". See Faceted
Searching for examples.
Partial Word Matching
- Some search engines will only match
exact text, others will match the beginnings of words,
while still others will match the search terms anywhere within an indexed
word: these are partial words. For example, if you're searching for
"rose":
rose is an exact match
roses is a begins-with partial word match
roseola is a begins-with partial word match
arose is a partial word match but not a begins-with match
Phrase Searching
- Some search engines provide an option to search a set of words as a phrase,
either by typing in quote marks (""), by using a command or clicking
a button. When they receive this kind of search, the engines will generally
locate all words that match the search terms, then discard
those which are not next to each other in the correct order. To perform this
task, the index must store the position of the word in the document, so the
search engine can tell where the words are located.
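To illustrate why word positions are needed, here is a small sketch of a positional index and a phrase lookup built on it. The function names and the sample documents are assumptions; real indexes are stored far more compactly.

```python
import re
from collections import defaultdict

def build_positional_index(docs):
    """Map each word to {doc_id: [positions of the word in that document]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, word in enumerate(re.findall(r"[a-z0-9]+", text.lower())):
            index[word][doc_id].append(pos)
    return index

def phrase_search(index, phrase):
    """Return ids of documents where the words appear side by side, in order."""
    words = re.findall(r"[a-z0-9]+", phrase.lower())
    if not words:
        return []
    hits = []
    for doc_id, starts in index.get(words[0], {}).items():
        for start in starts:
            # Word i of the phrase must sit exactly i positions after the first.
            if all(start + i in index.get(w, {}).get(doc_id, [])
                   for i, w in enumerate(words)):
                hits.append(doc_id)
                break
    return sorted(hits)

docs = {
    1: "My love is like a red, red rose",
    2: "A rose that is red",
    3: "Red clover and a wild rose",
}
index = build_positional_index(docs)
print(phrase_search(index, "red rose"))
```

All three documents contain both words, but only the first has them next to each other in the right order.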
Proximity Searching
An extension to Boolean searching, this technique
checks the position of terms and only matches those within the specified distance.
It's a good way to cut down the irrelevant matches and get better results.
Some search engines let you define the order, in addition to the distance.
For example: rose w10 heritage might mean "heritage within
the 10 words following the word rose". Search engines may also allow
phrase searching, where the words must match the query
term in the exact order.
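The ordered form of the example (rose w10 heritage) can be sketched as a check on word positions. The function name `within_following` is an assumption, and real engines differ in how they count distance.

```python
import re

def within_following(text: str, first: str, second: str, distance: int) -> bool:
    """True if `second` occurs within `distance` words after `first`."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    first_positions = [i for i, w in enumerate(words) if w == first]
    second_positions = [i for i, w in enumerate(words) if w == second]
    return any(0 < s - f <= distance
               for f in first_positions
               for s in second_positions)

print(within_following("A rose with a long heritage", "rose", "heritage", 10))
print(within_following("The heritage of the rose", "rose", "heritage", 10))
```

The second text contains both words, but heritage comes before rose, so the ordered test fails.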
Results List
- The web pages which match a query.
The search engine creates the list by checking
its index and saving the matching entries. Then it sorts
them, usually according to its relevance algorithm,
and generates an HTML page to display the results.
The simplest results listing is just the page title and a URL link
to the page. More helpful results include the meta description data (if
available) or the first few lines on the page, the date modified, file size,
and a relevance ranking.
Recall and Precision (as used in Information Retrieval)
- Recall describes the idea of finding all items which are relevant
(useful) to a query. In the real world, only a subset of the relevant
items is found.
Precision describes the idea of finding only those items which are relevant
to a query. Again, in the real world, many items which are matched by a query
are not really relevant to the question, although they might match the vocabulary.
In information retrieval, there's a classic tension between recall and
precision. If you aim for more recall (trying to find all the relevant items),
you often get a lot of junk. If you limit your search, trying to find only
precisely relevant items, you can miss important items because they
don't use quite the same vocabulary.
- For example, a person might search for web site search. They are looking for web pages about web site search engines, such as this site. Even if the webwide search engine finds only the pages with all of those words, it may well find many pages for site promotion, lists of search engines, academic IS departments, and the African National Congress Home Page (tested with HotBot, May 28, 1998).
If you think of the exact match for a query as a fraction, the perfect
result of a query would be a list of 1.0 matches: all items would be relevant
and all relevant items would be found. But that's not how it really works.
For example, if you have a site about your home town with 100 pages, and
10 of them are about land use, a search for city planning might
retrieve 4 pages about zoning (which are relevant) and 2 about upcoming
events (which are not relevant). The recall fraction is 0.40 (4 out of 10
truly relevant pages are found) and the precision is 0.67 (4 out of 6 pages
are relevant).
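The home town example works out as follows. The page names are invented placeholders; only the counts come from the example above.

```python
def recall_and_precision(relevant: set, retrieved: set):
    """Return (recall, precision) for one query."""
    hits = relevant & retrieved   # relevant pages that were actually found
    recall = len(hits) / len(relevant) if relevant else 0.0
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    return recall, precision

# 10 pages are truly about land use; the search found 4 of them plus 2 event pages.
relevant = {f"land-use-{i}" for i in range(10)}
retrieved = {f"land-use-{i}" for i in range(4)} | {"events-1", "events-2"}

recall, precision = recall_and_precision(relevant, retrieved)
print(f"recall={recall:.2f} precision={precision:.2f}")  # recall=0.40 precision=0.67
```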
Relevance and Relevance Ranking
- Relevance is the measure of how well the indexed page answers the
question. Only the searcher can actually define this: there is no way to automate
it.
- When there are a large number of matches for a query, the search engines
must rank the results by relevance score, sorting the results listing
so that the pages most likely to be useful will appear first. They have widely
varying algorithms to define relevance, some more productive than others.
Unfortunately, most Web searches include just one to three search terms,
and only about 3% of them use any of the advanced searching features such
as parentheses or date ranges.
Search engines must concentrate on finding all the relevant documents and
then ranking them so that the most useful pages appear first. Techniques
such as thoughtful weighting, link
tracking and click tracking provide a great
deal of help in this area.
Score
- Many search engines use a number to track the relevance of the document
to the query, using TF-IDF, fuzzy
matching, vectors, or other algorithms. This is
usually a number between 0 and 100 (or a percentage), between 1 and 10, or between
0 and 1. Some search engines always give their best matches the top score
and rate the others compared to that, while other engines score documents
compared to a theoretical perfect document. In all cases, this works better
with a long and sophisticated query than a short one.
- Many search engines treat retrieval and scoring as the same thing: documents
which do not match the query are simply ranked at 0.
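The two scoring styles described above can be sketched as two ways of scaling the same raw numbers onto a 0 to 100 range. The function names and the raw scores are assumptions for illustration only.

```python
def scale_to_best(raw_scores: dict) -> dict:
    """Give the best match 100 and rate the others relative to it."""
    best = max(raw_scores.values(), default=0)
    if best <= 0:
        return {doc: 0 for doc in raw_scores}
    return {doc: round(100 * score / best) for doc, score in raw_scores.items()}

def scale_to_perfect(raw_scores: dict, perfect_score: float) -> dict:
    """Rate every document against a theoretical perfect score."""
    return {doc: round(100 * min(score, perfect_score) / perfect_score)
            for doc, score in raw_scores.items()}

raw = {"a.html": 2.0, "b.html": 1.0, "c.html": 0.0}
print(scale_to_best(raw))
print(scale_to_perfect(raw, perfect_score=4.0))
```

With the first style some page always scores 100, even if it is a poor match; with the second, a weak query can leave every page with a low score.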
Search Engine
- The program (CGI, Perl script, Java application or servlet, server module
or separate server) that accepts the request from the form or URL, searches
the index, and returns the results
list to the server.
Search Forms
- HTML interface to the site search tool, provided for visitors to enter their search terms and specify their preferences for the search. Some tools provide pre-built forms.
Search Terms
- The search terms are the words entered by the searcher, which are
part of the query, along with other instructions. The
search engine will look for these words in the
index, and return the matching results, usually sorted
by relevance. Some search engines will allow Boolean
operators, adjacency, phrase matching and
partial words, and provide other options.
Sorting Results
- Most search engines display the results list
sorted by relevance. However, some offer additional
sorting options, such as sorting by the date that the files were modified,
by the author's name (if known), by location, and so on.
Stemming
- Using linguistic analysis to reduce a word to its root form, and then matching
all forms of a word in a search query to all forms of the same word in documents.
For example, someone searching for the word active might also want
documents containing actively, activate, proactive, and
activity.
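As a toy illustration only, the sketch below strips a handful of suffixes so that several forms of a word share one stem. The suffix list is an assumption made up for this example; real stemmers use many more rules and exceptions. It also handles suffixes only, so a prefixed form such as proactive would need additional analysis.

```python
# Toy suffix list, longest first. An assumption for this example only.
SUFFIXES = ["ely", "ity", "ate", "es", "e", "s"]

def naive_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for word in ["active", "actively", "activate", "activity", "proactive"]:
    print(word, "->", naive_stem(word))
```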
Taxonomy
In Biology, taxonomy is the classification of plants and animals by
class, order, genus and species. In Information Retrieval, it's a set of
agreed-upon terminologies and principles of classification
(and it sounds more scientific).
Term Frequency / Inverse Document Frequency (TF/IDF)
- The classic way of sorting matched documents by relevance
is to combine Term Frequency (how often the term is found in the document)
and the Inverse Document Frequency (how rare the term is in the collection).
Pages with many instances of the search term(s) which are rare in the rest of
the collection are considered most relevant. This is more useful for matching
long queries with many words or comparing documents (as in more
like this) than for most common Web searches.
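A minimal sketch of one common variant: term frequency is the term's share of the words in a document, and inverse document frequency is the logarithm of (number of documents / number of documents containing the term). Many other weightings exist; the function names and sample documents here are assumptions.

```python
import math
import re
from collections import Counter

def tokenize(text: str):
    return re.findall(r"[a-z0-9]+", text.lower())

def tf_idf_scores(query: str, docs: dict) -> dict:
    """Score each document by the sum of tf * idf over the query terms."""
    tokenized = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    scores = {}
    for doc_id, words in tokenized.items():
        counts = Counter(words)
        score = 0.0
        for term in tokenize(query):
            # Document frequency: how many documents contain the term at all.
            df = sum(1 for other in tokenized.values() if term in other)
            if df == 0 or not words:
                continue
            tf = counts[term] / len(words)
            idf = math.log(n_docs / df)
            score += tf * idf
        scores[doc_id] = score
    return scores

docs = {
    "a": "red rose red rose garden",
    "b": "garden tools and garden hoses",
    "c": "rose garden tours",
}
scores = tf_idf_scores("red garden", docs)
print(scores)
```

Here garden appears in every document, so it contributes nothing; red is rare, so the one page that uses it ranks first.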
Thesaurus
A thesaurus stores synonyms and related words. This allows a search engine
to map city planning to land use, for example, and show
the relevant pages even if the vocabulary of the
text did not match.
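A thesaurus lookup can be sketched as simple query expansion. The `THESAURUS` table below is hypothetical; only the city planning to land use mapping comes from the example above.

```python
# Hypothetical thesaurus entries: query phrase -> related phrases.
THESAURUS = {
    "city planning": ["land use", "zoning"],
}

def expand_query(query: str):
    """Return the query followed by any related phrases from the thesaurus."""
    normalized = " ".join(query.lower().split())
    return [normalized] + THESAURUS.get(normalized, [])

print(expand_query("City Planning"))
print(expand_query("roses"))
```

The engine would then search for every phrase in the expanded list and merge the results.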
Vector Search
Vector searching takes the query and compares it to
the text indexed to find the best match
on the most words using complex mathematical formulae. This approach works
best for multi-word natural-language queries such as "What is the normal
weather in Berkeley?" and in finding documents similar to an existing
document. Even on very large indexes, searching is very quick. However,
the algorithms can use relevance ranking
that many people find disconcerting: some pages can be ranked high even
if they do not contain all the search terms. See also Boolean
searching.
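One simple form of vector comparison is cosine similarity over word counts, sketched below. This is a bare-bones illustration, not the formula used by any particular engine; the function names and sample documents are assumptions.

```python
import math
import re
from collections import Counter

def vectorize(text: str) -> Counter:
    """Represent a text as a vector of word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two word-count vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def rank(query: str, docs: dict):
    """Return (score, doc_id) pairs, best match first."""
    q = vectorize(query)
    scored = [(cosine(q, vectorize(text)), doc_id) for doc_id, text in docs.items()]
    return sorted(scored, reverse=True)

docs = {
    "weather": "Normal weather in Berkeley is mild",
    "hotels": "Berkeley hotels and restaurants",
    "roses": "Growing red roses",
}
ranked = rank("What is the normal weather in Berkeley?", docs)
for score, doc_id in ranked:
    print(f"{score:.3f} {doc_id}")
```

The hotels page still gets a score above zero although it shares only one word with the query, which is the behavior many searchers find surprising.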
Vertical Portals
Sites which provide information on a specific topic. These may provide
many of the services of large public portals, including
multi-site search, organized
information links, newsfeeds, chat, and so on, but they are directed to
smaller groups such as Java programmers, investors, medical professionals,
medievalists, even horse breeders.
Weighting
- Weighting is a heuristic technique designed to improve the relevance ranking
algorithms. The most common relevance weighting is based on term frequency
in the document (see TF-IDF). Many search engines
also decide that pages with matching terms in the title, keywords, description
or headings are more likely to be relevant than other pages, so they add a
number to the score to reflect that in the relevance
ranking. Other weighting schemes recognize whether all query terms are
in a page, the location in the web site, the date, and so on.
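A sketch of how such bonuses might be added to a base score. The bonus values, the field names (`title`, `headings`, `body`) and the function name are assumptions chosen for illustration; each engine picks its own.

```python
def weighted_score(base_score, query_terms, page,
                   title_bonus=2.0, heading_bonus=1.0, all_terms_bonus=1.5):
    """Add bonuses for terms in the title or headings, and for matching all terms."""
    terms = [t.lower() for t in query_terms]
    title = set(page.get("title", "").lower().split())
    headings = set(page.get("headings", "").lower().split())
    body = set(page.get("body", "").lower().split())

    score = base_score
    score += title_bonus * sum(1 for t in terms if t in title)
    score += heading_bonus * sum(1 for t in terms if t in headings)
    if terms and all(t in title | headings | body for t in terms):
        score += all_terms_bonus   # every query term appears somewhere on the page
    return score

page = {
    "title": "Bareroot Roses",
    "headings": "Planting Care",
    "body": "how to plant a bareroot rose",
}
print(weighted_score(1.0, ["bareroot", "rose"], page))
```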
Zones
- Some search engines allow searchers to limit the scope of the search by
selecting a specific area or topic. For example, on a corporate information
site, searchers might prefer to select "technical support" or ignore
all information in "company history".