For more detailed information, see www.searchtools.com/info/database-search.html
end of intro, slide 10, next is Elements of a Search Engine
Note: Setting standard rules about what will and will not be indexed saves time and increases consistency. Notes in the Content Inventory explain exceptions and special cases.
This is a Critical Success Factor! Come back to the Content Inventory and processing issues throughout indexing.
CSF - valuable data may be hidden back there. GO robots.txt example
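A minimal sketch of how an indexer might check robots.txt rules before fetching a page, using Python's standard urllib.robotparser. The robots.txt content, the "site-indexer" agent name, and the example.com URLs are all made up for illustration; they are not from the slides.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: general robots are kept out of /private/,
# while the site's own indexer (assumed name "site-indexer") is allowed in.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /cgi-bin/

User-agent: site-indexer
Disallow: /cgi-bin/
"""


def can_index(agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the named robot may fetch the URL under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)
```

With these assumed rules, the local indexer can reach content that other robots cannot, which is one way valuable data ends up hidden from (or exposed to) a given search engine.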
Note: Many online newspapers and magazines provide search results for articles as teasers, to encourage purchases or subscriptions.
Note: For pages protected during transit by encryption (SSL), the search engine indexer can use an SSL client for access. The search server then needs to be protected as much as the original server, and must serve results pages encrypted to avoid unauthorized access in transit. Again, work with the security team on this policy.
concept of Rich Index - a CSF
stemming & stopwords - see below
Note: most HTML and XML metadata is about the document, rather than words in the text.
The Dublin Core working group agreed on 15 tags for tracking and cataloging web pages and creating metadata. These are: Title, Creator, Subject, Description, Publisher, Contributor, Date, Type, Format, Identifier (URL), Source, Language, Relation, Coverage and Rights (copyright information). For more information, see http://dublincore.org/
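A rough sketch of how an indexer might pull Dublin Core values out of an HTML page, assuming the page uses the common convention of meta tags named with a "DC." prefix (for example, DC.title). The class name, function name, and sample page are illustrative only.

```python
from html.parser import HTMLParser


class DublinCoreParser(HTMLParser):
    """Collect <meta name="DC.xxx" content="..."> values from an HTML page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = attr.get("name") or ""
        content = attr.get("content")
        # Keep only tags such as DC.title or DC.creator that carry a value.
        if name.lower().startswith("dc.") and content:
            element = name[3:].lower()
            self.fields.setdefault(element, []).append(content)


def extract_dublin_core(html):
    """Return a dict mapping element name to the list of values found."""
    parser = DublinCoreParser()
    parser.feed(html)
    return parser.fields


# Hypothetical page header used only to exercise the parser.
SAMPLE_PAGE = """
<html><head>
<meta name="DC.title" content="Search Tools Report">
<meta name="DC.creator" content="A. Author">
<meta name="DC.language" content="en">
<meta name="keywords" content="search">
</head><body></body></html>
"""

dc_fields = extract_dublin_core(SAMPLE_PAGE)
```

Values are stored as lists because an element such as Creator or Subject may appear more than once on a page.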
Highly recommended for relevance and other issues: Modern Information Retrieval, Ricardo Baeza-Yates & Berthier Ribeiro-Neto, Addison-Wesley Longman, May 1999, ISBN 020139829X, $55
Compare with simple text search results:
eBay users have specific requirements and are willing to use the advanced search form.
Likewise, scientific, technical and business intelligence researchers may want to use advanced search for specific topics.
CSF: don't confuse the users!
aka Best Bets, QuickLinks, KeyMatch, Recommendations
Notes: In this example from New York Magazine, each results item includes the title as a clickable link, author name, page description, section (sometimes), and date. For articles where the matches are further down in the text, users may not understand why they got the match. That's why showing the match terms in context is so powerful.
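To make the idea of match terms in context concrete, here is a small sketch that finds the first occurrence of a query term and returns the surrounding text with the term bolded. The function name, the width parameter, and the use of HTML b tags are assumptions for illustration, not how any particular search engine does it.

```python
import re


def snippet(text, term, width=40):
    """Return text around the first match of term, with the match in <b> tags."""
    match = re.search(re.escape(term), text, flags=re.IGNORECASE)
    if match is None:
        return None
    start = max(0, match.start() - width)
    end = min(len(text), match.end() + width)
    # Mark truncation so the reader can tell the snippet is an excerpt.
    prefix = "..." if start > 0 else ""
    suffix = "..." if end < len(text) else ""
    before = text[start:match.start()]
    hit = text[match.start():match.end()]
    after = text[match.end():end]
    return f"{prefix}{before}<b>{hit}</b>{after}{suffix}"
```

For example, snippet("The best pizza in town", "pizza", width=5) returns "...best <b>pizza</b> in t...". A real engine would usually break on word boundaries and merge several matches, which this sketch does not attempt.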
From consumerreports.org, results items include: titles (sometimes duplicates), dates (sometimes) in the titles, text surrounding bolded match terms, and category. Note the Free flag: the other results are shown as encouragement to subscribe.
The PBS.org NewsHour search results show metadata, a frame from the video, and a dropdown menu containing that section of the transcript. The search engine is from OnStreamMedia.com.
Nordstrom.com - Note that the facets on the left side include category and color; further down are size, price, and brand. Each of them has a preview number, so it's clear how many items will be there when the user clicks.
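The preview numbers next to each facet value can be sketched as a simple count over the current result set. The field names and the sample items below are invented for illustration; a production engine would normally compute these counts inside the index rather than in application code.

```python
from collections import Counter


def facet_counts(items, facets):
    """Count how many result items fall under each value of each facet."""
    counts = {facet: Counter() for facet in facets}
    for item in items:
        for facet in facets:
            value = item.get(facet)
            if value is not None:
                counts[facet][value] += 1
    return counts


# Hypothetical result set for one query.
results = [
    {"category": "shoes", "color": "black"},
    {"category": "shoes", "color": "brown"},
    {"category": "bags", "color": "black"},
]

previews = facet_counts(results, ["category", "color"])
```

Here previews["category"]["shoes"] is 2, so the interface could show "shoes (2)" and the user knows how many items to expect before clicking.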
The North Carolina State University Library online catalog: http://www.lib.ncsu.edu/catalog/
Note: Buy, don't build, unless you have a truly unique need.
For information needs analysis, see slides in first section
Notes about evaluating scale: On April 14, the IRS gets over 100 queries per second; Google in 1999 got 65 queries per second. Many intranets get fewer than 10 queries per minute.
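Putting those rates on the same unit makes the gap easier to see. The short calculation below uses only the figures quoted in the note, treating "over 100" and "fewer than 10" as the boundary values 100 and 10, so the ratio is a lower bound.

```python
def queries_per_minute(queries_per_second):
    """Convert a per-second query rate to a per-minute rate."""
    return queries_per_second * 60


irs_peak = queries_per_minute(100)    # 6000 queries per minute
google_1999 = queries_per_minute(65)  # 3900 queries per minute
intranet = 10                         # already per minute

# The IRS peak is at least this many times a typical intranet load.
irs_vs_intranet = irs_peak // intranet  # 600
```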