Annotated Bibliography on Information Retrieval
Recommended Reading for IR Research Students [PDF, 1.5 MB] SIGIR Forum, December 2005 by Alistair Moffat, Justin Zobel and David Hawking.
Extensive annotated bibliography of the most important works in Information Retrieval since 1997. Covers topics including TREC results, scaling issues, index compression, multilingual retrieval, multimedia retrieval, statistical, vector and probabilistic approaches, evaluation and testing, and much more.
Oracle acquires TripleHop MatchPoint
TripleHop MatchPoint search software was acquired by Oracle last June. Support continues for existing customers; a migration path may be available as Oracle integrates TripleHop's technology into its own enterprise search efforts.
Analysis: looks like a good match
findinsite Report updated
findinsite (formerly known as Spy-Server) runs as a Java servlet, applet, ASP.NET or remote hosted service. It offers a good set of the current search engine features, including indexing common document formats, tools for controlling indexing, stemming and synonyms, fourteen languages and cached copies of documents with match terms highlighted.
Google Mini Search Appliance requires security patch
As reported by EWeek, the security sites Metasploit and Secunia have found security holes in the Google Mini Search Appliance that could lead to abuse by hackers. All users of the appliance should make sure they apply the Google-supplied patch as soon as possible. (See also the slightly outdated but long review of the Google Search Appliance, and a Product Report for general information.)
Slides from presentations to classes and conferences
Avi Rappoport of SearchTools has been doing some speaking at conferences and classes. If you are interested in having Avi speak on one of these topics to a meeting or corporation, please contact her.
- Organizing Information For Better Search Results - Intranets, November 2005
Finding and indexing useful content, interfacing with IA, CMSs, multimedia and structured content, and faceted metadata search.
- Search 2005 and Beyond - Intranets, November 2005
The basics -- gathering, indexing, query processing, retrieval, relevance ranking, UI, and log analysis. And then the fun stuff: alerting, IA, taxonomies, faceted metadata, multimedia, compliance, social networking, and personalization.
- Enterprise Search Overview - UC Berkeley School of Information Management and Studies, November 2005. Short explanation of how enterprise search engines work, how they are different from webwide search engines, and the various issues that affect them.
- Search Engine Marketing: A jaundiced view - UC Berkeley School of Information Management and Studies, November 2005. Short explanation of the competing forces in webwide search -- end-users, search engines, content publishers and advertisers -- and how it all seems to work now.
- Enterprise Search FAQ - Enterprise Search Summit, May 2005. Extensive coverage of various aspects of enterprise search engines for intranets and web sites. Covers gathering and spidering, index issues, query processing, retrieval, relevance ranking, search form and results page user interface, maintenance, search log analysis and issues in choosing a search engine.
- Search Log Analytics, Metrics & Analysis - Enterprise Search Summit, May 2005. Describes what metrics and logs are available for search analysis, what to look for, how to use the tools, comparing navigational and topical search, and examples of addressing problem queries.
- Using Search Engines for Data Discovery on Intranets - Search Engines Meeting, April 2005. The iterative process of crawling an intranet for search also provides a dynamic view of not just the names and types of documents, but the contents.
Upcoming Search-Related Conferences
The most important search conferences for 2006. I won't be at all of them, but think they're all worth going to.
SLI Learning Search - Remote Search Service report updated
Spiderline - Remote Site Search Service report updated
Powerful search service works remotely from the company servers, indexes via a robot spider. It can index HTML, text, PDF and MS Word files, and can handle URLs with session IDs, cookies, password-protected pages and HTTPS. It includes editable synonym lists and custom weighting of meaningful words, but no search suggestion tools. Searches include Boolean support, soundex and stemming, and can be done within a zone based on URL paths. Results pages are highly configurable, and also available in XML for programmatic flexibility. The search reports include top queries, clickthrough tracking and referrers, along with raw logs. comment on Spiderline
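Handling URLs with session IDs matters because a spider that treats each session-stamped URL as a new page will index the same content many times. The following is only a rough sketch of that idea, not Spiderline's actual method: the function name and the list of session parameter names are assumptions made for illustration, and real sites use many other names.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical set of session parameter names; adjust for the site being crawled.
SESSION_PARAMS = {"sid", "sessionid", "phpsessid", "jsessionid"}


def strip_session_id(url: str) -> str:
    """Return the URL with common session identifiers removed."""
    parts = urlsplit(url)

    # Drop path-style session IDs such as "/page.jsp;jsessionid=ABC123".
    path = parts.path
    marker = path.lower().find(";jsessionid=")
    if marker != -1:
        path = path[:marker]

    # Drop query parameters whose names look like session identifiers.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, path, urlencode(query), parts.fragment)
    )
```

A crawler could pass every discovered link through a function like this before checking whether the page has already been indexed.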
Ultraseek - Enterprise Search Engine report updated
Since its purchase by Verity in late 2002, Ultraseek has been significantly improved. Not only has the company developed a major upgrade, it has added valuable features in interim point releases. New pricing as of June 2005, including free one-year trials, perpetual licenses, and incorporation of previously separate modules make this an even more competitive product. New features in versions 5.x include SOAP and web services support, continuous improvement in Acrobat PDF handling (including Japanese), automated tools for generating page titles, excluding navigation text from indexes, hit-level authentication, layout manager for designing results page interface and additional Search Reports, including clickthrough tracking, as well as raw logs. comment on Ultraseek
SiteSurfer - Java Search Applet report updated
Java indexing and search applet provides a GUI for search administration, indexes Word and WordPerfect files, offers a customizable applet for end-user searching, and can search fields such as author and description. Works on CDs and DVDs. comment on this tool
SpyServer - Java Search Servlet report updated
Java servlet provides local server and robot crawler indexing, scheduling, simple HTML password access, European and Asian languages and character sets, templating system for results pages. Can be run from CD-ROMs, DVDs. comment on this tool
Robot Indexing Tests Updated
Obsolete and Discontinued Search Engines
The following search engines seem to be dead or obsolete:
- Visual.net (now a business intelligence tool)
- WannaSearch from Mainstay
Please leave a comment or contact SearchTools if you know anything about their status.
Blossom report added
Blossom search is a hosted (remote) search service, which indexes one or more web servers using a spider, and stores the results on its own servers. When a user types a query, the form goes to the Blossom service, which does the matching and relevance ranking, and returns the results with links to the original pages on the original servers. It has modern search features such as stemming, spelling suggestions, match terms highlighted in context, and proximity-based relevance. Comment on Blossom
ZyIndex report updated
ZyIndex is not quite a search engine -- it's a research tool for complete recall, appropriate in situations where any missing data could be catastrophic. It's part of ZyLAB's Content, Records, and Knowledge Management systems, specializing in compliance, legal firms, intelligence and law enforcement, financial back office and related fields. Comment on ZyIndex
SWISH++ report updated
Open source search engine written in C++ by Paul Lucas, based on the old swish search engine. Some of the newer features include options to exclude indexing of document sections such as headers and footers, handling ID3 tags of MP3 files, extensible indexing and filtering architecture and stemming options. Comment on SWISH++
Swish-e report updated
This stalwart open-source engine continues to be active, with improvements in incremental indexing, Unicode support, improvements in config files and indexing of very long files. Please note that the official orthography is finally set: "Swish-e". Comment on Swish-e
Arexera report updated
Formerly known as TEC-IMS, this search engine has European language detection, scalable architecture, document topic analysis, and indexes hundreds of file formats. Comment on Arexera
t.find (Eidetica) report updated
This remote search service is part of a suite that includes filtering and text mining. It uses an intelligent spider to ignore navigation and copyright text, works with existing metadata and taxonomies, and combines known-item and subject searching. Based in Amsterdam. comment on this
Indexing and Date Problems: Search Tools Report
When servers report incorrect page modification dates, it wastes indexer time, server cycles, bandwidth and everything else. This analysis describes several common kinds of date errors, and their implications, as well as some approaches for solving these problems. comment on this
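As a rough illustration of how an indexer might screen the dates a server reports, the sketch below classifies an HTTP Last-Modified header value. The function name and the categories it checks (missing, unparseable, in the future, at the Unix epoch) are assumptions chosen for the example; they are not necessarily the error types covered in the report.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime


def check_last_modified(header_value, now=None):
    """Classify a Last-Modified header value as plausible or suspicious."""
    now = now or datetime.now(timezone.utc)

    if not header_value:
        return "missing"

    try:
        stamp = parsedate_to_datetime(header_value)
    except (TypeError, ValueError):
        # Older Python versions raise TypeError, newer ones ValueError.
        return "unparseable"

    # Treat dates without a time zone as UTC so they can be compared.
    if stamp.tzinfo is None:
        stamp = stamp.replace(tzinfo=timezone.utc)

    # Allow a few minutes of clock skew before calling a date "future".
    if stamp > now + timedelta(minutes=5):
        return "future"

    # A date at or before 1970 usually means an unset timestamp.
    if stamp.year <= 1970:
        return "epoch"

    return "plausible"
```

An indexer could fall back to its own fetch time, or to a date in the page metadata, whenever the result is anything other than "plausible".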
Search Indexing Date Test Suite
A set of pages with known date errors and metadata overrides, for testing search engines' capacity to handle date problems. comment on this
Search Mailing Lists and Usenet Discussion Links Updated
Links for discussion groups in general, and specific products: DTSearch, FAST, Google Appliances, ht://Dig, SWISH-E and Webinator mailing lists.
Enterprise Search Summit Coming Soon
The US Enterprise Search Summit is coming up, May 17 - 18 in New York City. Speakers include Lou Rosenfeld, Joseph Busch, Ron Daniel, Tom Reamy and Peter Morville, and it should be a fascinating meeting. Search Tools' Avi Rappoport will be offering an enterprise search workshop on May 16, speaking about search reports, metrics and analytics on the 18th, and moderating several panels on search topics. I hope to see many of you there!
The European Enterprise Search Summit has been cancelled.
Search Product Reports Updated
AJ is very productive. She's updated the Open Source Xapian code library page with impressive examples, the SimpleSearch page to point at the secure NMS Perl code, the WizDoc page with new articles and an example, the Windex page with price and platforms, and the WebSONAR page with features and an example.
Search Products Marked Obsolete or Discontinued
WideSource Peer Search, WebSTAR Search, and Web Server 4D Site-Search. There's been no development on Websearch Perl Script or Webrom, but they're still sold and supported.
Coveo Enterprise Search - New SearchTools Report
Coveo, formerly Copernican Enterprise Search, is designed for intranets and departmental servers. It uses an HTTP robot crawler and indexes most office productivity documents as well as HTML and XML. Extensions handle SharePoint, Lotus WebAccess and offer APIs for custom document converters. It runs on Windows Servers, and integrates strongly with IIS document level security and Windows file access permissions. Includes advanced summarization and concept extraction, parametric search, query corrections and suggestions, configurable user preferences, and extensive reports.
Search Product Reports Updated
AJ Summers is updating search product reports very quickly. She's done minor updates and link confirmations to the reports on Zoom and Zebra, with more to come soon. Because we are both slightly odd, she's starting from the end of the alphabet. comment on Zoom, Zebra and general product report issues
YourAmigo Adds Spider Linker
The new YourAmigo SpiderLinker tool makes database entries and dynamically generated content available to both internal search engine indexers and external search indexers such as Google and Yahoo. comment on YourAmigo
XML Query Engine
XML Query Engine (aka XQEngine), written in Java, is now free under the GPL license, and has a continuing SourceForge project. comment on XML Query Engine
Atomz Acquired by WebSideStory
Atomz is being bought by site analytics vendor WebSideStory. They're combining the services to create a complete set of web publishing and marketing services, "Active Marketing Suite". The remote search service will be renamed "WebSideStory Search" and be available stand-alone. The company will likely offer improved search analytics. comment on Atomz
Search Suggestions Article Updated
New information, articles and discussion of using human judgment to supplement search results for popular queries. I formerly called this "manual recommendations" but that was too awkward. comment on search suggestions
SearchTools News, like a Blog. RSS feed (validates as RSS)
Site change details
For earlier news, see the 2004, 2003, 2002, 2001, 2000, 1999 and 1998 news archive pages
Last Update: 2005-12-16