As of January 2012, this site is no longer being updated, due to work and health issues.
December 20, 2007
SearchTools Reports Updated
On a roll now, I hope to get the whole tools section cleaned up in the next couple of weeks.
- YourAmigo is now mainly offering sophisticated search engine optimization, including database extraction for indexing. Their search engine is still available for web sites via remote hosting.
- WebCat IB 2.0 has no indication of continuing development.
- URL Spider Pro seems to have been discontinued.
- Ultraseek, now owned by Autonomy, has recently added extended support for NTLM access control, Macromedia Flash and Microsoft Access file formats, more granular administration roles and a version with all interface elements, including the administration tools, in Chinese.
- Universal Knowledge Processor has no indication of continuing development.
- TYPENGO N300 Enterprise Search (from FrontLogic) has no indication of continued development.
December 18, 2007
SearchTools Reports Updated
We've updated the following reports on search engines large and small in the last few weeks:
- i411 has changed its name to Intelligenx and added autocategorization and multiple language support.
- Engenium now has an OEM library and an automatic clustering module.
- FreeFind now has wildcards for excluding URL paths from indexing, indexes common office document file formats, relevance weight adjustments for URL paths (with wildcards), and some really nice indexing reports -- URLs extracted, server response, status, and which URLs are actually in the searchable index.
- HomePageSearchEngine now indexes more file types.
- Doclinx now has a web monitoring agent, with support for speech recognition, for research and competitive intelligence, and a language analyzer.
- Boolean Search now runs natively on both PPC and Intel Mac OS X systems, includes web-based admin, spellchecking and match term highlighting in search results, template and AppleScript integration for search results formatting, standalone search server, and regular expressions in queries.
- Crawl-it remote service is still being supported.
- Datagold is no longer a separate search; it's now part of an online archiving suite.
- Educasoft has no indication of continuing development.
September 19, 2007
Search Conferences Listing updated
This list covers all the search and related conferences I know about. At the Enterprise Search Summit West I will be doing a pre-conference workshop on Critical Success Factors (how search engines work and how to make them better), giving a presentation on Tuning Search using Analytics, and moderating a panel on Good Practices for Search User Interfaces. At the Web Builder 2.0 conference, I'll be presenting on Web Site Search and the User Experience. If you are a reader of this web site, please come and say hi, and if you'd like an online presentation to your organization or company, I do those as well.
To suggest a conference for the listing, please leave a comment and I'll add it.
August 29, 2007
Critique of the Google Custom Search Traffic Report
I don't usually blog about individual search admin interface issues, but this one bothered me. I was helping a small B2B site install the Google CSBE, and looked at the search traffic report. Edward Tufte would be disappointed in it, Google.
August 20, 2007
Google Search Appliance and Mini - SearchTools Report Updated
I have updated my report on the GSA and Mini search appliances, with detail based in part on my recent experiences customizing a Google Mini. The report includes information on the pricing as far as I could find it, the terms of licensing, new features, links to informative documents, and features that are not included with the Mini appliance.
Once I update my full GSA product review, I will have a chance to pay attention to other search engines, and that will be lovely.
August 16, 2007
Different results, Google CSE vs. Google.com
A support document (cached copy) for the Google CSE (Custom Search Engine) and CSBE (Custom Search Business Edition) notes that some results may be different from those found in the same search on Google.com. It attributes this to including more than three sites in the CSE, and says that the CSE is using a subset of the Google.com index.
They recommend limiting the CSE to three sites, changing the behavior to 'Search the entire web but emphasize included sites', or adding refinements that have the same effect.
As of August 16, 2007, the support note says "We're working to bring more complete results to all Custom Search Engines."
August 3, 2007
Google Launches Site Search Service for Business
Google's Custom Search Business Edition uses the Google web search index limited by site or sites. It provides most of the Google web search features and is very cheap: only $100 per year for up to 50,000 pages, or $500 for up to 500,000 pages. There is more in my InfoToday article and on the SearchTools Google Service report page.
May 3, 2007
Swish-e - SearchTools Report Updated
A free open-source Unix search engine, Swish-e is fast at indexing and searching, and quite flexible. It can handle simple authentication, and indexes HTML, text, XML, and (via converters) PDF, MS Word, Excel and MP3 ID3 tags, with an emphasis on storing fields/tags for specifying during search. Results can be sorted by relevance, date, size, and other fields. It runs as a CGI to a web server (Apache recommended), and has a fairly active user and developer base. New features include adjustments to the relevance algorithm, a "near" operator and a "?" single character wildcard operator (in addition to "*").
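As a rough illustration of how the two wildcard operators differ, here is a minimal sketch using Python's standard fnmatch module, which happens to use the same "*" and "?" conventions. This is not Swish-e's implementation or query syntax; the `match_terms` helper and the word list are hypothetical.

```python
from fnmatch import fnmatchcase

def match_terms(pattern, vocabulary):
    """Return the words that match a wildcard pattern.

    "*" matches any run of characters (including none);
    "?" matches exactly one character.
    """
    pattern = pattern.lower()
    return [word for word in vocabulary if fnmatchcase(word.lower(), pattern)]

# Hypothetical list of indexed words, for illustration only.
vocabulary = ["search", "searches", "searching", "seas", "seat", "sear"]

star_matches = match_terms("sear*", vocabulary)      # any ending after "sear"
question_matches = match_terms("sea?", vocabulary)   # exactly one more character
```

Note that fnmatch also treats square brackets as character classes, which a real search engine's wildcard syntax may not.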
DolphinSearch - SearchTools Report Updated
With an unusual search algorithm based on neural networks and dolphin research, this search appliance is designed for legal research and corporate compliance, and integrates into an enterprise document management system.
April 24, 2007
Analysis & Review of Thunderstone Webinator Search Engine
I cover every aspect of the Thunderstone Webinator search engine, looking at what's possible, what's special and what's missing. I've been much helped by the posts on the Webinator support mailing list and the frank answers from Thunderstone's representative, as well as several working indexes on one of their test Search Appliances.
April 15, 2007
IBM OmniFind Yahoo Edition -- new SearchTools Report
This free search engine, based on the open-source Lucene core, is a reasonably full-featured search that can index up to 500,000 pages, making it an interesting competitor to the Google Search Appliance, Autonomy Ultraseek and Solr, as well as lower-end search engines.
Features include an automated install package for Windows and Linux, browser administration, a powerful web crawling robot, a file system remote crawler, and index support for over 400 file types (using the Inside Out system for file reading). Query parsing recognizes Internet Query Operators and Boolean operators, and the engine provides a spellchecker, synonyms, suggestions, and Lucene-based stemming. It indexes and searches Arabic, Czech, Danish, German, Greek, English, Spanish, Finnish, French, Hebrew, Italian, Japanese, Korean, Dutch, Norwegian, Polish, Portuguese, Russian, Swedish, Simplified Chinese, and Traditional Chinese. Searches can be sent via REST, and the results formatted within the admin interface, or sent back as ATOM, HTML with XSLT or XML, and linked to optional local document caching. Enterprise support is available from IBM.
There are some first-release glitches, but it's a well-designed package that's easy to use interactively, with some powerful automation interfaces ready for those who need more flexibility. Definitely worth a look.
April 11, 2007
Info Today Report: "Enterprise Search: Deployment, Usage and Trends"
A survey of 250 professionals connected to search in their enterprise has some enlightening results. They came from a fairly wide variety of industries, organization sizes, departments and roles (described in detail in the report), so the results are generally applicable.
This survey contradicts conventional wisdom by reporting that 62% of these enterprises have more than one search engine, with 27% having four or more search engines. In my view, this indicates the understanding that one search cannot solve all problems, and that some areas will require specialized, and usually more powerful, search solutions.
The other response which surprised me was that 20% of respondents said they already provide search for audio and video, and 35% said they want to do so in the future. I suppose some of that is podcasts and training videos, and it's a big challenge for search, although much easier if there are transcripts or textual captions.
The report also covers integration with other applications (mainly CMS and KM), current search solutions, vendor support satisfaction, software vs. hosted search vs. appliance (only 17% reported using a search appliance), upgrade plans, and search features currently available and desired for the future. There's a long section about the respondents' relative emphasis on various criteria for selecting a search solution, covering ease of use, features, integration, cost, scalability, speed, vendor reputation, ease of installation, upgradability, and vendor support.
This report is available on the Enterprise Search Center, at a cost of $495 US. The study was conducted by Shore Communications and Faulkner Information Services.
March 30, 2007
Solr Open Source Enterprise Search and Faceted Metadata Server - new SearchTools Report
The Solr Java open-source search engine builds on the Lucene engine, adding more standard tools for indexing, query processing and sending back results. While Solr does not have a site indexing crawler, it can use Nutch or any other robot crawler, and accept content converted from any native format to a simple XML schema. The architecture provides powerful tools for analyzing and transforming text to create a very rich index with structured fields. It accepts a wide variety of query operators, and parameters to control retrieval and ranking, including sorting on specified fields. The default relevance ranking can be tuned to suit the needs of the users and content, and the search results provide the valuable match terms in context for each item. In addition to lists of results, Solr has faceted metadata displays dynamically calculated for search results, allowing users to drill down on topics, date ranges, price, brand or any other attribute. For scaling to millions of documents and high search traffic, the system offers caching configuration, index replication, and autowarming for new starts. It's installed on such high-traffic sites as CNET, Shopper.com, and The Internet Archive, showing its scalability, and has active developer and user mailing lists.
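To make "faceted metadata displays dynamically calculated for search results" a little more concrete, here is a minimal sketch of the general idea in Python. It is not Solr's API or implementation; the `facet_counts` and `drill_down` helpers, the field names, and the sample documents are all hypothetical.

```python
from collections import Counter

def facet_counts(results, fields):
    """Count how many results carry each value of each facet field."""
    counts = {field: Counter() for field in fields}
    for doc in results:
        for field in fields:
            value = doc.get(field)
            if value is not None:
                counts[field][value] += 1
    return counts

def drill_down(results, field, value):
    """Narrow the result list to documents with the chosen facet value."""
    return [doc for doc in results if doc.get(field) == value]

# Hypothetical search results, for illustration only.
results = [
    {"title": "Laptop A", "brand": "Acme", "price_range": "500-999"},
    {"title": "Laptop B", "brand": "Acme", "price_range": "1000-1499"},
    {"title": "Laptop C", "brand": "Zenith", "price_range": "500-999"},
]

facets = facet_counts(results, ["brand", "price_range"])
acme_only = drill_down(results, "brand", "Acme")
```

The counts are recomputed for whatever result set the query returns, which is what lets the facet display change as users narrow their search.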
February 15, 2007
Alkaline - SearchTools Report Updated
Vestris Alkaline, from Switzerland, has been around for a long time but is still very actively updated. Running on Unix and Windows NT, it has a web crawler that can handle multiple sites, with extensive rules options for including and excluding pages by URL and extension. It is mainly focused on web pages but external filters allow indexing of XML, PDF, Microsoft Word, WordPerfect and other documents. Can handle password and Windows NTLM access control, but displays all results (no hit-level authentication). Query features include internet and Boolean operators, wildcards and number search; admins can adjust results weighting using a local GUI configuration interface. Standalone search server can run on any port. Written in C++ for binary distribution, but source code licensing is available. Low price: free for noncommercial sites, $350 for commercial sites.
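Rules for including and excluding pages by URL and extension can be pictured with a short Python sketch. This is a generic illustration, not Alkaline's rule syntax or behavior; the `should_index` helper, the patterns, and the example.com URLs are assumptions.

```python
import posixpath
from fnmatch import fnmatchcase
from urllib.parse import urlparse

def should_index(url, include_patterns, exclude_patterns, excluded_extensions):
    """Decide whether a crawler should index a URL.

    Exclusions win over inclusions; patterns are matched against the URL path.
    """
    path = urlparse(url).path
    extension = posixpath.splitext(path)[1].lower()
    if extension in excluded_extensions:
        return False
    if any(fnmatchcase(path, pattern) for pattern in exclude_patterns):
        return False
    return any(fnmatchcase(path, pattern) for pattern in include_patterns)

# Hypothetical rule set, for illustration only.
include = ["/docs/*"]
exclude = ["/docs/private/*"]
skip_extensions = {".zip", ".exe"}

ok = should_index("http://example.com/docs/guide.html", include, exclude, skip_extensions)
private = should_index("http://example.com/docs/private/a.html", include, exclude, skip_extensions)
archive = should_index("http://example.com/docs/archive.zip", include, exclude, skip_extensions)
```

Checking exclusions first is a design choice in this sketch; a real crawler may order or combine its rules differently.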
February 14, 2007
i411 - SearchTools Report Updated
i411 is a faceted metadata search and browse engine, capable of scaling to very large deployments, such as the DexOnline yellow pages site, which uses it for both search results and browse navigation. The most recent version adds a web crawler to the local file and database connectors, a natural language module that can extract entities from queries and provide concept-based spellcheck, more flexibility in the search flow, and a SiteOptimizer analytics and reporting module to expose site dynamics and user behavior.
(Disclaimer: I consulted with DexOnline and helped them choose the engine among a very strong field of candidates.)
February 1, 2007
DataparkSearch - SearchTools Report Updated
DataparkSearch is a free open-source search engine written in C by some smart people in Russia as an offshoot of the MnogoSearch project. The biggest strength of this application is how well it handles languages and character sets. It supports internationalized domain names and proper word-segmentation (tokenization) in many languages including Chinese, Japanese, Korean and Thai. It can perform language detection on both text files and user queries. Spellchecking, abbreviation and synonym query expansion are on a per-language basis. This search engine has some fancy Information Retrieval features like fuzzy searching, Boolean queries, and their own "Neo popularity ranking" based on neural network research and link analysis. Results templates include several more innovative than simple listings (I'm not sure if I like them, but they're interesting). It also has caching of the index files, search templates and the code, and can distribute indexes and search servers among multiple machines, for better response time. However, the code is distributed in source format for local compilation, and all the features are set via config files and runtime parameters -- it requires some comfort with command lines and programming tools. But there's an excellent manual and an active forum of users and developers.
January 22, 2007
Subject Search Server (SSServer) SearchTools Report Updated
Subject Search Server indexes local text and HTML files only, and handles many languages and character sets. It has a technical and ambitious interface, offering control over the length and number of extracts to display, as well as defaulting to fuzzy search -- matching parts of the query terms rather than the more standard exact match -- which can be changed in the search form. Although it's free (with a link to the Kryloff site) on Windows, Linux and FreeBSD, it lacks a robot spider for indexing via HTTP, has a less-than-user-friendly search form and results pages, and has no admin interface, relying on configuration files instead.
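The difference between fuzzy matching on parts of a term and exact matching can be sketched as follows. The report does not describe how Subject Search Server implements its fuzzy search, so this sketch substitutes a common technique, character trigram overlap; the function names, threshold, and word list are assumptions.

```python
def trigrams(term):
    """Break a term into overlapping three-character pieces, padded at the edges."""
    padded = f"  {term.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def similarity(a, b):
    """Share of trigrams the two terms have in common (0.0 to 1.0)."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)

def fuzzy_match(query, words, threshold=0.3):
    """Return words whose trigram overlap with the query meets the threshold."""
    return [word for word in words if similarity(query, word) >= threshold]

def exact_match(query, words):
    """Return only words identical to the query (ignoring case)."""
    return [word for word in words if word.lower() == query.lower()]

# Hypothetical indexed words, for illustration only.
words = ["searching", "banana", "search"]

fuzzy_hits = fuzzy_match("search", words)
exact_hits = exact_match("search", words)
```

The fuzzy version finds related word forms that the exact version misses, at the cost of occasional loose matches; the threshold controls that trade-off.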
Webetiser (formerly re.s@earch suite) SearchTools Report Updated
Xapian Code Library SearchTools Report Updated
Xapian is an active open source high-performance text retrieval system, based on years of research and scalable to very large sets of documents. It now includes the Omega search engine, an application that implements the code library and makes it relatively simple to install and run.
Harvest web-crawler no longer developed
The Harvest suites of web crawling and robot spidering tools are no longer being developed. They were some of the first developed in the field, and were widely used in the 1990s for locating and collecting information using multiple standard protocols.
Docfather Siteforum Search is no longer available
The Siteforum Docfather search engine is now part of the SFS-Software Siteforum Online Enterprise Suite.
RuterSearch is discontinued
It was a personal project and is no longer distributed.
January 19, 2007
Thoughts about handling empty queries
In many sites, I see a surprisingly big percentage of empty queries. I have some thoughts on why they're there and what to do, but am looking for more ideas.
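One way to see how big that percentage is on a given site is to measure it from the query log. The sketch below assumes the log has already been reduced to a list of raw query strings; the `empty_query_rate` helper and the sample log are hypothetical.

```python
def empty_query_rate(queries):
    """Return the fraction of queries that are empty or whitespace-only."""
    total = len(queries)
    if total == 0:
        return 0.0
    # Treat missing values and whitespace-only strings as empty queries.
    empty = sum(1 for query in queries if not (query or "").strip())
    return empty / total

# Hypothetical query log, for illustration only.
log = ["solr facets", "", "   ", "swish-e", None, "ultraseek"]

rate = empty_query_rate(log)
```

Counting whitespace-only and missing values as empty is an assumption here; a site may want to track those cases separately.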
January 17, 2007
2007 Search Conferences list posted
This list covers all the search and related conferences I know about. To suggest a conference for this listing, please leave a comment and we'll add it.
January 11, 2007
Dieselpoint Faceted Metadata Search and Browse Engine, Report Updated
Dieselpoint is a pure Java search engine that indexes metadata and field attributes as well as text and documents, allowing users to navigate within search results - faceted metadata search and browse. It indexes HTML, XML, PDF (Adobe Acrobat), Microsoft Office, and text files, and offers JDBC connectors for DB2, Oracle, Informix, MS SQL, MySQL, PostgreSQL, and Cloudscape, and near-real-time incremental index updating. It can handle up to forty languages and provides stemming for European languages. Additional features include spellchecking, synonyms, special weighting of search results, and extensive search administrator reports and controls. Definitely worth looking at for faceted metadata, at a lower price than most others.
80-20 Document Management Solution, Search Tools Report Updated
The 80-20 Retriever search engine is no longer sold separately: it's now part of the 80-20 Document and Records Management Solution, which is designed for enterprise-wide management and search of documents within an organization.
January 4, 2007
Eight Principles for Good Search Suggestions
A practical approach to making best use of resources when creating and maintaining search suggestions.
SearchTools News, like a Blog, since 1998.
For earlier news, see the 2006, 2005, 2004, 2003, 2002, 2001, 2000, 1999 and 1998 news archive pages
Last Update: 2007-12-20