Publisher's Note: we are very pleased to feature this valuable research on free and open source search engines on SearchTools.com. It was originally written around 2004 and revised in April 2006.
Appendix 1: Alkaline Features
-- From the Documentation of Alkaline
Alkaline can be viewed as two distinct pieces: the indexer or spider and the search engine. Their constantly growing capabilities include:
Indexing
- no theoretical limits on the number of indexed documents or sites
- fully remote indexing, not just local machine or local area network
- remote URL(s) defined as a base of indexing
- true spider, follows links on web pages, A HREFs, MAPs, FRAMEs, META REFRESH, etc.
- deleted pages are automatically removed and newly created pages instantly added
- grouping of multiple sites, with individual options and parameters, inside the same search group
- automatic support for redirected URLs, relative Location: headers, detection of circular deep redirections
- multiple indexing bases for the same index/search database
- highly configurable index/search paths, exclusion lists, index categories and file extensions
- capable of using regular expressions to define which urls to follow and what documents to index
- setting file-count, recursion, and remote limits on demand
- automatic indexing of newer files only, using If-Modified-Since
- intelligent HTML parsing, link and text retrieval, supporting &...; style tags, simple error recovery
- single indexing engine for multiple search/index groups
- foreground dedicated indexing for first-time setup or fast reindexing
- multithreaded architecture with background continuous indexing
- textual cleanup, supporting accented characters (for example, searching French text with or without accents)
- META tag support for KEYWORDS and DESCRIPTION, TITLE tag support for title
- discarding of script, style and object code
- full support for robots.txt and META ROBOTS directives, disabled on demand
- filters for indexing other formats than HTML and plain text (such as Adobe PDF)
- using external third party command line tools as filters through a documented interface
- embedded objects retrieval support for indexing other formats such as Shockwave Flash using the filter interface
- page preprocessing available through a published API before real indexing, using a filter
- MD5 document signatures that identify and ignore symbolic links and duplicate documents (such as http://www.foo.com and http://www.foo.com/index.html); see the sketch after this list
- persistent remote document retrieval, fully configurable in number of retries, etc.
- supports retrieval of secured pages on password protected sites (HTTP/1.0 BASIC authentication, NTLM support for Windows NT versions, no support for SSL)
- Alkaline-specific META tags to avoid indexing of individual pages, following links, excluding text portions, indexing META data or indexing parts of a document
- using the Alkaline memory-mapped file swap to minimize memory usage
- using the Alkaline flat interval technology to stabilize the memory usage curve
- external lists of words to be excluded from indexing, rules for page inclusion, stop words, including regular expressions to define exclusions, etc.
- statistics on requests and traffic
- capable of adding/removing/reindexing URLs submitted online
- native server-side includes (SSI)
- full support for client non-JavaScript cookies
- fully parallel multithread configurable retrieval, concurrent indexing
- ability to run as a native Windows NT/2000 service
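The duplicate-detection item above is easy to illustrate. This is not Alkaline's code, just a minimal sketch of the general technique: hash each fetched document body with MD5 and skip any URL whose content matches an already-indexed page (the `seen` map and `should_index` helper are hypothetical names).

```python
import hashlib

seen = {}  # MD5 signature -> first URL indexed with that content

def should_index(url: str, body: bytes) -> bool:
    """Index a page only if no previously fetched page had identical bytes."""
    sig = hashlib.md5(body).hexdigest()
    if sig in seen:
        print(f"skip {url}: duplicate of {seen[sig]}")
        return False
    seen[sig] = url
    return True

page = b"<html><title>Foo</title></html>"
should_index("http://www.foo.com/", page)            # True: first sighting
should_index("http://www.foo.com/index.html", page)  # False: same content
```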
Searching
- searching remote sites
- searching any search group with a single search/index server
- searching of word sub-strings and heuristics, not just full keywords
- fully configurable output (virtually any HTML layout), using user-defined templates, with the MV4 expressions mechanism for each separate search group
- multiple page results, with any amount of results per page for each separate search group
- full web server pool architecture for immediate search response
- denial-of-service and server-flood protection with automatic fall-off, automatic restart on resource starvation
- searching of accented and unaccented text, with full support for automatic translation of accents (é, à, etc.); a folding sketch follows this list
- searching in META tags
- output of META DESCRIPTION and page TITLE if available
- searching in ALT image and applet tags
- no searching in scripts
- automatic selection of case-sensitive/case-insensitive search
- automatic selection of heuristics/exact search for quoted sequences
- boolean search using + and - signs
- scope restriction to host, path, url and file extension
- results sorting by date (ascending and descending), size (ascending and descending), title and url
- results re-sorting by any of the above criteria
- four-level expiring cache
- user-selection of maximum amount of results
- numeric tags: a combination such as price=345 is searchable as price<345, price=345 or price>345
- ranking weight options for titles, meta tags and document body
- weak words
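A minimal sketch of the accent translation mentioned above (not Alkaline's implementation): decompose each character and drop the combining marks, applying the same folding to indexed text and to queries so that both spellings hit the same index entries.

```python
import unicodedata

def fold_accents(text: str) -> str:
    """'élève' -> 'eleve': decompose, then drop combining accent marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

# Folding both sides makes "eleve" and "élève" match the same postings.
assert fold_accents("élève") == "eleve"
assert fold_accents("eleve") == fold_accents("élève")
```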
Online Administration
- administration section restricted by BASIC authentication, with username/password pairs at various access levels
- fully customizable administration section, using JavaScript and XML
- extended co-branding possibilities for resellers
- extensive search statistics and performance counters
- browsing of configurations and their individual parameters
- 4-level search cache statistics per configuration
- certification embedded in the admin section
- restart the server from the admin section
- refresh templates from the admin section
- add, reindex and remove individual urls from the admin section
- produce MRTG-compliant statistics through XML queries and plot search/load averages using MRTG (a sketch of MRTG's expected output follows)
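To make the MRTG item concrete: MRTG can poll an external script whose output is four lines (two counter values, an uptime string, and a target name). Here is a minimal sketch in that shape; the counter values are hypothetical stand-ins for figures the admin section's XML queries would supply.

```python
# Hypothetical counters, e.g. parsed from the admin section's XML stats.
searches_served = 48211
pages_indexed = 150332

# MRTG external-script protocol: value 1, value 2, uptime, target name.
print(searches_served)
print(pages_indexed)
print("12 days, 04:31")
print("alkaline-search-01")
```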
Appendix 2: Fluid Dynamics Search Engine (FDSE) Features
-- From the Documentation of FDSE
- FDSE is a search engine that you install on your own site. Visitors to your site use it to find files on your site or on a small cluster of sites. The search box at the top of this page is an example of how FDSE is typically used.
- FDSE is different from Google or AltaVista, which search the entire Internet. FDSE only searches the sites that you tell it to. It can handle about 10,000 documents in all, which is plenty for one site but far fewer than the total number of documents on the Internet.
- FDSE is smaller than Google or AltaVista, but it is qualitatively identical to them. It has its own built-in web robot for retrieving files, which means it is not limited to searching only documents on its own server. It builds its own index files and returns results from them, unlike some "meta-search" scripts which make behind-the-scenes requests to major search engines to gather results.
- FDSE runs entirely on your server, so visitors aren't redirected to a separate centralized server to get their results (as with Atomz and Freefind). If your web server doesn't support Perl CGI at all, then you might be better off with one of those remotely-hosted solutions.
- FDSE is a flat search engine - it accepts keywords and shows a ranked list of search results. It does not organize pages into browsable categories and subcategories like Yahoo does.
Good Features:
- Unrestricted full version download - you can try before you buy.
- Code executes 100% locally on your own server - no dependencies on other sites or companies.
- Code is 100% pure Perl - no dependencies on external modules or system calls.
- No forced banner advertisements to distract your visitors.
- Extras are optional. You can use the system with MySQL databases -- but you can always use plain file databases as well. You can configure your own keyword-triggered banner ads -- but that's your choice; it's not forced on you.
- Platform independence - runs well on Unix, Linux, Windows NT, Windows 2000, Win95/98/ME.
- Completely template-based: you control the entire look-and-feel of the site by editing text/html template files. No need to edit the source code... though you can do that too. You can always preserve your existing templates and data when upgrading or re-installing the product.
- Dependable user support, featuring large help files, an active discussion forum, and quick email responses.
- Code is modular and heavily commented for the benefit of those who want to be hardcore. Can be called as an API from another Perl script. Format of all data files is documented in the help file.
- Highly customizable filter rules allow you to programmatically control which web pages are included in the index. Filtering can be done based on patterns in the hostname, URL, or document text, or based on RSACi and SafeSurf PICS headers.
- Resource-intensive actions, like indexing entire web sites, are spread across multiple CGI executions, using META refreshes. This prevents web server timeouts due to excessive resource usage, and allows the action to recover if some individual CGI executions fail.
- Searches text and HTML files. Can also search PDF files with a free helper application.
- Add Your URL - at your option, any visitor can add her own website to the index; the script owner can turn this feature on or off.
- Attribute Indexing - a document's text, keywords, description, title, and address are all extracted and used for searching.
- Rich Display - the title, description, size, last modified time, and address of each document are shown to the user in the list of hits. The admin can configure the number of hits to show per page.
- Relevance Listing - documents are sorted by the number of keyword hits, so that the most relevant document appears first. Search terms found in the title, keywords, or description are given additional weight.
- Smart HTML Parsing - the search engine does not index text appearing inside HTML tags, nor inside <SCRIPT> or <STYLE> blocks.
- Attribute Searching - by default, searches find words in the body, title, keywords, URL, links, or text of a document. By using attribute:value searches, each portion of a document can be searched (a query-parsing sketch follows the Boolean operator notes below). The supported attributes are:
url:value (host:value) (domain:value)
Finds "value" in the web address of the document. For example, host:whitehouse.gov will only find matches on that website. The prefixes "url," "host," and "domain" all act the same.
title:value
Finds "value" between the <TITLE> and </TITLE> tags of the target document.
text:value
Searches only the actual text of the document, not the links or the URL. Due to the data structure of the index file, this attribute will include the title, keywords, and description of the file.
link:value
Searches only the text extracted from hyperlinks in the document. Useful to see which documents link to a particular page, such as
"link:http://my.host.com/"
Relative links are extracted as-is, and are not expanded.
- Phrase Searching - Enclosing words in quotation marks causes them to be evaluated as a phrase. That is, all terms must occur next to each other and in order. "My bad self", when quoted, will not match "my self is bad".
- Intended Phrase Optimization - a set of unquoted search terms will be treated as a phrase first, and as individual terms second. Thus, users who don't quote their phrases will still see phrase matches near the top of the results list.
- Punctuation for Phrase Binding - words joined by punctuation will be treated as a phrase. Searching for "Bill.Clinton" (unquoted) is the same as "Bill Clinton" when quoted.
- Punctuation-insensitive - only alphanumeric characters can be used for search terms. The characters "+", "|", "-", ":", and "*" have special meanings (require term, prefer term, forbid term, bind attribute, and wildcard match, respectively). All other punctuation characters are treated as whitespace.
- Case Sensitivity - All searches are case insensitive and accent insensitive. Searching for "Fur" will match the lowercase "fur", uppercase "FUR", and German "für".
- Boolean Operators - Search terms or phrases may be tagged with the Boolean operators "and", "or" and "not."
"and" (shortcut: plus sign)
Require this search term or phrase
"or" (shortcut: pipe character)
Prefer this search term or phrase - each hit still must contain at least one required or preferred search term; additional preferred terms will increase the ranking of the hit.
"not" (shortcut: minus sign)
Forbid this search term or phrase - each hit still must contain at least one required or preferred search term; however, the preliminary list of hits are filtered for any existence of forbidden terms, and if found, those hits are removed.
Issues
Each operator acts on the search term that follows it. The query "dog and food" will do a default search on "dog" and a required search on "food" - it does not bind the two terms "dog" and "food" together as two required terms.
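A toy sketch of how such a query syntax can be tokenized. This is illustrative only, not FDSE's actual parser; the `TOKEN` pattern and `parse_query` helper are hypothetical. Each token is an optional +, -, or | operator, an optional attribute: prefix, and a bare word or quoted phrase.

```python
import re

# operator?  attribute:?  "quoted phrase" | bare word
TOKEN = re.compile(r'([+\-|]?)(?:(\w+):)?(?:"([^"]*)"|(\S+))')

def parse_query(query: str):
    """Return (operator, attribute, term) triples; '|' marks a preferred
    term, '+' a required term, '-' a forbidden one."""
    triples = []
    for op, attr, phrase, word in TOKEN.findall(query):
        triples.append((op or "|", attr or "any", phrase or word))
    return triples

print(parse_query('dog +food -cat title:"my bad self"'))
# [('|', 'any', 'dog'), ('+', 'any', 'food'),
#  ('-', 'any', 'cat'), ('|', 'title', 'my bad self')]
```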
- Indexing remote files is done with a web crawler. The crawler operates on fixed batch sizes of documents, preventing infinite loops on robot traps or error conditions. It uses the HTTP/1.0 protocol, but also supports Host headers and dynamic cookies.
- Author Control - those who don't want their pages indexed can protect their site with a robots.txt file or the Robots META tag, described in the Robots Exclusion Standards. If their pages have already been indexed, the author can resubmit the pages once robot exclusion is in place, and the pages will automatically be removed (see the robots.txt sketch after this list).
- When optimal performance is required, the web crawler can run on a computer separate from the one providing search services. All indexing writes to a single, self-contained file that can be transferred from the indexing workstation up to the search server.
- International Support - all Latin extended characters are reduced to their English equivalents. For example, German "für" becomes "fur". Because this translation is done on the web documents and on the search terms that users enter, the net effect is transparent support for all non-English languages which use the Latin character set. This also enables non-English searching for users with English-only keyboards.
- Auditing - all searches are logged, with the user host, time, search terms, and number of results returned. The script owner can learn about visitor interests by viewing this log.
- Site Promotion - the script owner can force certain "preferred" web pages or sites to appear higher in the index.
- Site Blacklisting - the script owner can remove certain "blacklisted" web sites from the index. This is useful when the "Add Your URL" feature is turned on for all visitors.
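The Author Control item maps directly onto the standard robots.txt check. A sketch using Python's urllib.robotparser; the host, agent name, and page URL here are made up for illustration, and rp.read() performs a real network fetch.

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://my.host.com/robots.txt")
rp.read()  # fetch and parse the site's robots.txt

url = "http://my.host.com/private/page.html"
if rp.can_fetch("FDSE-robot", url):
    print("allowed: fetch and (re)index", url)
else:
    print("excluded: skip it, and drop any existing index entry for", url)
```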
Bad Features (Known Limitations and Problems)
- Latin text only - this script will index and search any text document written in a European language (Latin character set). Two-byte languages such as Japanese or Chinese are not supported.
- Web only - this search engine runs on a Perl-CGI-enabled web server. It is not suitable for non-CGI web servers. It is not suitable for searching files when a web server is not available, such as a CD-ROM of technical support information.
- Web only - the robot does not support protocols other than HTTP. For example, FTP, Gopher, or Secure Sockets Layer (HTTPS) documents cannot be indexed.
- Memory and CPU Needs - this search engine was designed to provide a rich feature set, which requires more memory and processor power. There may be leaner search engines available - if this becomes an issue for you, look for a leaner engine on www.cgi-resources.com.
Appendix 3: ht://Dig Features
-- From the Documentation of ht://Dig
- Intranet searching: ht://Dig can search through many servers on a network by acting as a WWW browser.
- Robot exclusion: the Standard for Robot Exclusion is supported by ht://Dig.
- Boolean expression searching: searches can be arbitrarily complex using boolean expressions.
- Configurable search results: the output of a search can easily be tailored to your needs by providing HTML templates.
- Fuzzy searching: searches can be performed using various configurable algorithms, in any combination (a simplified Soundex sketch follows this list). Currently the following algorithms are supported:
· exact
· soundex
· metaphone
· common word endings (stemming)
· synonyms
· accent stripping
· substring and prefix
- Searching of HTML and text files: both HTML documents and plain text files can be searched. Searching of other file types will be supported in future versions.
- Keywords can be added to HTML documents: any number of keywords can be added to HTML documents without showing up when the document is viewed. This makes a document more likely to be found and can place it higher in the list of matches.
- Email notification of expired documents: special meta information can be added to HTML documents to notify the maintainer of those documents at a certain time. It is handy to be reminded when to remove the "New" images from a certain page, for example.
- A protected server can be indexed: ht://Dig can be told to use a specific username and password when it retrieves documents. This can be used to index a server, or parts of a server, that are protected by a username and password.
- Searches on subsections of the database: it is easy to set up a search which only returns documents whose URLs match a certain pattern. This is very useful for people who want to make their own data searchable without having to use a separate search engine or database.
- Full source code included: the search engine comes with full source code. The whole system is released under the terms and conditions of the GNU General Public License version 2.0.
- The depth of the search can be limited: instead of limiting the search to a set of machines, it can also be restricted to documents that are a certain number of "mouse-clicks" away from the start document.
- Full support for the ISO-Latin-1 character set: both SGML entities (such as &agrave; for 'à') and raw ISO-Latin-1 characters can be indexed and searched.
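To make the fuzzy-search list concrete, here is a simplified Soundex, one of the algorithms named above. This is a sketch of the classic algorithm, not ht://Dig's implementation, and it omits edge cases such as the special H/W rule.

```python
def soundex(word: str) -> str:
    """Simplified Soundex: the first letter plus up to three digits, so
    similar-sounding names share a code ('Robert' and 'Rupert' -> R163)."""
    codes = {}
    for group, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")]:
        for ch in group:
            codes[ch] = digit
    word = word.lower()
    out, prev = word[0].upper(), codes.get(word[0], "")
    for ch in word[1:]:
        digit = codes.get(ch, "")
        if digit and digit != prev:  # skip vowels and repeated codes
            out += digit
        prev = digit
    return (out + "000")[:4]

assert soundex("Robert") == soundex("Rupert") == "R163"
```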
Appendix 4: HTDIG v. Juggernautsearch Features Comparison
by Donald T. Kasper
| Features | Juggernautsearch | HTDIG |
|---|---|---|
| Level of expertise required to operate program | novice | UNIX expert and programming expert |
| Programming language used in product | Perl | C, C++ |
| Runs native on Linux | yes | yes |
| Runs native on Windows | yes (Pro version; the 1.0.1 public version does not run on Windows) | no; requires the CYGWIN UNIX utilities on NT |
| Same code for Linux and Windows | yes | no |
| Can index large numbers of Web sites | yes | no |
| Can index local files | yes (Pro version) | yes |
| Banner advertising built-in | yes | no |
| Uses more than 1 starting Web page to search | yes | no |
| Source provided | yes (1.0 version only) | yes |
| Can filter out keywords linking to adult (porn) sites | yes (Pro version) | no |
| Can filter out undesired Web sites | yes (Pro version) | no |
| Can filter out common English words to reduce storage size | yes (Pro version) | no |
| Can require at least one required word for Web page to be saved (active search) | yes (Pro version) | no |
| Can have many Pagerunner programs running to collect Web page information | yes | no |
| Can remove old (obsolete) Web pages | yes | no |

| Limitations | Juggernautsearch | HTDIG |
|---|---|---|
| Have to compile the product to get an executable program | no | yes |
| Requires obtaining and learning a compiler program to compile the product | no | yes |
| Requires understanding and using very complex compiler scripts and compiler programs | no | yes |
| Requires a knowledge of computer programming to install and maintain | no | yes |
| Have to re-compile when program setup is changed | no | yes |
| Product can produce segmentation faults (crashes that can halt your machine) | no | yes |
| Search requires sorting step (makes searches take up to several minutes for a response) | no | yes |
| Typical response time for a query | less than 2 seconds | up to 5 minutes |
| Uses the obsolete WAIS standard for search (obsolete since 1995) | no | yes |
| Storage size required | smaller, as only keywords and Web addresses from Web pages are saved | enormous, as all Web page contents are retrieved and saved to run queries |
Appendix 5: mnoGoSearch Features
-- From the Documentation of mnoGoSearch
- Full text indexing. Different priorities can be configured for the body, title, keywords, and description of a document.
- Support for all widely used single- and multi-byte character sets, including UTF-8, as well as most of the popular East Asian languages.
- Automatic document character set and language guesser for about 70 charset/language combinations.
- HTTP/1.0 support
- FTP support
- NNTP support (both news:// and nntp:// URL schemes) in standard and extended modes.
- HTTP Proxy support
- Local file system indexing support (file: URL scheme)
- Support for gzip, deflate, and compress content encodings (a decoding sketch follows this list)
- Built-in database support
- Support for several SQL databases. Currently MySQL, PostgreSQL, miniSQL, Solid, Virtuoso, InterBase, Oracle, Sybase, MS SQL, iODBC, unixODBC, EasySoft ODBC-ODBC bridge, and IBM DB2 databases may be used as the mnoGoSearch backend.
- Search clusters: the database can be distributed across several machines.
- Basic authorization support (to index password protected areas)
- Both HTML documents and plain text files can be indexed
- External parsers support for other file types (pdf, ps, doc etc.)
- Mirroring features
- Stopwords support
- "keywords" and "description" META tags support
- User defined META tag support.
- Reentrant: several indexer and search processes can run at the same time
- Continual indexing
- Indexing depth can be limited
- Robots exclusion standard support (both <META NAME="robots"> and robots.txt)
- HTML templates to easily customize search results
- Boolean query support
- Fuzzy search: different word forms, synonyms, substrings
- C CGI, PHP3, Perl search frontends
- Search on subsection of database
- Very flexible: mnoGoSearch can be configured to run in different modes, including 'ftpsearch' mode (searching through URLs rather than their content), 'link validation' mode (to check a site for bad references), and 'netminder' mode ("What's new since ...?"). There is also extended news support built into the package.
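A short sketch of the content-encoding item above (not mnoGoSearch's code): decode a fetched body according to its Content-Encoding header before parsing. The rare LZW 'compress' encoding is left out here.

```python
import gzip
import zlib

def decode_body(body: bytes, encoding: str) -> bytes:
    """Undo gzip/deflate Content-Encoding on a fetched document."""
    if encoding == "gzip":
        return gzip.decompress(body)
    if encoding == "deflate":
        try:
            return zlib.decompress(body)  # zlib-wrapped deflate
        except zlib.error:
            # some servers send raw deflate without the zlib header
            return zlib.decompress(body, -zlib.MAX_WBITS)
    return body  # identity or unknown encoding: pass through

assert decode_body(gzip.compress(b"<html>ok</html>"), "gzip") == b"<html>ok</html>"
```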
Appendix 6: Perlfect Features
-- From the Documentation of Perlfect
- Indexing system with support for ranking, using a document vector model, for relevant results (a toy ranking sketch follows this list).
- Internationalization (i18n), i.e. the result page can be in any language (templates for English, German, French, and Italian are included).
- UNIX and Windows compatible.
- Files can be indexed via the local filesystem or via http.
- Highlighting of search terms in the search result and in the pages.
- Full control of indexed content by allowing configurable exclusion of individual files and directories.
- A user-configurable list of stopwords allows you to exclude specific words from the index.
- Configurable minimum length restriction for indexing words.
- Can index PDF files (requires pdftotext, which is part of xpdf).
- The indexing process can be started via a web form.
- Users' search queries can be written to a log file.
- Highly optimized compact index.
- Advanced search query options, such as keyword forcing (+) and keyword exclusion (-).
- A fast and lightweight search algorithm can produce instant results even for large sites and for the most demanding queries.
- 100% customizable output layout generated based on user-defined HTML templates. You can fully define the look of the results page and the format of the individual results listings.
- Results display in multiple pages with a customizable number of results per page.
- Automatic installation and configuration utility to save you the trouble of installing the software on your server. Answering a few simple questions over a telnet/ssh prompt will get you up and running in minutes.
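To illustrate the document vector model named in the first item, here is a toy TF-IDF and cosine-similarity ranking sketch. Perlfect's actual weighting certainly differs in detail, and the two sample documents are invented.

```python
import math
from collections import Counter

docs = {  # hypothetical two-document corpus
    "a.html": "perl search engine script",
    "b.html": "search the web with a web search form",
}

# document frequency: in how many documents each word appears
df = Counter(w for text in docs.values() for w in set(text.split()))

def tfidf(text: str) -> dict:
    """Weight each word by term frequency times inverse document frequency."""
    tf = Counter(text.split())
    return {w: c * math.log(len(docs) / df[w]) for w, c in tf.items() if df[w]}

def cosine(a: dict, b: dict) -> float:
    dot = sum(wa * b.get(w, 0.0) for w, wa in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

vectors = {name: tfidf(text) for name, text in docs.items()}
query = tfidf("perl search")
for name in sorted(vectors, key=lambda n: -cosine(query, vectors[name])):
    print(name, round(cosine(query, vectors[name]), 3))  # a.html ranks first
```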
Appendix 7: SWISH-E Features
-- From the Documentation of SWISH-E
- Quickly index a large number of documents in different formats including text, HTML, and XML
- Use "filters" to index other types of files such as PDF, gzip, or PostScript.
- Includes a web spider for indexing remote documents over HTTP. Follows Robots Exclusion Rules (including <META> tags).
- Use an external program to supply documents to Swish-e, such as an advanced spider for your web server, or a program to read and format records from a relational database management system (RDBMS); a sketch of this interface follows this list.
- Document "properties" (some subset of the source document, usually defined as META tags or XML elements) may be stored in the index and returned with search results.
- Document summaries can be returned with each search
- Word stemming and soundex indexing
- Phrase searching and wildcard searching
- Limit searches to HTML links
- Use powerful Regular Expressions to select documents for indexing
- Easily limit searches to parts or all of your web site
- Results can be sorted by relevance or by any number of properties in ascending or descending order
- Limit searches to parts of documents such as certain HTML tags (META, TITLE, comments, etc.) or to XML elements.
- Can report structural errors in your XML and HTML documents
- Includes example search scripts
- Swish-e is fast.
- It's open source and FREE! You can customize Swish-e and you can contribute your fancy new features to the project.
- Supported by on-line user and developer groups
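A minimal sketch of the external document supplier mentioned above, in the shape the Swish-e documentation describes for its "prog" input method: each document goes to stdout behind a small header block. The Path-Name and Content-Length header names are taken from that documentation as best recalled, and the database rows are invented. Such a script would be invoked along the lines of `swish-e -S prog -i ./supply.py`.

```python
import sys

# Invented stand-ins for rows fetched from an RDBMS.
rows = [
    ("/db/article/1", "<html><title>First</title>first body</html>"),
    ("/db/article/2", "<html><title>Second</title>second body</html>"),
]

for path, doc in rows:
    # Header block, a blank line, then exactly Content-Length bytes.
    sys.stdout.write(f"Path-Name: {path}\n")
    sys.stdout.write(f"Content-Length: {len(doc.encode('utf-8'))}\n")
    sys.stdout.write("\n")
    sys.stdout.write(doc)
```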
Appendix 8: Webinator Features
-- From the Documentation of Webinator
- One or more web sites may be indexed into a single database.
- Multiple databases may be maintained.
- Support for cookies.
- Support for meta data.
- Support for proxy servers.
- Robots.txt and meta robots are respected.
- Totally customizable search interface.
- Totally customizable site walker/indexer.
- A web site may be copied to the local file system.
Appendix 9: WebGlimpse Features
-- From the Documentation of WebGlimpse
- Index by subdirectory OR traverse links to a specified depth.
- Free for nonprofits (EDU version).
- Flexibility: Very flexible rules for choosing links to index, filtering & excluding files.
- Can be used on Internet or Intranet.
- Full text indexing, robot spider and file system indexing.
- Index HTML documents, Word, PDF, and any other documents that can be filtered to plaintext.
- All single-byte languages can be indexed
- HTML character code support (ü, í, à, etc.).
- Reindex from crontab.
- Limit search to recent files.
- Supports standard Boolean operators and wildcards.
- HTTP and FTP support.
- Custom ranking; "keywords" and "description" META tags support.
- Robots exclusion standard support.
- Ability to search the "neighborhood" of any indexed page (if traversing links). You configure how many "hops" define a neighborhood; a hop-limited traversal sketch appears at the end of this appendix.
- Option to add neighborhood search boxes to all local pages indexed.
- Configurable for multiple domains on a single server.
- Hit highlighting and results in context.
- Uses the fast Glimpse search engine with many configurable options:
- Case-sensitivity
- Configurable # of misspellings
- Partial or whole match allowed
- Specify number of items returned
- Specify number of matches allowed per file
- Specify maximum number of characters printed per match
- Return only recent files
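Finally, a small sketch of the "neighborhood" idea from the WebGlimpse list above: a breadth-first walk over an already-crawled link graph, cut off after a configured number of hops. The graph below is invented for illustration.

```python
from collections import deque

links = {  # hypothetical pre-crawled link graph: page -> outgoing links
    "index.html": ["docs.html", "news.html"],
    "docs.html": ["api.html"],
    "api.html": ["index.html"],
    "news.html": [],
}

def neighborhood(start: str, max_hops: int) -> set:
    """Pages reachable from `start` in at most `max_hops` link hops."""
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        page, hops = queue.popleft()
        if hops == max_hops:
            continue
        for nxt in links.get(page, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, hops + 1))
    return seen

print(neighborhood("index.html", 1))  # {'index.html', 'docs.html', 'news.html'}
```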