As of January, 2012, this site is no longer being updated, due to work and health issues
Search Tools Product Report
DataparkSearch Engine
Product Information
Price: Free.
Platform: Unix: CentOS, FreeBSD, Red Hat 8 and 9, Fedora, Solaris 9 and 10, Mandrake (Mandriva), etc.
DataparkSearch Engine is an open source search engine written in C. It is an offshoot of the mnoGoSearch project.
Features
- Supports the protocols: http, https (SSL), ftp, nntp and news
- htdb virtual URL scheme support for indexing SQL databases
- Handles Internationalized Domain Names.
- Uses If-Modified-Since for efficient transfer of only changed files.
- Can tweak URLs with session IDs and other weird formats, including some JavaScript link decoding
- Can perform parallel and multi-threaded indexing for faster updating.
- Flexible update scheduling, including options for checking some sections of a site more frequently
- Handles basic authentication (user name and password) and cookies
- Indexes plain text, HTML, XML, MP3s, and GIFs natively.
- External parser support for other document types, including Microsoft Word, Excel, RTF, PowerPoint, Adobe Acrobat PDF and Flash.
- Stores a compressed text version of the documents for extracting and viewing
- Can specify a default character set and language for a server or subdirectory, or a list of possible languages.
- Automated character set and language detection using the lead developer's N-Gram-Based Text Categorization technique.
- Index is stored in a database, which can be MySQL, PostgreSQL, iODBC, unixODBC, EasySoft ODBC-ODBC Bridge, InterBase, Oracle, or MS SQL
- Provides word segmenting (tokenizing) for Chinese, Japanese, Korean and Thai.
- Noindex tags: <!--UdmComment-->, <NOINDEX>, <!--noindex-->
- Can specify a content body tag
- Summary Extraction Algorithm automatically sums up each document in three sentences.
- Options to query with all words, any words, or Boolean queries.
- Spellchecking with ispell or aspell
- Synonym, acronym and abbreviation query expansion based on editable dictionaries, specified by language and charset
- Multi-language search pages detect browser language preferences.
- Offers an accent-insensitive search option.
- Complex options for search zones
- Fuzzy searching based on acronyms and abbreviations.
- Results can be sorted by relevancy (using vector calculation), by popularity rank (either "Goo", which adds weight for incoming links, or "Neo", which uses neural network analysis), by last modified time, or by "importance" (a combination of relevancy and popularity rank).
- Interesting and innovative templates for search results pages, flexible options and commands for customization
- Can scale to at least 300,000 pages.
- Effective caching significantly reduces search times.
- Mirror indexes can distribute the search load.
- Includes an indexer and a web CGI front-end, as well as a search module for the Apache web server.
- Distribution is by source: you must compile and make the binaries locally.
- Admin is by config files and runtime parameters
- Query logging stores the query and the number of results found.
- The DataparkSearch Reference Manual is well done (a Russian version is available as well).
- There are active and searchable forums in English, Russian, and Spanish; there is also a wiki.
- The web site is supported by Google Ads.
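To make the noindex tags in the feature list concrete, here is a minimal Python sketch of how a page author's marked regions could be dropped before text is indexed. It is an illustration only, not DataparkSearch's own code (the engine is written in C). The three opening markers are the ones listed above; the matching closing markers and the helper name strip_noindex are assumptions made for this example.

```python
import re

# Opening markers come from the feature list above.
# The closing markers are ASSUMED for this sketch.
NOINDEX_PAIRS = [
    ("<!--UdmComment-->", "<!--/UdmComment-->"),
    ("<NOINDEX>", "</NOINDEX>"),
    ("<!--noindex-->", "<!--/noindex-->"),
]


def strip_noindex(html: str) -> str:
    """Return html with every properly closed no-index region removed.

    Hypothetical helper: matching is case-insensitive and spans newlines.
    A region with no closing marker is left untouched.
    """
    for opener, closer in NOINDEX_PAIRS:
        # Non-greedy match so that two separate regions are not merged.
        pattern = re.escape(opener) + r".*?" + re.escape(closer)
        html = re.sub(pattern, "", html, flags=re.DOTALL | re.IGNORECASE)
    return html


if __name__ == "__main__":
    page = (
        "<p>Article text to index.</p>"
        "<!--noindex--><ul><li>Navigation menu</li></ul><!--/noindex-->"
    )
    print(strip_noindex(page))
```

Wrapping navigation menus, footers and similar repeated page furniture in such markers keeps them from matching queries on every page of a site.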
Articles and Reviews
- In 2005, DataparkSearch participated in the US National Institute of Standards and Technology's Text Retrieval Conference (TREC). Their submission is available in PDF.
Examples