Apache Lucene is the most prominent open source search engine, and powers search on a lot of really interesting sites. The new version, 2.9, brings internal improvements, refactoring, and new functionality.
Lucene 2.9 Features
"Near real-time" search: a new way to search the current in-memory segment before the index has been written to disk.
FieldCache - takes advantage of the fact that most segments of the index are static and only processes the parts that change, saving time and memory. Efficiency is also improved.
NumericField and NumericRangeQuery (previously called TrieRange) - these improve Lucene's number indexing: faster searching on numbers, geo-locations, and dates, faster sorting, and hugely faster range searching.
Faster wildcard and prefix searching, and a reverse string filter to enable leading wildcards
Lucene Local (Contrib / Spatial) - can limit queries based on geographic location
Faster searching over multiple segments
Better and faster term vector highlighting of match terms in context on results page.
New Query Parser framework, supports additional syntaxes
Improvements to Payloads (metadata about index terms)
TokenStream strong typing options
Improved transaction processing
Better Chinese, Arabic, and Persian support
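The reverse string filter in the list above relies on a simple trick: if every term is also indexed in reversed form, a leading-wildcard pattern such as `*ing` becomes an ordinary prefix lookup for `gni`. The following is a minimal Python sketch of that idea, not Lucene code; the function names are made up for illustration.

```python
from bisect import bisect_left


def build_reversed_index(terms):
    """Return the distinct terms, each reversed, in sorted order."""
    return sorted({t[::-1] for t in terms})


def prefix_search(sorted_terms, prefix):
    """Return all entries of a sorted list that start with prefix."""
    start = bisect_left(sorted_terms, prefix)
    matches = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix):
            break  # sorted order: no later entry can match
        matches.append(term)
    return matches


def leading_wildcard(reversed_index, pattern):
    """Answer a pattern like '*ing' with a prefix search on reversed terms."""
    if not pattern.startswith("*") or "*" in pattern[1:]:
        raise ValueError("expected a single leading '*'")
    suffix = pattern[1:]
    hits = prefix_search(reversed_index, suffix[::-1])
    # Reverse the hits back into their original spelling.
    return sorted(h[::-1] for h in hits)


index = build_reversed_index(
    ["searching", "indexing", "search", "index", "ranking"]
)
print(leading_wildcard(index, "*ing"))
# prints ['indexing', 'ranking', 'searching']
```

The cost is a second copy of each term in the index, traded for avoiding a scan of the whole term dictionary.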
Backward and Forward Compatibility
There are significant changes in version 2.9, described in the changes.txt file or on the web site (change log). A very few items are not backward compatible, and several classes are deprecated.
All applications should re-compile against the new Lucene 2.9 JAR and test carefully.
Version 3.0 will no longer support Java 1.4 and deprecated classes.
As soon as Lucene 2.9 is released:
Carrot2 3.1.0 will come out with bug fixes
Solr 1.4 will use the Lucene 2.9 JAR; it is coming soon, within a few weeks they hope
Note: this content was extracted painfully by Avi from the Lucene site/wiki/JIRA/mailing list archive, and clarified by Grant Ingersoll's webcast sponsored by Lucid Imagination. I will be happy to correct mistakes and clear up confusion: just comment or send a message.
Most of the elements of the Google Custom Search Engine (CSE) results are tagged with class names, designed to integrate with CSS. This means that the colors and elements of the results page can conform to site standards. There's also a visibility option for displaying full URLs in search results.
I've also updated the CSE AJAX Basic Example to show methods for opening clicked results pages in the same page (instead of a new one) and to add a helpful note if the CSE can't find any pages matching the search terms.
June 22, 2009
Decoding the Google Custom Search AJAX API
Also coming soon: an updated version of my Google CSE review from 2007. New features include limited on-demand indexing, Best Bets (promotions), synonyms, a new interface for filters (refinements), localization to 40 languages, and transliteration between character sets.
My overview of the state of Twitter Search is on infotoday.com. The current version of Twitter Search doesn't even try to do relevance ranking (it's sorted by timestamp), so it's not a Google killer yet, despite the hype.
Openfind is a leading enterprise search engine company in Taiwan, providing search to many government departments and corporations since 1998, with a system scalable to over 50 million items in their standard licence. In addition to documents, it can index text and some numeric content from relational databases, off-loading the search and spreading the server load.
OES not only handles many languages, including English, Arabic, Japanese, Simplified Chinese, and Traditional Chinese; its search interface and admin interface are also available in both versions of Chinese.
The program has a long list of useful features, including indexing by file system UNC, robot crawling, near-real-time indexing of structured XML, and full-featured ODBC database connectors. It can read text, HTML, PDF, Open Office and Microsoft Office file formats, and has an API for adding other formats.
Search features include Internet Query Operators (+, -, "") and Boolean operators, including parentheses. Fields and metadata are searchable. Search admins can use the web interface to edit the lists of stopwords, synonyms, autocomplete items and related terms -- in any character set. Relevance ranking includes a term frequency algorithm and some heuristics, the results page UI is clean and simple, and some facets, including date and file type, are automatically generated.
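To illustrate the Internet-style operators mentioned above, here is a minimal Python sketch of how `+`, `-` and quoted phrases might be parsed and applied to a document. It is not Openfind's implementation; the function names and the matching rules (case-insensitive whole-word matching, and at least one optional item required when nothing is marked with `+`) are assumptions made for the example.

```python
import re

# One token: an optional +/- sign, then a quoted phrase or a bare word.
TOKEN = re.compile(r'([+-]?)(?:"([^"]*)"|(\S+))')


def parse_query(query):
    """Split a query into required, excluded and optional items."""
    parsed = {"required": [], "excluded": [], "optional": []}
    for sign, phrase, word in TOKEN.findall(query):
        text = (phrase or word).lower()
        if not text:
            continue  # skip empty quotes
        key = {"+": "required", "-": "excluded"}.get(sign, "optional")
        parsed[key].append(text)
    return parsed


def contains(document, text):
    """True if text occurs in document on word boundaries."""
    pattern = r"\b" + re.escape(text) + r"\b"
    return re.search(pattern, document.lower()) is not None


def matches(document, parsed):
    """Apply the parsed operators to a single document."""
    if any(contains(document, t) for t in parsed["excluded"]):
        return False
    if not all(contains(document, t) for t in parsed["required"]):
        return False
    # Assumption: with no required items, one optional item must match.
    if not parsed["required"] and parsed["optional"]:
        return any(contains(document, t) for t in parsed["optional"])
    return True


query = parse_query('+lucene -solr "near real-time" search')
print(query)
print(matches("Lucene adds near real-time search", query))
```

A real engine would evaluate these operators against an inverted index rather than scanning document text, and would use the optional items for ranking.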
There is a good suite of metrics and reports, and excellent documentation. While it's not the first search engine to have localized interfaces (Ultraseek, Autonomy and Google come to mind), OES is certainly worth a look.
Search engines need text to index: this may seem obvious, but the devil is in the details. Extracting text is easy when working with txt, html or xml files, but much more difficult for binary files, including MS Office and archive formats. So search indexers need to use file format parsers, also called "filters". These can access the binary file formats, extracting the text and keeping track of whatever structure is there. Some file parsers are better than others, and all of them may need updating: when Microsoft switched from its proprietary format to its XML-based one (doc -> docx), search indexers needed updated filters to read the new format.
Tika is the Lucene open source project for calling format parsers and returning the result as XHTML. It's a well-designed standardized interface, making use of existing open source file parsers, including Apache POI for Microsoft Office documents (old and new), and PDFBox for Adobe Acrobat-type files. There are a dozen other formats already supported, and the API makes it easy to add a custom file parser without having to write any special code in the indexer.
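The design described above, one interface that picks a format parser and returns XHTML, can be sketched in a few lines of Python. This is only an illustration of the pattern, not Tika's API (Tika is a Java library); the registry, the decorator, and the `.csv` parser are all hypothetical, and the sketch chooses a parser by file extension only.

```python
from html import escape
from pathlib import Path

# Registry mapping a file extension to a text-extraction function.
PARSERS = {}


def register(*extensions):
    """Decorator that registers a parser for one or more extensions."""
    def decorator(func):
        for ext in extensions:
            PARSERS[ext.lower()] = func
        return func
    return decorator


@register(".txt")
def parse_plain_text(data):
    """Treat blank-line-separated blocks as paragraphs."""
    text = data.decode("utf-8", errors="replace")
    return [block.strip() for block in text.split("\n\n") if block.strip()]


@register(".csv")
def parse_csv(data):
    """Hypothetical custom parser: one paragraph per row."""
    text = data.decode("utf-8", errors="replace")
    return [" ".join(line.split(",")) for line in text.splitlines() if line]


def to_xhtml(filename, data):
    """Pick a parser by extension and wrap its output in XHTML."""
    ext = Path(filename).suffix.lower()
    parser = PARSERS.get(ext)
    if parser is None:
        raise ValueError(f"no parser registered for {ext!r}")
    body = "".join(f"<p>{escape(p)}</p>" for p in parser(data))
    return (
        '<html xmlns="http://www.w3.org/1999/xhtml">'
        f"<head><title>{escape(filename)}</title></head>"
        f"<body>{body}</body></html>"
    )


print(to_xhtml("notes.txt", b"Search engines need text.\n\nParsers extract it."))
```

The indexer only ever calls `to_xhtml`, so supporting a new format means registering one more function, with no changes to the indexing code.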