Home
Guide
Tools List
News
Background
Search
About Us


SearchTools.com

Search Tools News 2009

Like a blog, since 1998

September 24, 2009

Apache Lucene is the most prominent open source search engine, and powers search on a lot of really interesting sites. The new version, 2.9, has internal improvements, re-factoring and new functionality.

Lucene 2.9 Features

  • "Near real-time" search: a new way to search the current in-memory segment before the index has been written to disk.
  • FieldCache - takes advantage of the fact that most segments of the index are static and only processes the parts that change, saving time and memory. Efficiency is also improved.
  • NumericField and NumericRangeQuery - (previously called TrieRange). This improves the Lucene number indexing, and is faster for searching numbers, geo-locations, and dates, faster for sorting, and hugely faster for range searching.
  • Faster wildcard and prefix searching, and a reverse string filter to enable leading wildcards
  • Lucene Local (Contrib / Spatial) - can limit queries based on geographic location
  • Faster searching over multiple segments
  • Better and faster term vector highlighting of match terms in context on results page.
  • New Query Parser framework, supports additional syntaxes
  • Improvements to Payloads (metadata about index terms)
  • TokenStream strong typing options
  • Improved transaction processing
  • Better Chinese, Arabic, and Persian support
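
The reverse string filter in the list above is easier to picture with a toy example. This is only a sketch of the general idea in Python, not Lucene's code or API: the class name `ReversedTermIndex` and its methods are made up for illustration, and sorted lists stand in for a real term dictionary. Each term is also stored reversed, so a leading wildcard such as `*ing` can be answered as an ordinary prefix scan for `gni` over the reversed terms.

```python
from bisect import bisect_left


class ReversedTermIndex:
    """Toy term dictionary that also keeps every term reversed (hypothetical, not Lucene)."""

    def __init__(self, terms):
        unique = set(terms)
        # Sorted lists stand in for an index's sorted term dictionary.
        self.forward = sorted(unique)
        self.reversed = sorted(t[::-1] for t in unique)

    @staticmethod
    def _prefix_scan(sorted_terms, prefix):
        # Binary-search to the first term >= prefix, then walk while terms still match.
        matches = []
        i = bisect_left(sorted_terms, prefix)
        while i < len(sorted_terms) and sorted_terms[i].startswith(prefix):
            matches.append(sorted_terms[i])
            i += 1
        return matches

    def prefix(self, prefix):
        """Trailing wildcard, e.g. 'search*'."""
        return self._prefix_scan(self.forward, prefix)

    def suffix(self, suffix):
        """Leading wildcard, e.g. '*ing', run as a prefix scan over reversed terms."""
        hits = self._prefix_scan(self.reversed, suffix[::-1])
        return sorted(h[::-1] for h in hits)


index = ReversedTermIndex(["indexing", "searching", "search", "ranking", "index"])
print(index.prefix("search"))  # ['search', 'searching']
print(index.suffix("ing"))     # ['indexing', 'ranking', 'searching']
```

The trade-off in this sketch is a second copy of the term list in exchange for leading-wildcard lookups that no longer need to scan every term.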

Backward and Forward Compatibility

There are significant changes in version 2.9, described in the changes.txt file or on the web site (change log). A very few items are not backward compatible and several classes are deprecated. All applications should re-compile against the new Lucene 2.9 JAR and test carefully.

Version 3.0 will no longer support Java 1.4 or the deprecated classes.

As soon as Lucene 2.9 is released:

  • Carrot2 3.1.0 will come out with bug fixes
  • Solr 1.4 will use the Lucene 2.9 JAR; it is coming soon, within a few weeks they hope

Note: this content was extracted painfully by Avi from the Lucene site/wiki/JIRA/mailing list archive, and clarified by Grant Ingersoll's webcast sponsored by Lucid Imagination. I will be happy to fix mistakes and clarify confusion: just comment or send a message.

July 24, 2009

Four kinds of Google CSE Interfaces

  • Simple form, showing results on a normal Google-hosted page with minimal customization (example)

  • Form with links to a template page, JavaScript inserts iframe with search results pre-formatted (iframe example).

  • Custom Search Element - AJAX object draws a search form, JavaScript can draw result list anywhere (AJAX example).

  • XML query and result protocol (paid Site Search only) is by far the most comprehensive and flexible.

July 2, 2009

Google AJAX CSE with CSS

Most of the elements of the Google Custom Search Engine (CSE) results are tagged with class names, designed to integrate with CSS. This means that the colors and elements of the results page can conform to site standards. There's also a visibility option for displaying full URLs in search results.

I've also updated the CSE AJAX Basic Example to show methods for opening clicked results pages in the same window (instead of a new one) and for adding a helpful note if the CSE can't find any pages matching the search terms.

June 22, 2009

Decoding the Google Custom Search AJAX API

Google has released a new version of their Custom/Site Search service, and added an "Element" -- a wizard-driven JavaScript snippet that non-technical users can copy and paste into their web sites, even blogs which do not allow uploading. This is much harder for non-programmers to customize than the forms or even the Site Search XML interface (paid version only), so Search Tools has a new Analysis of the CSE and AJAX API, with fully-commented sample code and a live version on the same page. I'll be doing more on customization, functionality and display this week.

Also coming soon: an updated version of my Google CSE review from 2007. New features include limited on-demand indexing, Best Bets (promotions), synonyms, a new interface for filters (refinements), localization to 40 languages, and transliteration between character sets.

June 5, 2009

There was a Meetup of Lucene and Solr developers in San Francisco on June 3, and I wrote up some notes. Topics: Solr 1.4, Near-real-time indexing, Payload efficiency, TrieRange, Query parser framework, Zevents, Xoopit, Lucid search, Stopwords are obsolete, and OpenRelevance. The mood in the room was very positive, everyone seemed eager to make Lucene/Solr/etc. better and better. I may have made some converts in my quest to have people think about stopwords, and try indexing everything.

May 18, 2009

My overview of the state of Twitter Search on infotoday.com. The current version of Twitter Search doesn't even try to do relevance ranking (results are sorted by timestamp), so it's not a Google killer yet, despite the hype.

April 20, 2009

At InfoToday.com, my article on Amazonfail: How Metadata and Sex Broke the Amazon Book Search.

April 2, 2009

A quick overview of two new textbooks on Information Retrieval

March 25, 2009

Openfind Enterprise Search (OES)

Openfind is a leading enterprise search engine company in Taiwan, providing search to many government departments and corporations since 1998, with a system scalable to over 50 million items in their standard licence. In addition to documents, it can index text and some numeric content from relational databases, off-loading the search and spreading the server load.

OES handles many languages, including English, Arabic, Japanese, Simplified Chinese, and Traditional Chinese. The search interface and admin interface are also available in both versions of Chinese.

The program has a long list of useful features, including indexing by file system UNC, robot crawling, near-real-time indexing of structured XML, and full-featured ODBC database connectors. It can read text, HTML, PDF, Open Office and Microsoft Office file formats, and has an API for adding other formats.

Search features include Internet Query Operators (+, -, "") and Boolean operators, including parentheses. Fields and metadata are searchable, and search admins can use the web interface to edit the lists of stopwords, synonyms, autocomplete items and related terms -- in any character set. Relevance ranking includes a term frequency algorithm and some heuristics. The results page UI is clean and simple, and some facets, including date and file type, are automatically generated.
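
For readers who haven't met the Internet Query Operators, here is a rough Python sketch of how +, - and "" are commonly interpreted. It is not how OES implements them: the names `parse_query` and `matches` are invented for this example, and the matching is a naive substring test rather than an index lookup.

```python
import re

# A quoted phrase or a bare word, each with an optional leading + or -.
TOKEN = re.compile(r'([+-]?)(?:"([^"]*)"|(\S+))')


def parse_query(query):
    """Split a query into required (+), excluded (-), and optional clauses."""
    parsed = {"required": [], "excluded": [], "optional": []}
    for sign, phrase, word in TOKEN.findall(query):
        text = phrase if phrase else word
        if not text:
            continue  # skip empty quotes such as ""
        bucket = {"+": "required", "-": "excluded"}.get(sign, "optional")
        parsed[bucket].append(text)
    return parsed


def matches(parsed, document):
    """Naive substring matching on lowercased text; a real engine would use its index."""
    text = document.lower()

    def has(clause):
        return clause.lower() in text

    if any(has(c) for c in parsed["excluded"]):
        return False
    if not all(has(c) for c in parsed["required"]):
        return False
    # With no required clauses, at least one optional clause must match.
    return bool(parsed["required"]) or any(has(c) for c in parsed["optional"])


query = parse_query('+lucene -solr "open source" search')
print(query)
print(matches(query, "Lucene is an open source search library"))
```

In this sketch the optional clauses only decide whether a document matches when nothing is marked as required; in a real engine they would normally feed into relevance ranking as well.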

There is a good suite of metrics and reports, and excellent documentation. While it's not the first search engine to have localized interfaces (Ultraseek, Autonomy and Google come to mind), OES is certainly worth a look.

March 5, 2009

More on File Format Parsing

My new Guide article on File Format Parsing includes information about a commercial file parser (Divisor Offisor) and a new links listing of open-source file parsing tools.

February 27, 2009

Tika and access to text in many formats

Search engines need text to index: this may seem obvious, but the devil is in the details. Extracting text is easy when working with txt, html or xml files, but much more difficult for binary files, including MS Office and archive formats. So search indexers need to use file format parsers, also called "filters". These can access the binary file formats, extracting the text and keeping track of whatever structure is there. Some file parsers are better than others, and all of them may need updating: when Microsoft switched from its proprietary binary format to XML-based files (doc -> docx), search indexers needed updated filters to read the new formats.

Tika is the Lucene open source project for calling format parsers and returning the result as XHTML. It's a well-designed standardized interface, making use of existing open source file parsers, including Apache POI for Microsoft Office documents (old and new), and PDFBox for Adobe Acrobat-type files. There are a dozen other formats already supported, and the API makes it easy to add a custom file parser without having to write any special code in the indexer.
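
To make the "standardized interface" idea concrete, here is a small Python sketch of the pattern. It is not Tika's API (Tika is a Java library): the registry `PARSERS`, the `register` decorator, `extract_xhtml` and the `text/x-example-csv` type are all hypothetical names for illustration. The indexer calls one entry point and gets XHTML back, while format parsers are plugged in behind it.

```python
from html import escape

# Registry mapping a MIME type to a parser function (bytes -> list of paragraphs).
PARSERS = {}


def register(mime_type):
    """Decorator that adds a parser function to the registry."""
    def wrap(func):
        PARSERS[mime_type] = func
        return func
    return wrap


@register("text/plain")
def parse_plain_text(data):
    # Blank lines separate paragraphs.
    text = data.decode("utf-8")
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def extract_xhtml(mime_type, data):
    """Single entry point for the indexer: pick a parser, return XHTML."""
    parser = PARSERS.get(mime_type)
    if parser is None:
        raise ValueError(f"no parser registered for {mime_type}")
    body = "".join(f"<p>{escape(p)}</p>" for p in parser(data))
    return f'<html xmlns="http://www.w3.org/1999/xhtml"><body>{body}</body></html>'


# Adding a custom format needs no change to extract_xhtml or to the indexer.
@register("text/x-example-csv")  # hypothetical MIME type for this sketch
def parse_csv_rows(data):
    return [line.replace(",", " ") for line in data.decode("utf-8").splitlines() if line]


print(extract_xhtml("text/plain", b"Tom & Jerry\n\nSecond paragraph"))
```

The point of the design is that the indexer only ever sees one output format, so supporting a new file type means registering one more parser rather than touching the indexing code.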

-- For more, please see the File Format Parsing article.



For earlier news, see the 2008, 2007, 2006, 2005, 2004, 2003, 2002, 2001, 2000, 1999 and 1998 news archive pages

Last Update: 2010-02-08

Avi Rappoport of Search Tools Consulting can help you evaluate your search engine, whether it's on a site, portal, intranet, or Enterprise. Please contact SearchTools for more information.


This information copyright © 1998-2011 by Avi Rappoport, Search Tools Consulting. Some Rights Reserved under the Creative Commons Attribution-Share Alike 3.0 United States License. For allowed re-uses, just attribute copied content to the page's full URL. Permissions beyond the scope of this license are available upon request.