Home
Guide
Tools List
News
Background
Search
About Us


SearchTools.com

Search Tools News 2009

July 2, 2009

Google AJAX CSE with CSS

Most of the elements of the Google Custom Search Engine (CSE) results are tagged with class names, designed to integrate with CSS. This means that the colors and elements of the results page can conform to site standards. There's also a visibility option for displaying full URLs in search results.

I've also updated the CSE AJAX Basic Example to show methods for opening clicked results pages in the same window (instead of a new one) and for adding a helpful note if the CSE can't find any pages matching the search terms.

June 22, 2009

Decoding the Google Custom Search AJAX API

Google has released a new version of their Custom/Site Search service, and added an "Element": a wizard-driven JavaScript snippet that non-technical users can copy and paste into their web sites, even blogs which do not allow uploading. Because this is much harder for non-programmers to customize than the forms or even the Site Search XML interface (paid version only), Search Tools has a new Analysis of the CSE and AJAX API, and fully-commented sample code with a live version on the same page. I'll be writing more on customization, functionality and display during this week.

Also coming soon, an updated version of my Google CSE review from 2007. New features include: limited on-demand indexing, Best Bets (promotions), synonyms, a new interface for filters (refinements), localization to 40 languages, and transliteration between character sets.

June 5, 2009

There was a Meetup of Lucene and Solr developers in San Francisco on June 3, and I wrote up some notes. Topics: Solr 1.4, Near-real-time indexing, Payload efficiency, TrieRange, Query parser framework, Zevents, Xoopit, Lucid search, Stopwords are obsolete, and OpenRelevance. The mood in the room was very positive: everyone seemed eager to make Lucene/Solr/etc. better and better. I may have made some converts in my quest to have people think about stopwords, and try indexing everything.

May 18, 2009

My overview of the state of Twitter Search is on infotoday.com. The current version of Twitter Search doesn't even try to do relevance ranking (results are sorted by timestamp), so it's not a Google killer yet, despite the hype.

April 20, 2009

At InfoToday.com, my article on Amazonfail: How Metadata and Sex Broke the Amazon Book Search.

April 2, 2009

A quick overview of two new textbooks on Information Retrieval

March 25, 2009

Openfind Enterprise Search (OES)

Openfind is a leading enterprise search engine company in Taiwan, providing search to many government departments and corporations since 1998, with a system scalable to over 50 million items in their standard licence. In addition to documents, it can index text and some numeric content from relational databases, off-loading the search and spreading the server load.

OES handles many languages, including English, Arabic, Japanese, Simplified Chinese, and Traditional Chinese. The search interface and admin interface are also available in both versions of Chinese.

The program has a long list of useful features, including indexing by file system UNC, robot crawling, near-real-time indexing of structured XML, and full-featured ODBC database connectors. It can read text, HTML, PDF, Open Office and Microsoft Office file formats, and has an API for adding other formats.

Search features include Internet Query Operators (+, -, "") and Boolean operators, including parentheses. Fields and metadata are searchable. Search admins can use the web interface to edit the lists of stopwords, synonyms, autocomplete items and related terms -- in any character set. Relevance ranking includes a term frequency algorithm and some heuristics. The results page UI is clean and simple, and some facets, including date and file type, are automatically generated.
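As an illustration of the Internet Query Operators mentioned above, here is a minimal Python sketch of how a query using +, - and "" might be split into required, excluded and optional terms. This is not Openfind's implementation: the names parse_query, ParsedQuery and matches are hypothetical, matching is done by plain substring test, and Boolean operators and parentheses are left out.

```python
import re
from dataclasses import dataclass, field

# One token: an optional + or - prefix, then a "quoted phrase" or a bare word.
TOKEN = re.compile(r'([+-]?)(?:"([^"]*)"|(\S+))')


@dataclass
class ParsedQuery:
    required: list = field(default_factory=list)  # +term: must appear
    excluded: list = field(default_factory=list)  # -term: must not appear
    optional: list = field(default_factory=list)  # bare term: may appear


def parse_query(query: str) -> ParsedQuery:
    """Split a query string into required, excluded and optional terms."""
    parsed = ParsedQuery()
    for sign, phrase, word in TOKEN.findall(query):
        term = (phrase or word).lower()
        if not term:
            continue  # skip empty phrases such as ""
        if sign == "+":
            parsed.required.append(term)
        elif sign == "-":
            parsed.excluded.append(term)
        else:
            parsed.optional.append(term)
    return parsed


def matches(parsed: ParsedQuery, text: str) -> bool:
    """Simplified matching: substring tests on lower-cased text."""
    haystack = text.lower()
    if any(term in haystack for term in parsed.excluded):
        return False
    if parsed.required:
        return all(term in haystack for term in parsed.required)
    # With no required terms, at least one optional term must appear.
    return any(term in haystack for term in parsed.optional)
```

For example, parse_query('+solr -"stop words" lucene') puts "solr" in the required list, the phrase "stop words" in the excluded list, and "lucene" in the optional list. A real engine would match against its index rather than raw text, and would use the optional terms for ranking.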

There is a good suite of metrics and reports, and excellent documentation. While it's not the first search engine to have localized interfaces (Ultraseek, Autonomy and Google come to mind), OES is certainly worth a look.

March 5, 2009

More on File Format Parsing

My new Guide article on File Format Parsing includes information about a commercial file parser (Divisor Offisor) and a new links listing of open-source file parsing tools.

February 27, 2009

Tika and access to text in many formats

Search engines need text to index: this may seem obvious but the devil is in the details. Extracting text is easy when working with txt, html or xml files, but much more difficult for binary files, including MS Office and archive formats. So search indexers need to use file format parsers, also called "filters". These can access the binary file formats, extracting the text and keeping track of whatever structure is there. Some file parsers are better than others, and all of them may need updating: when Microsoft switched from its proprietary binary format to its XML-based one (doc -> docx), search indexers needed updated filters to read the new format.
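To make the doc -> docx point concrete: a .docx file is a ZIP archive whose main body is stored as XML in word/document.xml, so a very small filter can be sketched with the Python standard library alone. This is only a sketch under simplifying assumptions, not how any particular search indexer does it: the function name extract_docx_text is hypothetical, and headers, footers, footnotes and formatting are ignored. The older binary .doc format cannot be read this way and needs a real parser.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# XML namespace used by WordprocessingML, the markup inside .docx files.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def extract_docx_text(data: bytes) -> str:
    """Return the body text of a .docx file, one line per paragraph."""
    # A .docx file is a ZIP archive; the main body lives in this member.
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        xml_bytes = archive.read("word/document.xml")

    root = ET.fromstring(xml_bytes)
    paragraphs = []
    # <w:p> elements are paragraphs; <w:t> elements hold the visible text.
    for para in root.iter(f"{{{W_NS}}}p"):
        runs = [node.text or "" for node in para.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)
```

Usage would look like extract_docx_text(open("report.docx", "rb").read()), where report.docx stands in for any local file. Keeping the paragraph breaks is a small example of "keeping track of whatever structure is there".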

Tika is the Lucene open source project for calling format parsers and returning the result as XHTML. It's a well-designed standardized interface, making use of existing open source file parsers, including Apache POI for Microsoft Office documents (old and new), and PDFBox for Adobe Acrobat-type files. There are a dozen other formats already supported, and the API makes it easy to add a custom file parser without having to write any special code in the indexer.
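The design idea described above, one standard interface in front of many format parsers with XHTML as the common output, can be sketched in a few lines of Python. This is not Tika's API (Tika is a Java library): the names register, to_xhtml and the two toy parsers are hypothetical, and the sketch picks a parser by file extension only, which is a simplification.

```python
import csv
import io
from html import escape
from pathlib import PurePath
from typing import Callable, Dict

# A parser takes raw file bytes and returns plain text.
Parser = Callable[[bytes], str]
_PARSERS: Dict[str, Parser] = {}


def register(extension: str):
    """Decorator that registers a parser for one file extension."""
    def decorator(func: Parser) -> Parser:
        _PARSERS[extension.lower()] = func
        return func
    return decorator


@register(".txt")
def parse_plain_text(data: bytes) -> str:
    return data.decode("utf-8", errors="replace")


@register(".csv")
def parse_csv(data: bytes) -> str:
    # Join the cells of each row with spaces, one row per line.
    rows = csv.reader(io.StringIO(data.decode("utf-8", errors="replace")))
    return "\n".join(" ".join(row) for row in rows)


def to_xhtml(filename: str, data: bytes) -> str:
    """Pick a parser by extension and wrap its text in minimal XHTML."""
    ext = PurePath(filename).suffix.lower()
    parser = _PARSERS.get(ext)
    if parser is None:
        raise ValueError(f"no parser registered for {ext!r}")
    text = parser(data)
    paragraphs = "".join(
        f"<p>{escape(line)}</p>" for line in text.splitlines() if line.strip()
    )
    return (
        '<html xmlns="http://www.w3.org/1999/xhtml"><head><title>'
        + escape(filename)
        + "</title></head><body>"
        + paragraphs
        + "</body></html>"
    )
```

The indexer only ever calls to_xhtml, so supporting a new format means registering one more parser function, with no change to the indexing code.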

-- For more, please see the File Format Parsing article.



SearchTools News, like a Blog, since 1998. / RSS feed

Site change details

For earlier news, see the 2008, 2007, 2006, 2005, 2004, 2003, 2002, 2001, 2000, 1999 and 1998 news archive pages

Last Update: 2009-07-02

Creative Commons License. This information copyright © 2009 Avi Rappoport, Search Tools Consulting. Some Rights Reserved, under the Creative Commons Attribution-Share Alike 3.0 United States License. On re-publication, attribute copied content to the page's full URL. Permissions beyond the scope of this license are available upon request.