
As of January, 2012, this site is no longer updated, due to work and health issues.

Search Tools News 2010

Like a blog, since 1998

Big Data: UK Web Archive goes online

The British Library and IBM are working together on the UK Web Archive, which will store all accessible UK web pages, providing researchers with a great repository of British academic work, e-commerce, opinions, and popular culture that may change radically or disappear without notice.

IBM is providing software expertise and using the archive as a testbed for text-mining Big Data, estimating that it will be adding 220 terabytes per year as of 2011. IBM's BigSheets (presumably a pun on Google's Bigtable) includes both open and closed source software. They have shown various interfaces including spreadsheets, tag clouds, and multi-bubble charts.

I wrote an article about it for InfoToday: British Library and IBM Team Up on Web Archiving Project, and I wrote some additional notes and thoughts.

UI: Google's real-time items in search results page

Enterprise search users will expect real-time results from their search engine. I take a look at the way Google has implemented it this week.

"no matches" Interfaces for Search: a Pick and a Pan

On your site, intranet or enterprise search engine, what happens if a search engine finds no match for the search terms?

Below the link are two different approaches, one slightly verbose and the other so terse as to be baffling. Look at them, look at yours, look at my page on good things to do with the no matches page, and see if there's something you can do better.

screenshots of good and bad interfaces
to deal with no matches for a search

Readers: if you have any good or bad examples, or "before" and "after" screenshots, link me to them, please! I'll post the best ones, by which I mean both good helpful interfaces and really awful ones.
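
A minimal sketch of a more helpful "no matches" response, for comparison with the terse kind. It assumes a toy in-memory index and two fallbacks chosen only for illustration (retrying with any-word matching, and naming the words that were not found); the function names, the `INDEX` data, and the message wording are all hypothetical, not taken from either screenshot or from any particular search engine.

```python
def search_all_terms(index, terms):
    """AND search: ids of documents containing every term."""
    sets = [index.get(t, set()) for t in terms]
    return set.intersection(*sets) if sets else set()


def search_any_term(index, terms):
    """OR search: ids of documents containing at least one term."""
    sets = [index.get(t, set()) for t in terms]
    return set.union(*sets) if sets else set()


def respond(index, query):
    """Run the query and, on zero hits, fall back instead of giving up."""
    terms = query.lower().split()
    hits = search_all_terms(index, terms)
    if hits:
        return {"status": "ok", "hits": sorted(hits),
                "unmatched_terms": [], "message": ""}

    # Fallback (an assumption): relax to any-word matching and report
    # which words are not in the index at all.
    partial = search_any_term(index, terms)
    unmatched = [t for t in terms if t not in index]
    if partial:
        message = (f'No pages matched all of "{query}". '
                   "Showing pages that match some of the words.")
    else:
        message = (f'No pages matched "{query}". '
                   "Check the spelling, or try fewer or different words.")
    return {"status": "partial" if partial else "none",
            "hits": sorted(partial),
            "unmatched_terms": unmatched,
            "message": message}


# Hypothetical toy index: term -> set of document ids.
INDEX = {"solr": {1, 2}, "faceted": {2}, "search": {1, 2, 3}}

partial_result = respond(INDEX, "solr pricing")
empty_result = respond(INDEX, "xyzzy")
```

The point of the sketch is only that the zero-hit branch says what happened and offers a next step, rather than printing a bare "no results".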


The stopwords section of Marti Hearst's Search User Interfaces points out that ignoring stopwords is opaque to users, who expect that if they type even the words "a", "an", and "the", the search engine will find them.

In a famous example in the early days of Web search, a searcher who typed “to be or not to be” in a search engine would be shocked to be served empty results. In 1996, a review of eight major search engines found that only AltaVista could handle the Hamlet quote; all others ignored stopwords (Peterson, 1997, Sherman, 2001).

Web search engines have since found efficient ways to index and store even stopwords, because once in a while they are essential to the query. Smaller search engines should follow their lead.
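
The effect is easy to reproduce. The following is a minimal sketch, assuming a toy inverted index with all-terms (AND) matching and a small hand-picked stopword list; the `STOPWORDS` set, the two sample documents, and the function names are illustrative assumptions, not how any of the engines in the 1996 review worked.

```python
from collections import defaultdict

# Hypothetical stopword list, just large enough for the example.
STOPWORDS = {"a", "an", "and", "be", "not", "or", "the", "to"}


def tokenize(text):
    return text.lower().split()


def build_index(docs, drop_stopwords):
    """Map each term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            if drop_stopwords and term in STOPWORDS:
                continue  # the term is never stored, so it can never match
            index[term].add(doc_id)
    return index


def search(index, query, drop_stopwords):
    """Return ids of documents containing every remaining query term."""
    terms = tokenize(query)
    if drop_stopwords:
        terms = [t for t in terms if t not in STOPWORDS]
    if not terms:
        return set()  # nothing left to match: the surprising empty result
    result = None
    for term in terms:
        postings = set(index.get(term, set()))
        result = postings if result is None else result & postings
    return result


DOCS = {
    1: "To be or not to be that is the question",
    2: "The question of search relevance",
}

full_index = build_index(DOCS, drop_stopwords=False)
stopped_index = build_index(DOCS, drop_stopwords=True)

hamlet_full = search(full_index, "to be or not to be", drop_stopwords=False)
hamlet_stopped = search(stopped_index, "to be or not to be", drop_stopwords=True)
```

With stopwords indexed, the Hamlet query finds document 1; with stopwords dropped, every query term is discarded and the searcher gets nothing back, with no explanation of why.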

Posted from Diigo. The rest of my favorite links are here.
Speller Challenge (spellchecking algorithms for search)
The Speller Challenge  - build the best speller that proposes the most plausible spelling alternatives for each search query.  It uses the TREC 2008 Million Query Track for training and the Bing Test Dataset for evaluation.  The first prize is $10,000 and the gratitude of the orthographically-challenged.
SearchTools links & notes on Diigo
Slightly circular to link to my library, but it's much shorter.
presentation files SFBay Apache Lucene/Solr Meetups
Interesting presentations, some on Lucene or Solr, some on more general questions of scaling search, relevancy, etc.
ManifoldCF - Apache incubation (formerly Lucene Connectors Framework)
Open-source project for connecting to Documentum, FileNet, LiveLink, Meridio, JDBC, Windows fileshares, and SharePoint.  Good for search engines needing to index content from these repositories, includes code for Solr indexing.  
Semantics in Practice - Enterprise Search Center
Definitions of semantic technology and descriptions of applications such as recommender systems, especially for encouraging longer visits to newspaper and content sites.  I have yet to see evidence that natural language search is particularly useful, but enriched retrieval and relevance is always good.
IBM - A comparison of collection types in OmniFind Enterprise Edition, Version 9.1
IBM documentation, summer 2010
Google Mini - Help Center
Help specifically for the Google Mini - small version of the Google Search Appliance
Duplicate content: causes and solutions - SEO 101 - Yoast
Very clear and straightforward explanation of how duplicate documents cause problems, where they come from, and how to cope with them.
Searching for Leisure, Travelmatch and Real-time search (meetup writeup)
Notes from presentations by Stephen Arnold on real-time search, Martina Schell about UI discoveries at Travelmatch, and Max Wilson on searching for "leisure" purposes.
"Understand your data!" - Iain Fletcher on optimising search technology at Online Information 2010 - Martin Belam's currybetdotnet blog - December 7, 2010
Notes from a talk at the Online Information conference last week, mainly on the differences between web (Google) search and enterprise search.  
Defining Your Search Engine - Google Custom Search APIs and Tools - Google Code
The "context" file is an XML configuration file for Google Custom Search Engines: the same configuration as the browser Control Panel, but flattened out so it's easier to generate programmatically and add a bunch of repeated settings. I also use this for version control.
Solr Digest, November 2010 « Sematext Blog
Bug fixes and new features in Solr include Polish language stemming, fixes for shard problems, sorting fixes, spatial search and a memory bug.  It also links to an interesting discussion on near-real-time search.
Upcoming Industry Events [and academic conferences]
InfoToday.com hosts this very nice listing of information access conferences, worldwide.  
Information flow part 4: Search statistics for our enterprise search - sys 64738
Puts search log analytics into context, very helpful article.
Large Scale Search Blog | www.hathitrust.org
HathiTrust is using Lucene/Solr to index a huge digital library, getting bigger all the time.  The blog covers both user oriented features and the logistical and tactical challenges of dealing with billions of tokens, in clear language rather than technical jargon.
Stopwords section - Search User Interfaces | Marti Hearst | Cambridge University Press 2009
This points out that ignoring stopwords is opaque to users, who expect if they type "a", "an", and "the", that the search engine will find them.  From the book: In a famous example in the early days of Web search, a searcher who typed “to be or not to be” in a search engine would be shocked to be served empty results. In 1996, a review of eight major search engines found that only AltaVista could handle the Hamlet quote; all others ignored stopwords (Peterson, 1997, Sherman, 2001).  Web search engines have since found ways to index everything.
Autonomy's "Put FAST in the Past" Rescue Program
Microsoft will only be developing FAST for Windows in the future, and will be cutting support for Unix and Linux versions.  Autonomy is aggressively marketing to those customers. "Autonomy will match an organization's Microsoft FAST license implementation with like-for-like capability on all platforms – for 50% of the organization's original license fee for orders placed before December 31. Autonomy will provide conceptual search and a Sharepoint connector free of charge. Autonomy's Microsoft FAST to IDOL migration tool will index an organization's data, enabling a seamless migration. Autonomy IDOL Enterprise Search can be used transparently from within Microsoft applications including Word, SharePoint, etc., providing end users with a seamless and easy transition."
Search implementation maturity level
A set of useful measures to classify the sophistication of a search implementation, and clarify what steps it would take to move up from one level to the next. I have some quibbles about the exact order of steps, but I really like the overall approach.  
WSDM2011 conference (2011-2-9)
Web Search and Data Mining - international ACM conference, Hong Kong, during February 9-12, 2011.
Celebros - search, navigation & analytics solution for online stores
Concept-based semantic e-commerce search.  I haven't tested it yet.
Nextopia e-commerce search
Offers search for online catalogs, with images and faceted search options.
Lucene Java 3.0.3 and 2.9.4 (bug fix releases)
Bugfixes for Lucene Java 2.x (old branch) and 3.x (new trunk).
Constellio | Open Source Enterprise Search
Constellio is transitioning from a closed to open-source search engine, based on Lucene/Solr and compatible with Google Enterprise Connector Manager.  
Contegra Systems | Services
A systems integrator for information management, they work with several search engines including dtSearch, Exalead and FAST.
MaxxCAT - Enterprise Search Appliances
Hardware-software combination designed for easy connection to data sources, lightweight JSON API, scalability to hundreds of millions of items with fast response and high availability.  These are significantly cheaper than the Google Mini and GSA, and the licenses don't time out.
Fusing Enterprise Search and Social Bookmarking - MIKE2.0
More practical than most social search proposals, this treats public bookmarks as a form of metadata to be included in relevance and results display.  It also has a note about the value of weak social ties in diffusing information beyond one's normal circle.
Norconex - Enterprise Search Experts (Ottawa)
Enterprise search consultancy in Canada, also provides a rich search analytics application.  They work with Attivio, Autonomy, Coveo, Endeca, Exalead, Lucene, Microsoft Sharepoint Search.
Yippy » Site Search service (formerly Vivisimo Hosted version)
Free site search service, cloud-hosted, though it's not clear what they will do once Ask.com stops crawling the web.
Banckle Hosted Site Search Engine
Interesting beginnings of a cloud-hosted remote search service, but it's still really in the alpha stage.  The admin interface is all in Flash, not a lot of features in it yet.   In addition to site search, Banckle is offering chat, email, remote access and file sharing apps, so maybe in competition with Google Apps and Zoho.  
Search Interface Inspector - systematic search UI evaluation
Attempts to quantify search user interface quality, based on a standard array of users and information needs.  I'm not sure about those assumptions, but it's certainly interesting.
Robert Capra UNC Home Page
Interesting researcher
Enterprise Search Meetup: exploratory search, TravelMatch and Stephen Arnold
Interesting write-up of the search meetup
Extracting User Interaction Information from the Transaction Logs of a Faceted Navigation OPAC
Search session analysis, looking at sequences of search actions in the NCSU library catalog, which was then running on Endeca.  They found a gratifying amount of facet clicking.
Findability by Findwise | Search Driven Solutions
Enterprise and customer service search consultancy with offices in Sweden, Norway, and Denmark.


For earlier news, see the 2009, 2008, 2007, 2006, 2005, 2004, 2003, 2002, 2001, 2000, 1999 and 1998 news archive pages

Last Update: 2010-12-09