The British Library and IBM are working together on the UK Web Archive, which will store all accessible UK web pages, giving researchers a rich repository of British academic work, e-commerce, opinion, and popular culture that may change radically or disappear without notice.
IBM is providing software expertise and using the archive as a testbed for text-mining Big Data, estimating that it will grow by 220 terabytes per year as of 2011. BigSheets (presumably a pun on BigTable) includes both open- and closed-source software. IBM has shown various interfaces, including spreadsheets, tag clouds, and multi-bubble charts.
On your site, intranet or enterprise search engine, what happens if a search engine finds no match for the search terms?
Below the link are two different approaches, one slightly verbose and the other so terse as to be baffling. Look at them, look at yours, look at my page on good things to do with the no-matches page, and see if there's something you can do better.
Readers: if you have any good or bad examples, or "before" and "after" screenshots, link me to them, please! I'll post the best ones, by which I mean both good helpful interfaces and really awful ones.
This points out that ignoring stopwords is opaque to users, who expect the search engine to find even the words "a", "an", and "the" if they type them.
In a famous example in the early days of Web search, a searcher who typed “to be or not to be” in a search engine would be shocked to be served empty results. In 1996, a review of eight major search engines found that only AltaVista could handle the Hamlet quote; all others ignored stopwords (Peterson, 1997, Sherman, 2001).
Web search engines have since found efficient ways to index and store even stopwords, because they are so valuable once in a while; smaller search engines should follow their lead.
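To see how drastic naive stopword stripping can be, here's a toy sketch (mine, not from the book) of what such an engine does to the Hamlet quote before it ever touches the index:

```python
# Illustrative sketch: naive stopword removal can empty out a famous
# query before it is ever matched against the index.
STOPWORDS = {"a", "an", "and", "be", "in", "not", "of", "or", "the", "to"}

def strip_stopwords(query):
    """Return the query terms a stopword-stripping engine would keep."""
    return [t for t in query.lower().split() if t not in STOPWORDS]

print(strip_stopwords("to be or not to be"))       # -> [] : nothing left to search
print(strip_stopwords("the taming of the shrew"))  # -> ['taming', 'shrew']
```

Every term of the quote is a stopword, so the engine is left with an empty query and, to the user, an inexplicable empty result page.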
Posted from Diigo. The rest of my favorite links are here.
The Speller Challenge - build the best speller that proposes the most plausible spelling alternatives for each search query. It uses the TREC 2008 Million Query Track for training and the Bing Test Dataset for evaluation. The first prize is $10,000 and the gratitude of the orthographically-challenged.
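For the curious, an entry might start from something like the classic one-edit candidate generator popularized by Peter Norvig's spelling-corrector essay; this sketch is illustrative only, not an actual Challenge submission:

```python
# Hedged sketch: generate every string one edit away from a word
# (deletion, transposition, substitution, insertion). A real speller
# would then rank these candidates with a language model.
import string

def edits1(word):
    """All strings one edit away from `word`."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in letters]
    inserts = [l + c + r for l, r in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

print("speling" in edits1("spelling"))  # a one-deletion neighbor
```

Ranking the candidates against query logs like the TREC Million Query Track is where the actual contest would be won or lost.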
An open-source project for connecting to Documentum, FileNet, LiveLink, Meridio, JDBC, Windows file shares, and SharePoint. It's good for search engines that need to index content from these repositories, and it includes code for Solr indexing.
Definitions of semantic technology and descriptions of applications such as recommender systems, especially for encouraging longer visits to newspaper and content sites. I have yet to see evidence that natural language search is particularly useful, but enriched retrieval and relevance is always good.
The "context" file is an XML configuration file for Google Custom Search Engines -- the same configuration as in the browser Control Panel, but flattened out so it's easier to generate programmatically and to add a bunch of repeated settings. I also use this file for version control.
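Since the whole point of the flattened file is programmatic generation, here's a rough Python sketch of stamping out repeated label settings with ElementTree. Caveat: the element and attribute names below are from memory and purely illustrative; check the current CSE schema before relying on them.

```python
# Hedged sketch: generating repeated settings in a CSE-style context XML
# file. Tag and attribute names here are illustrative, not a guaranteed
# match for Google's current schema.
import xml.etree.ElementTree as ET

def build_context(title, labels):
    """Build a context document with one <Label> per (name, mode) pair."""
    root = ET.Element("CustomSearchEngine")
    ET.SubElement(root, "Title").text = title
    ctx = ET.SubElement(root, "Context")
    bg = ET.SubElement(ctx, "BackgroundLabels")
    for name, mode in labels:  # the repeated settings mentioned above
        ET.SubElement(bg, "Label", name=name, mode=mode)
    return ET.tostring(root, encoding="unicode")

xml_text = build_context("My CSE", [("_cse_include", "FILTER"),
                                    ("_cse_exclude", "ELIMINATE")])
print(xml_text)
```

Generated this way, the file also diffs cleanly, which is what makes it pleasant to keep under version control.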
Bug fixes and new features in Solr include Polish-language stemming, fixes for shard problems, sorting fixes, spatial search, and a fix for a memory bug. The post also links to an interesting discussion of near-real-time search.
HathiTrust is using Lucene/Solr to index a huge digital library that keeps getting bigger. The blog covers both user-oriented features and the logistical and tactical challenges of dealing with billions of tokens, in clear language rather than technical jargon.
Microsoft will develop FAST only for Windows in the future, and will cut support for the Unix and Linux versions. Autonomy is aggressively marketing to those customers.
"Autonomy will match an organization's Microsoft FAST license implementation with like-for-like capability on all platforms – for 50% of the organization's original license fee for orders placed before December 31.
Autonomy will provide conceptual search and a SharePoint connector free of charge
Autonomy's Microsoft FAST to IDOL migration tool will index an organization's data, enabling a seamless migration
Autonomy IDOL Enterprise Search can be used transparently from within Microsoft applications including Word, SharePoint, etc., providing end users with a seamless and easy transition"
A set of useful measures to classify the sophistication of a search implementation, and clarify what steps it would take to move up from one level to the next. I have some quibbles about the exact order of steps, but I really like the overall approach.
A hardware-software combination designed for easy connection to data sources, with a lightweight JSON API and scalability to hundreds of millions of items with fast response and high availability. The appliances are significantly cheaper than the Google Mini and GSA, and the licenses don't time out.
More practical than most social search proposals, this treats public bookmarks as a form of metadata to be included in relevance and results display. It also has a note about the value of weak social ties in diffusing information beyond one's normal circle.
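A back-of-the-envelope sketch (my own, not from the proposal) of treating bookmarks as a relevance signal: add a damped, logarithmic boost so heavily bookmarked pages rise without swamping the base text ranking.

```python
# Illustrative sketch: fold a public bookmark count into a text-relevance
# score as a sublinear boost. The weight is a made-up tuning parameter.
import math

def boosted_score(text_score, bookmark_count, weight=0.2):
    """Combine a base relevance score with a damped social signal."""
    return text_score + weight * math.log1p(bookmark_count)

# A page bookmarked 100 times gets a modest, sublinear lift:
print(round(boosted_score(1.0, 0), 3))    # 1.0
print(round(boosted_score(1.0, 100), 3))  # 1.923
```

The log damping is the point: going from 0 to 10 bookmarks matters more than going from 100 to 110, which mirrors how diffusion through weak ties actually spreads attention.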
Interesting beginnings of a cloud-hosted remote search service, but it's still really at the alpha stage. The admin interface is all in Flash and doesn't have many features yet. In addition to site search, Banckle is offering chat, email, remote-access, and file-sharing apps, so it may be competing with Google Apps and Zoho.