|
Current Search Tools News
CiteSeer is a nifty free service that indexes and searches academic papers (mostly in CS and various Information Sciences). It's been doing Automated Citation Indexing for a decade, linking cited and citing papers, which turns out to be extremely valuable for research and area studies.
Now, Professor Lee Giles and his students at Pennsylvania State University have rebuilt the system from scratch and are sharing it as SeerSuite (currently beta 0.1) under an open source Apache license. Even a smallish digital library can take advantage of the automated metadata extraction and citation linking, with the reliable Lucene search engine underneath.
They're also combining technologies (from OCR to machine learning) and reverse-engineering data from PDF documents. This includes extracting captions and numbers from tables, chemical formulae and molecular structures, mathematical equations, and 2D graphs, and storing them in various standard markup formats. This information is not metadata; it's source data, and it is incredibly valuable for avoiding duplication, allowing reproduction of experiments, and taking that data in directions the original researchers did not expect. It brings the research into the Semantic Web, where there are tools just waiting for data like this.
I wrote a bit more about CiteSeerX and SeerSuite in InfoToday, and there's more information at the CiteSeerX site. Please comment, if you care to, on the blog entry. - December 16, 2008
Wikipedia, and particularly the related sites running the software released as MediaWiki, have some of the worst site search I have ever seen. The default installation's query processing is absurdly limited, the retrieval is crippled by bad settings, the relevance is unclear, and the results page is not just ugly but contradictory and confusing. I will be posting more detailed analyses supporting each of these statements. Read more about this analysis or leave a comment - October 2008
Even worse than the intrusive markup symbols, the MediaWiki Search highlights substrings even though the search only retrieves on whole words. It also highlights words that are not indexed and are never part of the search query: those with fewer than four characters, and words on the very large stopword list. This is just wrong, and the main Wikipedia has (recently) fixed it. However, this wrong default behavior can show up on sites running even the most recent MediaWiki version (1.14a). Read more or leave a comment - Oct. 24, 2008.
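To make the mismatch concrete, here is a minimal Python sketch (not MediaWiki's actual code) that contrasts substring highlighting with highlighting that follows the same rules as retrieval. The function names, the `**` markers, and the `min_len` and `stopwords` parameters are assumptions made up for this illustration.

```python
import re

def highlight_substrings(text, terms):
    """Naive highlighting: marks a term anywhere, even inside longer words."""
    if not terms:
        return text
    pattern = "|".join(re.escape(t) for t in terms)
    return re.sub(pattern, lambda m: f"**{m.group(0)}**", text, flags=re.IGNORECASE)

def highlight_whole_words(text, terms, min_len=4, stopwords=frozenset()):
    """Highlight only whole-word matches of terms the index would search on."""
    # Drop terms the (hypothetical) index ignores: too short, or stopwords.
    searchable = [t for t in terms
                  if len(t) >= min_len and t.lower() not in stopwords]
    if not searchable:
        return text
    pattern = r"\b(?:" + "|".join(re.escape(t) for t in searchable) + r")\b"
    return re.sub(pattern, lambda m: f"**{m.group(0)}**", text, flags=re.IGNORECASE)
```

With the text "The category page lists cats and a cat." and the term "cat", the first function marks "category" and "cats" as well, while the second marks only the standalone word (and nothing at all under a four-character minimum).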
MediaWiki's site search does display the search terms matched in the article, with some extracted text from the area around the term, so searchers can understand the context of the match. (For more information, see Matching Search Terms In Context.)
But most search engines remove the page markup (HTML or other) before saving the page, or at least before displaying the results with the match terms in context. MediaWiki Search does not do this, so results will show not just the text from the page, but also hidden text (such as in graphic file names) and markup symbols. Read more or leave a comment - Oct. 21, 2008.
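A rough Python sketch of the usual approach, stripping markup first and then extracting the match in context, might look like the following. The regular expressions cover only a few common wiki and HTML constructs and are assumptions for illustration; real wikitext needs a proper parser, and none of this is MediaWiki's own code.

```python
import re

def strip_markup(wikitext):
    """Very rough cleanup of wiki and HTML markup before building snippets."""
    # Drop embedded images entirely, including their file names.
    text = re.sub(r"\[\[(?:Image|File):[^\]]*\]\]", " ", wikitext, flags=re.IGNORECASE)
    # Turn [[target|label]] or [[target]] into just the visible label.
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", text)
    # Remove HTML tags.
    text = re.sub(r"<[^>]+>", " ", text)
    # Remove bold/italic quote runs and heading markers.
    text = re.sub(r"'{2,}|={2,}", "", text)
    # Collapse whitespace left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()

def snippet(text, term, width=30):
    """Return the text around the first whole-word match of term, or None."""
    match = re.search(r"\b" + re.escape(term) + r"\b", text, flags=re.IGNORECASE)
    if match is None:
        return None
    start = max(0, match.start() - width)
    end = min(len(text), match.end() + width)
    return text[start:end]
```

Running `snippet(strip_markup(page), term)` keeps file names and markup symbols out of the context shown to the searcher.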
The MediaWiki search results page does not actually say how many articles match the search terms. It has confusing and unlabeled number links for results page navigation, and results-per-page setting, right next to each other, but no total for results. Read more or leave a comment - Oct. 17, 2008.
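For comparison, a labeled result count is easy to produce. This small Python sketch shows one way to build it; the function name and the wording of the label are assumptions, not anything from MediaWiki.

```python
def results_summary(total, page, per_page=20):
    """Build a label such as 'Results 21-40 of 137' for a 1-based page number."""
    if total == 0:
        return "No results found"
    first = (page - 1) * per_page + 1
    if first > total:
        raise ValueError("page is past the last result")
    last = min(page * per_page, total)
    return f"Results {first}-{last} of {total}"
```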
The default behavior of the MediaWiki search engine is to find only pages which match every word in the search query (find all). When there are quotes around terms in the query, it will only find pages with those terms as a phrase, which is nice. Unlike other search engines, however, it offers no way to override this behavior. No options are available to exclude terms from search results, search for several synonyms, automatically use plural word matching, or find substrings using truncation or wildcards. More... or leave a comment - Oct. 16, 2008.
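The "find all" and quoted-phrase behavior described here can be sketched in a few lines of Python. This is a toy model written for illustration, with hypothetical function names, and is not how MediaWiki implements it. Like the behavior described, it has no exclusion, synonym, plural, or wildcard matching.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

def parse_query(query):
    """Split a query into quoted phrases (token lists) and single words."""
    phrases, words = [], []
    for phrase, word in re.findall(r'"([^"]+)"|(\S+)', query):
        if phrase:
            phrases.append(tokenize(phrase))
        else:
            words.extend(tokenize(word))
    return phrases, words

def contains_phrase(tokens, phrase):
    """True if the phrase tokens appear consecutively in tokens."""
    n = len(phrase)
    return any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))

def matches_all(text, query):
    """True only if every word and every quoted phrase appears in the text."""
    tokens = tokenize(text)
    token_set = set(tokens)
    phrases, words = parse_query(query)
    return (all(w in token_set for w in words)
            and all(contains_phrase(tokens, p) for p in phrases))
```

Note that a query for `engine` does not match a page containing only `engines`, which is exactly the kind of miss that plural matching or truncation would fix.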
Stopwords are supposedly words that don't make sense in searches, words that are on many pages and therefore are "noise" in search results. They are generally short, such as a, an, the. However, the MediaWiki search excludes 547 words as stopwords. These are perfectly good words, and by ignoring them, the search engine fails in many cases where it should find results. See more details or leave a comment - Oct. 14, 2008.
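The effect of an overly long stopword list can be shown with a toy Python search. The list below is a tiny hypothetical stand-in: the extra entries ("first", "second", "known") are examples of ordinary words chosen for this sketch, not entries confirmed to be on the actual list.

```python
import re

# Hypothetical stopword list: three classic short words plus ordinary words.
STOPWORDS = frozenset({"a", "an", "the", "first", "second", "known"})

def index_terms(text, stopwords=STOPWORDS):
    """Return the set of words in text that are not stopwords."""
    return {t for t in re.findall(r"\w+", text.lower()) if t not in stopwords}

def search(pages, query, stopwords=STOPWORDS):
    """Return titles of pages containing every query word that is not a stopword."""
    terms = index_terms(query, stopwords)
    if not terms:
        return []  # every query word was discarded, so nothing can match
    return [title for title, body in pages.items()
            if terms <= index_terms(body, stopwords)]
```

With this list, a search for "first" finds nothing even when pages contain the word, while the same search with only a, an, and the as stopwords finds them.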
The MediaWiki search engine defaults to a four-letter minimum word length. Seriously. Not only will it not search for one-, two-, or three-letter words, they're not even in the index: they are completely unfindable. There is no way to search for perfectly reasonable words like fan, lab, qi, or pH. The bartendersdatabase.com wiki can't find rum while Google finds 731 on that same site (though some are duplicates). See more details or leave a comment - Oct. 13, 2008.
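A minimal Python sketch of an index with a minimum word length shows why short words become unfindable: they are dropped at indexing time, so no query can ever reach them. The function names and sample pages are made up for this example and do not reflect MediaWiki's internals.

```python
import re
from collections import defaultdict

def build_index(pages, min_len=4):
    """Map each word of at least min_len characters to the titles containing it."""
    index = defaultdict(set)
    for title, body in pages.items():
        for word in re.findall(r"\w+", body.lower()):
            if len(word) >= min_len:  # shorter words never enter the index
                index[word].add(title)
    return index

def lookup(index, word):
    """Return the sorted titles indexed under word, or an empty list."""
    return sorted(index.get(word.lower(), set()))
```

Building the index with `min_len=3` instead makes "rum" findable again, at the cost of a somewhat larger index.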
For more news about search engines, see the News page.
|
|
Good question! If you have serious content, a site or intranet search engine will allow your visitors to jump directly to the topic they want. More...
|
|
The guide will help you learn more about site searching. Or try the remote search services on this site.
|
|
Includes articles and books providing general site search information, and those with more specific product reviews.
|
|
Alphabetical List or divided by platform: Java, Mac, Perl, Unix, and Windows; also Remote Search Hosting Services, Code Libraries, and Open Source Search Engines |
|
Multimedia Search, Faceted Metadata Search, PDF and Web Site Search, Intranets and EIPs, XML and Search, Information Architecture, Web Indexing Robot Spiders, and more. |
|
|
This site provides information, news and advice about web site searching technology. It is maintained by Avi Rappoport, associate AJ Summers and various Search Tools Consulting interns as a service to the Web community. We welcome your comments and suggestions: just contact us.
We are also available for search tools needs analysis, competitive analysis, search tools installation and more: for information see the Consulting page or use our contact form.
Disclosure: Search Tools Consulting presents the SearchTools.com site as a free service to the web development community and is not sponsored by any advertisers. Search Tools Consulting also provides analysis and information to sites and institutions installing search engines, and to some search engine developers. We do not give them personal information from site visitors or surveys, and we do not allow our relationships with any vendors to change any product review or analysis. |
|
The SearchTools blog on LiveJournal provides an opportunity for you to tell me what you think about enterprise search tools for web sites and intranets, and about the SearchTools.com web site. You do not have to have an account to post; you can reply anonymously. All comments are screened, so there will be no blog spam.
Leave a comment
|
|
|