
Guide to Search Tools

A First Taxonomy of Search Log Junk


Search logs contain a lot of weird things, and some of them can significantly affect search log analysis. Having looked at tens of thousands of lines of search log entries, I offer this first attempt at defining some of the weirdest and least useful kinds of log entry, which I call "Search Log Junk" (inspired by Edward Tufte's "chart junk"). Here are the types of logjunk that I've seen most frequently.

Empty Queries
Queries without any query text or usable parameters can appear when people think the "Search" button is important in and of itself. Or perhaps the search box is the first form field on the page, the cursor lands in it, and users press Return. These are often sent from the home page, according to the referer fields I've seen.

shaunryanz on LiveJournal added: If you have some text in your search box - like "enter your search here" then that will inevitably be one of the most popular queries in your search logs and should be treated the same as an empty query.

The first thing is to make sure that the search engine is doing something reasonable in this case. Some database search engines default to returning every item in the table, in entry order. Reasonable responses include bringing up a helpful search page, adding a script that shows an error dialog, or adding a script that ignores the empty query. I'm leaning towards the last option.

I've found only a couple of ways to use this information. The empty queries are still useful for traffic and response time metrics, and I think it's useful to check the top referring pages occasionally. A large number of empty queries from a page deep within a site may indicate some navigation problems. But otherwise, they are logjunk.
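
To make that concrete, here is a minimal sketch of how empty queries could be flagged and their referring pages tallied. It assumes each log entry has already been parsed into a (query, referrer) pair; the placeholder string and the function names are hypothetical, so substitute whatever your own search box and log format use.

```python
from collections import Counter

# Hypothetical placeholder strings; use whatever text your own search box shows.
PLACEHOLDER_TEXTS = {"enter your search here"}


def is_empty_query(query):
    """Return True for blank queries and for unedited placeholder text."""
    # Collapse whitespace and lowercase so "  Enter Your Search Here " matches.
    normalized = " ".join((query or "").split()).lower()
    return normalized == "" or normalized in PLACEHOLDER_TEXTS


def top_empty_query_referrers(entries, limit=10):
    """Count referring pages for empty queries.

    entries is an iterable of (query, referrer) pairs.
    """
    counts = Counter(ref for query, ref in entries if is_empty_query(query))
    return counts.most_common(limit)
```

A deep page showing up near the top of that list would be the navigation warning sign mentioned above.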

Repeat Queries
These are multiple identical queries sent to the search engine from the same IP address or user ID. My best guess is that the client is automatically requesting a refresh. My favorite example was thousands of queries over months for two dots: "..".

Lee Romeo on SearchCOP has found complex automated queries, possibly from off site, which return no matches.

Again, these are useful for traffic metrics and possibly for identifying really weird incoming links. If these queries return no matches and repeat many times, they will distort your no-matches analytics. They are really logjunk, and it's reasonable to remove them from the search analytics database so you can concentrate on the real data. You may also want to ban the offending IP address.
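
One way to find these is to count identical (client, query) pairs and flag the ones above a cutoff. The sketch below assumes entries are (client, query) pairs, where the client is an IP address or user ID; the threshold of 50 is an arbitrary assumption, and a busy shared proxy could trip it with perfectly legitimate popular queries, so review the flagged pairs before deleting anything.

```python
from collections import Counter


def find_repeat_queries(entries, threshold=50):
    """Return {(client, query): count} for pairs seen at least threshold times.

    entries is an iterable of (client, query) pairs. The threshold is an
    assumed cutoff, not a recommended value.
    """
    counts = Counter((client, query.strip().lower()) for client, query in entries)
    return {pair: n for pair, n in counts.items() if n >= threshold}


def drop_repeat_queries(entries, threshold=50):
    """Return the entries with every flagged repeat pair removed."""
    entries = list(entries)
    repeats = find_repeat_queries(entries, threshold)
    return [(c, q) for c, q in entries if (c, q.strip().lower()) not in repeats]
```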

Robot crawlers
Having search engines and intelligent agents crawl your search results may be a good thing. Incoming links are always good, even on intranets, and it may be that your site's search results page for emerald green widgets is number one in webwide search results and drives good traffic. However, other robots may be wasting your search engine's cycles: for those, a combination of robots.txt and banning their IP addresses will help.
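
As a rough illustration of the robots.txt half of that, the sketch below uses Python's standard urllib.robotparser module to check what a set of rules would tell a compliant crawler. The /search path, the rules themselves, the ExampleBot name and the example.com URLs are all assumptions for illustration. Robots that ignore robots.txt are unaffected, which is where the IP ban comes in.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content that keeps compliant crawlers off /search.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
"""


def crawler_may_fetch(robots_txt, user_agent, url):
    """Return True if these robots.txt rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

With these rules, a compliant crawler is told to stay away from any URL whose path starts with /search, while the rest of the site stays crawlable.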

s_olive on searchdev points out that Google is now sending queries to search engines and other forms in an attempt to find new URLs, as per their announcement. Any sites which allow this can benefit from additional traffic, but the clicks will come in as the same query, over and over. Checking the results of these queries allows you to make sure people will see useful content, but otherwise it's more of an incoming link than a query.

Server hacks
Search engines are probed with the same standard parameters used to attack web servers, such as "phpmyadmin". They may also be subject to buffer overflow and other attacks, so they should be included in standard website security audits and checklists.

Lee Romeo on SearchCOP adds that he's seen very long random text pasted into the query box. I suspect this is a buffer overflow attack, and if there are a lot of them, they will skew metrics on average query length.
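
To see how much those oversized entries skew the average, you could compare the mean query length with and without them. This is only a sketch: the 200-character cutoff is an assumption I picked for illustration, and the function names are hypothetical.

```python
from statistics import mean

# Assumed cutoff; pick a value that suits the queries your own users send.
MAX_QUERY_LENGTH = 200


def split_by_length(queries, max_length=MAX_QUERY_LENGTH):
    """Split queries into (normal, oversized) lists by character count."""
    normal = [q for q in queries if len(q) <= max_length]
    oversized = [q for q in queries if len(q) > max_length]
    return normal, oversized


def average_length(queries):
    """Mean query length in characters, or 0.0 for an empty list."""
    return mean(len(q) for q in queries) if queries else 0.0
```

Keeping the oversized list, rather than discarding it, leaves something to hand to whoever runs the security audit.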

Search form "guestbook" spam
There are automated advertising services that insert fake comments with URLs into form fields, guestbooks, blogs and wikis (and there's a Wikipedia page about them). Many of them do the same with search fields, which explains why logs contain bizarre queries with spaces, HTML formatting and URLs in them.

For sites with light search traffic, these meaningless entries can cause problems with both traffic metrics and top query listings. Even for sites with thousands of queries per day, they can distort statistics about the average length of query, so removing them from your analysis database is a good idea.

It's fairly easy to identify these queries with simple regular expressions looking for href, http and .com. I haven't heard of any search engines that filter these out before reporting, though some may be doing it without bothering their customers about it.
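
A minimal sketch of such a check, using only the three markers named above, might look like this. The pattern and function names are hypothetical, and the match is deliberately crude: a real query such as "http error 404" would be flagged too, so it is worth eyeballing what gets caught before removing it.

```python
import re

# Matches the three markers mentioned above: href, http and .com.
SPAM_PATTERN = re.compile(r"href|http|\.com", re.IGNORECASE)


def looks_like_form_spam(query):
    """Return True if the query contains any of the spam markers."""
    return SPAM_PATTERN.search(query) is not None


def remove_form_spam(queries):
    """Return only the queries that do not look like form spam."""
    return [q for q in queries if not looks_like_form_spam(q)]
```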

Internal testing queries
For light traffic sites, any kind of automated testing, or even heavy manual testing, can change the search log significantly -- especially given how quickly the Long Tail shows up. Remove queries from testers by user ID or IP address to look at real user data.
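
Here is one way that filter could be sketched, assuming each log entry has been parsed into a dictionary with optional "user_id" and "ip" keys. The tester IDs and the IP address below are made-up examples, as are the key and function names.

```python
# Hypothetical tester identities; replace with your own QA accounts and hosts.
TESTER_USER_IDS = {"qa-bot", "tester1"}
TESTER_IPS = {"192.0.2.10"}


def exclude_tester_entries(entries, user_ids=TESTER_USER_IDS, ips=TESTER_IPS):
    """Return the entries that match neither a tester user ID nor a tester IP.

    Each entry is a dict; missing "user_id" or "ip" keys are treated as
    not matching any tester.
    """
    return [
        entry
        for entry in entries
        if entry.get("user_id") not in user_ids and entry.get("ip") not in ips
    ]
```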


Page Updated 2008-05-13

SearchTools.com - Copyright © 2008 Search Tools Consulting
This work is provided under a Creative Commons Sampling Plus 1.0 License.