As of January 2012, this site is no longer being updated, due to work and health issues.

SearchTools.com

Source Code for Web Robot Spiders


Robots (also known as spiders, wanderers, worms, crawlers, gatherers, intelligent agents) follow links from one web page to another. They work with indexing code to store data for later searching.
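As a rough illustration of the idea (not taken from any of the packages listed on this page), the sketch below follows links breadth-first and feeds each page's words into a small inverted index for later searching. It uses only the Python standard library. The names `LinkAndTextParser`, `crawl`, `fetch`, and `PAGES`, and the `example.test` URLs, are hypothetical; `fetch` stands in for whatever HTTP code a real robot would use, and here it simply reads from an in-memory dictionary so the example runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkAndTextParser(HTMLParser):
    """Collects href targets and visible words from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.words = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        self.words.extend(data.lower().split())


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl from start_url; returns {word: set of URLs}.

    `fetch` is a hypothetical callable mapping a URL to an HTML string,
    or to None when the page cannot be retrieved.
    """
    index = {}
    seen = {start_url}          # URLs already queued, so none is visited twice
    queue = deque([start_url])
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        fetched += 1
        parser = LinkAndTextParser()
        parser.feed(html)
        parser.close()
        # Indexing step: record which pages contain each word.
        for word in parser.words:
            index.setdefault(word, set()).add(url)
        # Link-following step: resolve relative links, drop #fragments.
        for href in parser.links:
            target, _fragment = urldefrag(urljoin(url, href))
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return index


# Tiny in-memory "site" (an assumption for the demo, not real pages).
PAGES = {
    "http://example.test/": '<a href="a.html">A</a> home page',
    "http://example.test/a.html": '<a href="/">back</a> spiders crawl',
}

index = crawl("http://example.test/", PAGES.get)
```

Looking up `index["spiders"]` then gives the set of pages containing that word. A production robot would also need politeness delays, error handling, and robots.txt checks, which the packages below handle in their own ways.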

Robot Source Code

A good deal of free, open source code is available, so you don't have to start from scratch. Take a look at the options below, grouped by programming language, and pick the language best suited to your needs. If you'd rather contract out your robot development, see the Robots Consultants page.

Useful Links

Perl

Harvest NG
The Gatherer module is the robot which follows the links
Combine Harvesting Robot
Powerful and flexible robot control
Libwww (Perl 5) and Libwww (Perl 4)
Perl modules for accessing Web pages, including some examples of following links.
Agent Perl (WebReview.com, August 29, 1997, by Ben Smith)
Nice tutorial about writing a search indexing spider or robot using Libwww.
MOMspider (Multi-owner Maintenance Spider)
Designed for checking links on multiple servers.
WWW-Robot 0.021 (alternate 0.011 version)
Configurable web traversal engine

Java

Class Acme.Spider
A web robot that performs a breadth-first crawl and returns URLConnections. Written by the inimitable Jef Poskanzer.
 
 
Writing a Web Crawler in the Java Programming Language (Java Developer Connection, January 1998, by Muscle Fish developers)
Describes an example program that follows links to fetch files and keeps track of those already found. Honors robots.txt. Source code is available.
 
BDDBot
Java robot / search engine / web server
 
NQL (Network Query Language) Java version
 
SPHINX: A Framework for Creating Personal, Site-Specific Web Crawlers
Sophisticated article from the WWW7 conference about the issues involved in robot crawling. The implementation is in WebSPHINX.
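Several of the robots listed here are described as honoring robots.txt. As a minimal sketch of what that check involves, the example below uses Python's standard `urllib.robotparser` module rather than any of the Java code above. The robots.txt contents, the `ExampleBot` user agent, the `make_checker` helper, and the `example.test` URLs are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents; a real robot would download this
# from the /robots.txt path of each site before crawling it.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def make_checker(robots_text, user_agent):
    """Return a function that reports whether user_agent may fetch a URL."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())

    def allowed(url):
        return parser.can_fetch(user_agent, url)

    return allowed


allowed = make_checker(ROBOTS_TXT, "ExampleBot")
print(allowed("http://example.test/index.html"))      # outside /private/
print(allowed("http://example.test/private/x.html"))  # under /private/
```

A crawler would call `allowed(url)` before each fetch and skip any URL for which it returns false.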

C and C++

W3C Webbot - Libwww Robot
HTTP robot source code in C based on "Libwww", primarily designed to test HTTP/1.1 pipelining, but usable for other purposes.
ht://Dig
Full-featured search engine in C++ that contains a sophisticated robot.
 
SWISH-E
Another full search engine with a robot spider.
 
Pavuk
A program designed to copy entire sites by following links and gathering the pages. Implemented with an interface for Mac OS X Server as epicware WebGrabber.
 
Pre-emptive Multithreading Web Spider (MFC Programmer's SourceBook article, June 21, 1998, by Sim Ayers)
Tutorial article, with a lot of explanation, on making a spider in MFC.

Other

TkWWW Robot
Robot code in Tcl/Tk

Commercial Products

Tenmax Dataplex Robot
High-capacity web spider that can handle millions of pages per day, complex HTML, and even JavaScript.
Page Updated 2001-10-22