As of January 2012, this site is no longer being updated due to work and health issues.
Robots (also known as spiders, wanderers, worms, crawlers, gatherers, or intelligent agents) follow links from one web page to another. They work with indexing code to store page data for later searching.
There is a good deal of free, open-source code available -- you don't have to start from scratch. Take a look at the options below, in the programming language best suited to your needs. If you'd like to contract out your robot work, see the Robots Consultants page.
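To make the link-following idea concrete, here is a minimal breadth-first robot sketched in Python. The in-memory SITE dictionary and its page names are made up for illustration and stand in for real HTTP fetches; a real robot would request each URL over the network instead.

```python
from collections import deque
from html.parser import HTMLParser

# Toy "site": a mapping from URL to HTML, standing in for real HTTP fetches.
SITE = {
    "/index.html": '<a href="/a.html">A</a> <a href="/b.html">B</a>',
    "/a.html": '<a href="/b.html">B</a> <a href="/index.html">home</a>',
    "/b.html": '<a href="/missing.html">gone</a>',
}

class LinkExtractor(HTMLParser):
    """Collects the href attribute of every <a> tag seen."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start):
    """Breadth-first traversal, remembering pages already visited."""
    visited = set()
    queue = deque([start])
    order = []
    while queue:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        page = SITE.get(url)      # a real robot would fetch over HTTP here
        if page is None:
            continue              # dead link; a link checker would report it
        order.append(url)
        parser = LinkExtractor()
        parser.feed(page)
        queue.extend(parser.links)
    return order

print(crawl("/index.html"))  # → ['/index.html', '/a.html', '/b.html']
```

The visited set is what keeps the robot from looping forever on sites whose pages link back to each other, and the queue is what makes the traversal breadth-first rather than depth-first.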
Perl
- Harvest NG
- The Gatherer module is the robot that follows the links.
- Combine Harvesting Robot
- Powerful and flexible robot control
- Libwww (Perl 5) and Libwww (Perl 4)
- Perl modules for accessing web pages, including examples of following links.
- Agent Perl (WebReview.com, August 29, 1997, by Ben Smith)
- A tutorial on writing a search-indexing spider using Libwww.
- MOMspider (Multi-owner Maintenance Spider)
- Designed for checking links on multiple servers.
- WWW-Robot 0.021 (alternate 0.011 version)
- Configurable web traversal engine
Java
- Class Acme.Spider
- A web robot that performs a breadth-first crawl and returns URLConnections. Written by the inimitable Jef Poskanzer.
- Writing a Web Crawler in the Java Programming Language (Java Developer Connection, January 1998, by Muscle Fish developers)
- Describes an example program that follows links to fetch files while keeping track of those already found. It honors robots.txt, and source code is available.
- BDDBot
- Java robot / search engine / web server
- NQL (Network Query Language) Java version
- SPHINX: A Framework for Creating Personal, Site-Specific Web Crawlers
- A detailed paper from the WWW7 conference on the issues involved in robot crawling. The implementation is available as WebSPHINX.
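The robots.txt courtesy check performed by the Java Developer Connection example above can be sketched in Python with the standard urllib.robotparser module. The rules and URLs below are hypothetical; a real robot would download /robots.txt from each site it visits.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt body, as a polite robot might receive it.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# Check each URL before fetching it, as the example program above does.
print(rp.can_fetch("MyRobot", "http://example.com/public/page.html"))   # → True
print(rp.can_fetch("MyRobot", "http://example.com/private/page.html"))  # → False
```

Calling can_fetch before every request is the entire cost of being a well-behaved robot; skipping disallowed URLs keeps your spider off server operators' blacklists.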
C and C++
- W3C Webbot - Libwww Robot
- HTTP robot source code in C, based on Libwww; primarily designed to test HTTP/1.1 pipelining, but usable for other purposes.
- ht://Dig
- Full-featured search engine in C++, contains a sophisticated robot.
- SWISH-E
- Another full search engine with a robot spider.
- Pavuk
- A program designed to copy entire sites by following links and gathering the pages. It is also packaged with a Mac OS X Server interface as epicware WebGrabber.
- Pre-emptive Multithreading Web Spider (MFC Programmer's SourceBook, June 21, 1998, by Sim Ayers)
- A well-explained tutorial on building a multithreaded spider with MFC.
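The multithreaded design described in the MFC article above can be sketched in Python: worker threads pull URLs from a shared queue and record pages under a lock, so several fetches proceed concurrently. The fetch function here is a made-up stand-in for a real HTTP request.

```python
import queue
import threading

# Hypothetical stand-in for a network fetch; a real spider would issue an
# HTTP request here, which is where multithreading pays off.
def fetch(url):
    return f"<html>{url}</html>"

def worker(tasks, results, lock):
    """Drain the task queue, storing each fetched page under the lock."""
    while True:
        try:
            url = tasks.get_nowait()
        except queue.Empty:
            return
        page = fetch(url)
        with lock:
            results[url] = page
        tasks.task_done()

tasks = queue.Queue()
for u in ["/1", "/2", "/3", "/4"]:
    tasks.put(u)

results = {}
lock = threading.Lock()
threads = [threading.Thread(target=worker, args=(tasks, results, lock))
           for _ in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(results))  # → ['/1', '/2', '/3', '/4']
```

Because robot time is dominated by waiting on the network, even a handful of worker threads like this can multiply a spider's throughput; the lock matters only for the brief moment each result is stored.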
Tcl/Tk
- TkWWW Robot
- Robot code in Tcl/Tk.
- Tenmax Dataplex Robot
- A high-capacity web spider that can handle millions of pages per day, complex HTML, and even JavaScript.