HTML is the basic file format of the Web, but we found that half
of the sites in the survey serve some files that are neither
HTML nor plain text. Many sites serve cross-platform standard formats
such as PDF, PostScript and XML, while others serve office productivity
files, including Microsoft Word, PowerPoint, Excel and WordPerfect.
There was some confusion among our survey respondents about file
formats: some noted that they serve pages generated by server-side
processing (JSP or ColdFusion) or by backend databases. Most site
search engines can handle these because they are HTML pages by the
time they reach the client, whether it's a browser or an indexing
robot. But many of the formats below are true binary files and cannot
be read directly by browsers.
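The distinction above can be sketched in a few lines of Python. This is only an illustration, not how any particular search engine works: the helper names `media_type` and `directly_indexable` are hypothetical, and the set of "readable" types is an assumption taken from the HTML and plain-text cases mentioned in the text. The point is that an indexer sees the media type the server declares, not the technology that generated the page.

```python
# Hypothetical sketch: a text-only indexer deciding what it can read directly.
# Assumption: only HTML and plain text count as directly readable.
TEXT_TYPES = {"text/html", "text/plain"}


def media_type(content_type_header: str) -> str:
    """Return the bare media type from a Content-Type header value,
    dropping parameters such as charset."""
    return content_type_header.split(";", 1)[0].strip().lower()


def directly_indexable(content_type_header: str) -> bool:
    """True if the declared media type is one the indexer reads as text."""
    return media_type(content_type_header) in TEXT_TYPES


# A JSP or ColdFusion page is delivered as HTML, however it was generated.
print(directly_indexable("text/html; charset=ISO-8859-1"))
# Binary office and PDF files need a converter or a format-aware indexer.
print(directly_indexable("application/pdf"))
print(directly_indexable("application/msword"))
```

Under these assumptions the first call reports the page as indexable and the other two do not, which is why dynamically generated pages rarely cause trouble while office documents do.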
Some site search engines will index complex file formats: they
may serve them by sending them to the client and allowing the browser
to launch the creating application, or they may attempt to convert
them to HTML and serve them that way.
A few search engines will index image, audio and video file metadata,
such as the file name. Virage and Excalibur can index the multimedia
data itself, although this requires a significant investment in
time and resources.
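As a rough illustration of metadata-only indexing, the sketch below turns a file name into search terms. It is not taken from any of the products named above: the function `filename_terms`, the example path, and the choice to split on non-alphanumeric characters are all assumptions made for the example.

```python
import re
from pathlib import PurePosixPath


def filename_terms(url_path: str) -> list[str]:
    """Split a file name into lowercase search terms, ignoring the extension.

    Hypothetical helper: real engines may also index size, date,
    or embedded tags where the format provides them.
    """
    stem = PurePosixPath(url_path).stem  # file name without its last suffix
    return [t.lower() for t in re.split(r"[^A-Za-z0-9]+", stem) if t]


# Example path is invented for illustration.
print(filename_terms("/media/Annual_Report-1999.mov"))
# prints ['annual', 'report', '1999']
```

Indexing of this kind is cheap because the file's contents are never opened, but it is only as good as the names people give their files.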
Note that web-wide search engines such as AltaVista, Inktomi and
Google will not currently index anything beyond HTML and text files.
| Formats | without search | with search |
| --- | --- | --- |
| HTML | 414 | 259 |
| PDF | 171 | 152 |
| text | 147 | 119 |
| Word | 115 | 82 |
| PowerPoint | 57 | 65 |
| Excel | 60 | 56 |
| XML | 41 | 49 |
| PostScript | 20 | 29 |
| WordPerfect | 12 | 17 |
| Lotus 1-2-3 | 4 | 11 |
| Flash | 1 | 2 |
| multimedia | 1 | 2 |
| SGML | 0 | 2 |
| QuickTime | 1 | 1 |
| AVI | 1 | 1 |
| RTF | 1 | 1 |
| Zip | 0 | 1 |
| Brad | 0 | 1 |
| RealAudio | 0 | 1 |
| chemical formats | 0 | 1 |
| Applix | 0 | 1 |
| StarOffice | 0 | 1 |
| Quark | 0 | 1 |
| WordPro | 0 | 1 |
| MODCA-P | 0 | 1 |
| FFT | 0 | 1 |
| RFT | 0 | 1 |
| icl | 0 | 1 |
| compressed files | 0 | 1 |
| email files | 1 | 0 |
| downloading EXE files | 1 | 0 |
| HKE | 1 | 0 |
| Domino .nsf | 1 | 0 |
| audio | 1 | 0 |
| VIV (Vivo) | 1 | 0 |
| MPEG | 1 | 0 |
| publisher 98 | 1 | 0 |
| af3 (ABC Flowchart) | 1 | 0 |
| dot (GML) | 1 | 0 |
| PTML | 1 | 0 |
| WAV | 1 | 0 |
| MP3 | 1 | 0 |