Some limits are in effect, and some traits may not be intuitive.
ht://Dig indexes words, weighting them by their position in
the text and by the HTML elements that contain them. Each URL
where the searched-for word or combination of words is found
gets points based on the weights of those words' locations in
the document.
ht://Dig has less functionality than other popular search
engines, such as AltaVista.
Some of its most obvious limits are:
- No search for "phrases" (consecutive words).
- There is no "near" function in search methods.
- Matching punctuation is not possible.
- The "not" operator in the boolean search method is
non-intuitive and does not work as in other search
engines. In ht://Dig, it is a binary operator and behaves
as if it were named "without".
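The "without" behaviour can be sketched as a set difference over the documents matching each term. The following Python snippet only illustrates the semantics; the index, document IDs, and function name are invented for this example and are not taken from ht://Dig:

```python
# Toy inverted index: term -> set of document IDs containing it.
# All data here is invented for illustration.
index = {
    "gcc":   {1, 2, 3},
    "patch": {2, 3},
    "bug":   {3, 4},
}

def without(left, right):
    """Sketch of 'left not right': documents containing the left
    term, minus those containing the right term."""
    return index.get(left, set()) - index.get(right, set())

print(sorted(without("gcc", "patch")))  # → [1]
```

So "gcc not patch" does not mean "anything that is not a patch"; it means "documents with gcc, without patch".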
System limits of this installation
Some of these are necessary because of the CGI resource limit
policy of my IPP, pair networks; some are further restricted.
- No more than 3 Mbyte of data, of which 1 Mbyte can be locked.
- No search may take more than 10 CPU seconds.
- No search may take longer than 4 "real" (wall-clock) minutes.
This means that you may see an error page apologizing for the
shortcoming, instead of a results page. If the search failed
because it would have exceeded 4 minutes, a retry may be
successful.
This ht://Dig configuration
- The saved excerpts (seen in search results of the default
"long" type) are only 200 bytes long.
- Words consist of alphanumeric characters and the
_ character.
- Punctuation characters delimit words.
The only folded punctuation character (the valid_punctuation
attribute) is the * character.
- Four-digit numbers
and decimal numbers starting with zero are not searchable.
- Leading or trailing _ characters are removed
from words before indexing; you need to remove them when searching.
- Only the first twelve characters in a word are
significant. This is the default compile-time limit of ht://Dig.
- Many words are listed as
"bad"; they are not searchable because they
are too common to be useful. Words of two characters or
fewer are not searchable.
- Attachments (as far as recognized) are not indexed,
except for the first that has MIME-type text/plain
or text/html.
- Uuencoded data (whether in attachments or not)
is not indexed.
- The mail-headers, as far as recognized, are not searchable.
- The mail-headers are not included in the excerpt.
- Some messages are not indexed because they consist of
commercial content with no obvious egcs-related interest
(a.k.a. spam (TM)).
- Others are not indexed because they contain just about
nothing but faulty source code or a large attachment that
was not recognized by other machinery.
- Patches (unified and context format, as generated by, for
example, GNU diff or CVS) are not indexed. The rationale is
that the information in a patch is very unlikely to be
interesting enough for indexing, while the patch contributes
lots of new "words"; noise. Indexing the patches would amount
to indexing the sources, which is not the goal here; there
are better methods if that's what you're looking for, such as
the grep and etags programs. The
ChangeLogs that come in plaintext with the patches
are of course indexed, so any need for browsing the history
of changes should be catered for.
- As far as recognized, quoted text from original messages in
replies is considered less important than the reply itself, and
may or may not be included before the reply in the excerpt.
- Only the exact
search algorithm is used; no synonyms or word endings.
- The database is updated sporadically, most often once a week.
- There may be bugs in the setup.
The ones that cause a search to fail with an internal
search engine error are easily spotted and tracked. I
fix those as soon as possible, and list progress on the timeline page.
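Taken together, the word rules above (alphanumerics plus _, punctuation as delimiter, stripped leading/trailing underscores, twelve significant characters, minimum length, "bad" words) can be approximated by a short sketch. This is only an illustration under those stated rules; the bad-word list, the case folding, and the function itself are invented here, and details such as the number rules are not modeled:

```python
import re

BAD_WORDS = {"the", "and", "for"}  # stand-in for the real bad-word list
MIN_LEN = 3        # words of two characters or fewer are dropped
SIGNIFICANT = 12   # compile-time prefix limit mentioned above

def index_terms(text):
    """Approximate the word rules listed above (illustration only)."""
    terms = []
    # Words are runs of alphanumerics and '_'; punctuation delimits.
    # Case-folded here for simplicity.
    for word in re.findall(r"[A-Za-z0-9_]+", text.lower()):
        word = word.strip("_")           # leading/trailing '_' removed
        if len(word) < MIN_LEN or word in BAD_WORDS:
            continue
        terms.append(word[:SIGNIFICANT])  # only the prefix is significant
    return terms

print(index_terms("__foo_bar__ the egcs-1.1.2 reorganization!"))
# → ['foo_bar', 'egcs', 'reorganizati']
```

Note how "reorganization" is truncated to twelve characters, so a search for "reorganizationally" would match it too.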
The indexed mailing list contents
The old hypermail setup for the egcs mailing lists (up to
February 1999) saved the
messages in something that is a mix between the original
message and HTML. Text that contains characters associated
with markup is not
straightforwardly viewable in a browser, and worse, a search
engine will miss or misinterpret information. Fortunately,
it is easy to parse this pseudo-html, and pick out the
information that is worth indexing. An external parser is
used for that purpose in this setup.
Things became a little bit easier with MHonArc,
although the need for specific parsing still applies, for
example to avoid indexing attachments and irrelevant parts
of message headers.
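As a rough illustration of such a pre-indexing parse (not the external parser actually used; the tag stripping and quote detection below are invented stand-ins):

```python
import re

def extract_indexable(pseudo_html):
    """Toy stand-in for the external parser mentioned above:
    strip tag-like markup and drop quoted lines, keeping only
    text worth indexing. The real parser's rules are not shown here."""
    kept = []
    for line in pseudo_html.splitlines():
        line = re.sub(r"<[^>]*>", "", line)   # drop tag-like markup
        if line.lstrip().startswith(">"):     # skip quoted original text
            continue
        if line.strip():
            kept.append(line.strip())
    return "\n".join(kept)

msg = "<p>Thanks!</p>\n> quoted original\nThe fix works for me."
print(extract_indexable(msg))  # → "Thanks!\nThe fix works for me."
```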
There is a difference between what is shown in the excerpt
for the older hypermail archives and the newer MHonArc
ones, but all in all it is a difference I believe is not
worth fixing. The hypermail-based excerpts had the author
name, email address and time of the message, while the newer
ones have just the author name.
Last modified: April 24, 1999
Complaints to webmaster@bitrange.com