Some limits are in effect; some traits may not be intuitive


ht://Dig traits

ht://Dig indexes words, weighting them based on their position in the text and the HTML elements they appear in. Each URL where the searched word or combination of words is found gets points based on the weight of the locations of those words in the document at that URL.
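The weighting idea above can be illustrated with a minimal sketch. It is not ht://Dig's actual algorithm: the weight values, the location names, the index layout and the function name `score_documents` are all assumptions made up for the illustration.

```python
from collections import defaultdict

# Hypothetical weights per location in a document; these are not
# ht://Dig's real values, only an illustration of the principle.
LOCATION_WEIGHTS = {"title": 10, "heading": 5, "text": 1}


def score_documents(index, query_words):
    """Rank URLs by the summed weights of the locations where query words occur.

    `index` is an assumed structure mapping a lowercase word to a list of
    (url, location) occurrences.
    """
    scores = defaultdict(int)
    for word in query_words:
        for url, location in index.get(word.lower(), []):
            # Unknown locations count as plain text.
            scores[url] += LOCATION_WEIGHTS.get(location, 1)
    # Highest score first; ties are ordered by URL so the output is stable.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# A tiny made-up index with two pages.
index = {
    "egcs": [("a.html", "title"), ("b.html", "text")],
    "parser": [("b.html", "heading"), ("b.html", "text")],
}
```

With this toy index, a search for "egcs parser" ranks a.html ahead of b.html, because a single hit in a title outweighs several hits in headings and body text.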

ht://Dig has less functionality than other popular search engines, such as AltaVista.
Some of its most obvious limits are:


System limits of this installation

Some of these are necessary because of the CGI resource limit policy of my IPP, pair networks; some are further restricted. This means that you may see an error page apologizing for the shortcoming instead of a results page. If the search time would have exceeded 4 minutes, a retry may succeed.


This ht://Dig configuration


The indexed mailing list contents

The old hypermail setup for the egcs mailing lists (up to February 1999) saved the messages in a format that is a mix between the original message and HTML. Text that contains characters associated with markup is not straightforwardly viewable in a browser; worse, a search engine will miss or misinterpret information. Fortunately, it is easy to parse this pseudo-HTML and pick out the information that is worth indexing. An external parser is used for that purpose in this setup.
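As a rough sketch of what such an external parser might do, the function below pulls the message body out of an archived page and keeps markup-like text such as "<stdio.h>" intact. The body marker comments, the list of archiver-added tags and the name `extract_indexable_text` are assumptions for illustration; the actual archive layout and the parser used in this setup may differ.

```python
import re

# Assumed comment markers delimiting the message body in an archived page.
# The real archive layout is not described here, so treat these as placeholders.
BODY_START = '<!-- body="start" -->'
BODY_END = '<!-- body="end" -->'

# Only a few tags that the archiver itself is assumed to add are removed.
# Anything else in angle brackets is treated as part of the message text.
ARCHIVER_TAGS = re.compile(r"</?(?:pre|p|br|a|i|em)\b[^>]*>", re.IGNORECASE)


def extract_indexable_text(page):
    """Return the message body as plain text, or "" if no body is found."""
    start = page.find(BODY_START)
    end = page.find(BODY_END)
    if start == -1 or end == -1 or end < start:
        return ""
    body = page[start + len(BODY_START):end]
    body = ARCHIVER_TAGS.sub("", body)
    # Collapse runs of whitespace so the indexer sees a clean word stream.
    return " ".join(body.split())


# A made-up page fragment in the assumed layout.
page = (
    'navigation links<!-- body="start" --><pre>#include <stdio.h>\n'
    'int main()</pre><!-- body="end" -->footer'
)
```

The design choice here is to remove a short list of known tags rather than everything that looks like a tag, since messages on a compiler mailing list often contain angle brackets that are content, not markup. A message quoting a header such as "<a.h>" would still be mangled by this simple pattern.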

Things became a little bit easier with MHonArc, although the need for specific parsing still applies, for example to avoid indexing attachments and irrelevant parts of message headers.

There is a difference in what is shown in the excerpt between the older hypermail archives and the newer MHonArc ones, but all in all it is a difference that I believe is not worth fixing. The hypermail-based ones had author name, email address and time of message, while the newer ones just have the author name.


Last modified: April 24, 1999
Complaints to webmaster@bitrange.com