Some limits are in effect, and some traits may not be intuitive.
ht://Dig indexes words, weighting them by their position in
the text and by the HTML elements that contain them. Each URL
where the searched-for word or combination of words is found
gets points based on the weights of those words' locations in
the document.
ht://Dig has less functionality than other popular search
engines, such as AltaVista.
Some of its most obvious limits are:
- No search for "phrases" (consecutive words).
- There is no "near" function in search methods.
- Matching punctuation is not possible.
- The "not" operator in the boolean search method is
non-intuitive and does not work as in other search
engines. In ht://Dig, it is a binary operator and behaves
as if it were named "without".
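The "without" behaviour can be sketched as a set difference over the documents matching each term. The following Python snippet only illustrates the semantics; the index, document IDs, and function name are invented for this example and are not taken from ht://Dig:

```python
# Toy inverted index: term -> set of document IDs containing it.
# All data here is invented for illustration.
index = {
    "gcc":   {1, 2, 3},
    "patch": {2, 3},
    "bug":   {3, 4},
}

def without(left, right):
    """Sketch of 'left not right': documents containing the left
    term, minus those containing the right term."""
    return index.get(left, set()) - index.get(right, set())

print(sorted(without("gcc", "patch")))  # → [1]
```

So "gcc not patch" does not mean "anything that is not a patch"; it means "documents with gcc, without patch".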
System limits of this installation
Some of these are necessary because of the CGI resource limit
policy of my IPP, pair networks; some are further restricted.
- No more than 3 Mbyte of data, of which 1 Mbyte can be locked.
- No search may take more than 10 CPU seconds.
- No search may take longer than 4 "real" (wall-clock) minutes.
This means that you may see an error page apologizing for the
shortcoming, instead of a results page. If the search failed
because it would have exceeded 4 minutes, a retry may be
successful.
This ht://Dig configuration
- The saved excerpts (seen in search results of the default
"long" type) are only 200 bytes long.
- Words consist of alphanumeric characters and the
_ character.
- Punctuation characters delimit words.
The only folded punctuation character (the valid_punctuation
attribute) is the * character.
- Four-digit numbers
and decimal numbers starting with zero are not searchable.
- Leading or trailing _ characters are removed
from words before indexing; you need to remove them when searching.
- Only the first twelve characters in a word are
significant. This is the default compile-time limit of ht://Dig.
- Many words are listed as
"bad"; they are not searchable because they
are too common to be useful. Words of two characters or
fewer are not searchable.
- Attachments (as far as recognized) are not indexed,
except for the first that has MIME-type text/plain
or text/html.
- Uuencoded data (whether in attachments or not)
is not indexed.
- The mail-headers, as far as recognized, are not searchable.
- The mail-headers are not included in the excerpt.
- Some messages are not indexed because they consist of
commercial content with no obvious egcs-related interest
(a.k.a. spam (TM)).
- Others are not indexed because they contain just about
nothing but faulty source code or a large attachment that
was not recognized by other machinery.
- Patches (unified and context format, as generated by, for
example, GNU diff or CVS) are not indexed. The rationale is
that the information in a patch is very unlikely to be
interesting enough for indexing, while the patch contributes
lots of new "words"; noise. Indexing the patches would amount
to indexing the sources, which is not the goal here; there
are better methods if that's what you're looking for, such as
the grep and etags programs. The
ChangeLogs that come in plaintext with the patches
are of course indexed, so any need for browsing the history
of changes should be catered for.
- As far as recognized, quoted text from original messages in
replies is considered less important than the reply itself, and
may or may not be included before the reply in the excerpt.
- Only the exact
search algorithm is used; no synonyms or word endings.
- The database is updated sporadically, most often once a week.
- There may be bugs in the setup.
The ones that cause a search to fail with an internal
search engine error are easily spotted and tracked. I
fix those as soon as possible, and list progress on the timeline page.
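Taken together, the word rules above (alphanumerics plus _, punctuation as delimiter, stripped leading/trailing underscores, twelve significant characters, minimum length, "bad" words) can be approximated by a short sketch. This is only an illustration under those stated rules; the bad-word list, the case folding, and the function itself are invented here, and details such as the number rules are not modeled:

```python
import re

BAD_WORDS = {"the", "and", "for"}  # stand-in for the real bad-word list
MIN_LEN = 3        # words of two characters or fewer are dropped
SIGNIFICANT = 12   # compile-time prefix limit mentioned above

def index_terms(text):
    """Approximate the word rules listed above (illustration only)."""
    terms = []
    # Words are runs of alphanumerics and '_'; punctuation delimits.
    # Case-folded here for simplicity.
    for word in re.findall(r"[A-Za-z0-9_]+", text.lower()):
        word = word.strip("_")           # leading/trailing '_' removed
        if len(word) < MIN_LEN or word in BAD_WORDS:
            continue
        terms.append(word[:SIGNIFICANT])  # only the prefix is significant
    return terms

print(index_terms("__foo_bar__ the egcs-1.1.2 reorganization!"))
# → ['foo_bar', 'egcs', 'reorganizati']
```

Note how "reorganization" is truncated to twelve characters, so a search for "reorganizationally" would match it too.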
The indexed mailing list contents
The old hypermail setup for the egcs mailing lists (up to
February 1999) saved the
messages in something that is a mix between the original
message and HTML. Text that contains characters associated
with markup is not
straightforwardly viewable in a browser, and worse, a search
engine will miss or misinterpret information. Fortunately,
it is easy to parse this pseudo-html, and pick out the
information that is worth indexing. An external parser is
used for that purpose in this setup.
Things became a little bit easier with MHonArc,
although the need for specific parsing still applies, for
example to avoid indexing attachments and irrelevant parts
of message headers.
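As a rough illustration of such a pre-indexing parse (not the external parser actually used; the tag stripping and quote detection below are invented stand-ins):

```python
import re

def extract_indexable(pseudo_html):
    """Toy stand-in for the external parser mentioned above:
    strip tag-like markup and drop quoted lines, keeping only
    text worth indexing. The real parser's rules are not shown here."""
    kept = []
    for line in pseudo_html.splitlines():
        line = re.sub(r"<[^>]*>", "", line)   # drop tag-like markup
        if line.lstrip().startswith(">"):     # skip quoted original text
            continue
        if line.strip():
            kept.append(line.strip())
    return "\n".join(kept)

msg = "<p>Thanks!</p>\n> quoted original\nThe fix works for me."
print(extract_indexable(msg))  # → "Thanks!\nThe fix works for me."
```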
There is a difference between what is shown in the excerpt
for the older hypermail archives and the newer MHonArc
ones, but all in all it is a difference I believe is not
worth fixing. The hypermail-based excerpts had the author
name, email address and time of the message, while the newer
ones have just the author name.
Last modified: April 24, 1999
Complaints to webmaster@bitrange.com