Subject: Re: Search for "BLOCKQUOTE" in lugnet.faq
Newsgroups: lugnet.admin.general
Date: Sat, 21 Aug 1999 10:14:14 GMT
In lugnet.admin.general, Todd Lehman writes:
> In lugnet.admin.general, "Robert Munafo" <munafo@gcctech.com> writes:
> > [...]
> > I'm not too concerned about being able to search for non-alphanumeric
> > characters in LUGNET messages, except maybe the hyphen issue someone else
> > has brought up. However, alphanumeric words like "BLOCKQUOTE" should be
> > findable even if they have non-alphanumeric characters adjacent to them,
> > as long as it's something the author originally typed in their message.
>
> Agreed! Again, I'm sorry for the "bug" and I don't know what I was
> thinking when I told the indexer to ignore HTML tags. I'll fix this when
> I re-work the indexer and add timestamps to the index.
I'm tinkering with the indexer a bit and I think I remember now what I
was thinking when I told it to ignore HTML tags before. I think I wasn't
trying to filter out HTML... I think I was trying to filter out message
IDs! (Durnit for not having written that in a comment.)
Because message IDs are a real bummer of garbage as far as indexing is concerned.
There are quite a few messages with lines like:
Frank Filz wrote in message <37A9B055.60E3@mindspring.com>...
Larry Pieniazek wrote in message <37BAD526.FADBED5A@voyager.net>...
and they just make clutter in the index because they're random strings.
I must've figured that filtering out /<.*?>/ would have the benevolent
side-effect of filtering out HTML. Of course, a better regex for message IDs
is /<.*?\@.*?>/. Unfortunately, that still filters out e-mail addresses
written in angle brackets. Probably better to match on
m/^(.*)?wrote in message <.*?\@.*?>.*$/
and keep $1, throwing away the rest of the line. Or something like that.
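Here is a rough sketch of the three patterns side by side, in case it helps
to see what each one eats. It is written in Python purely as a stand-in for
the Perl-style regexes above, not as the indexer's actual code. The names
ANY_ANGLE, ANGLE_WITH_AT, ATTRIBUTION and strip_attribution_id are made up
for illustration, and trimming the trailing space from the kept part is an
assumption too.

```python
import re

# Pattern 1: /<.*?>/ removes anything in angle brackets,
# which is why a typed word like <BLOCKQUOTE> disappears.
ANY_ANGLE = re.compile(r"<.*?>")

# Pattern 2: /<.*?\@.*?>/ removes only bracketed strings containing "@".
# HTML tags survive, but bracketed e-mail addresses are still lost.
ANGLE_WITH_AT = re.compile(r"<.*?@.*?>")

# Pattern 3: match a whole attribution line and capture what precedes
# "wrote in message <...@...>" (the equivalent of keeping $1).
ATTRIBUTION = re.compile(r"^(.*)?wrote in message <.*?@.*?>.*$")


def strip_attribution_id(line: str) -> str:
    """Return the poster's name from an attribution line, else the line as-is."""
    m = ATTRIBUTION.match(line)
    if m is None:
        return line  # not an attribution line: leave it for the indexer
    # Keep group 1 and drop the rest; rstrip() trims the space before "wrote".
    return (m.group(1) or "").rstrip()


if __name__ == "__main__":
    samples = [
        "Frank Filz wrote in message <37A9B055.60E3@mindspring.com>...",
        "Use <BLOCKQUOTE> for quoted text",
        "Mail me at <munafo@gcctech.com>",
    ]
    for s in samples:
        print(repr(ANY_ANGLE.sub("", s)),
              repr(ANGLE_WITH_AT.sub("", s)),
              repr(strip_attribution_id(s)))
```

Only the third approach leaves both the HTML-looking word and the bracketed
e-mail address alone, since it touches a line only when the
"wrote in message" wording is present.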
--Todd