Subject: Re: News search function reactivated (was: News search function temporarily disabled)
Newsgroups: lugnet.admin.general
Date: Wed, 3 Jan 2001 14:44:55 GMT
Viewed: 1290 times

In lugnet.admin.general, Todd Lehman writes:
> I'm not sure why I included "http" in the stopword list. I apologize for
> that. (Probably because it would have generated zillions upon zillions of
> word hits. "com" is the most frequently indexed word here.)
>
> "http" could certainly stand to go back in now that the query engine is so
> much faster. Maybe even other words like "it" and "that" and "the."
As an algorithmic guess, I think I'd probably attempt something a bit
different... If someone enters:
    the
I'd probably want to ignore it. But if they entered:
    the best design
I might want to consider the 'the'. Dunno. I'd probably test an algorithm
that ignored all stopwords in the initial search, but then, for each result
of the initial search, scored it with respect to any stopwords found (and
proximity, etc.).
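A rough sketch of that two-pass idea, in Python. This is only a guess at how it might look, not how the LUGNET query engine works: the first pass matches on the non-stopword terms alone, and the second pass adds a small bonus when a stopword from the query sits near one of those terms in the post. The names `search`, `WINDOW` and `STOP_BONUS`, and the weights themselves, are made up for illustration.

```python
import re

# Stopword list as quoted in this thread.
STOPWORDS = set("""
a an the it its it's this that what
i i'm im my we me us you
do be am is are was can has
of for from with to in out on off at as if and but or not no have so
http www
""".split())

WINDOW = 2        # hypothetical: max word distance that counts as "near"
STOP_BONUS = 0.5  # hypothetical: weight for a stopword found near a real term


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def search(query, docs):
    """docs maps an article id to its text; returns (id, score), best first."""
    terms = tokenize(query)
    content = [t for t in terms if t not in STOPWORDS]
    stops = [t for t in terms if t in STOPWORDS]
    if not content:
        return []  # the query was nothing but stopwords (e.g. "the"): ignore it

    results = []
    for doc_id, text in docs.items():
        words = tokenize(text)
        # Pass 1: stopwords are ignored; every other term must be present.
        if not all(t in words for t in content):
            continue
        score = float(len(content))
        # Pass 2: re-score the hit using the stopwords from the query,
        # counting one only if it appears close to a non-stopword term.
        content_pos = [i for i, w in enumerate(words) if w in content]
        for s in stops:
            stop_pos = [i for i, w in enumerate(words) if w == s]
            if any(abs(i - j) <= WINDOW for i in stop_pos for j in content_pos):
                score += STOP_BONUS
        results.append((doc_id, score))

    results.sort(key=lambda r: -r[1])
    return results
```

With this, a query of just "the" returns nothing, while "the best design" ranks a post containing "the best design" above one that merely contains "best" and "design" somewhere.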
> Here's a list of stopwords, BTW...are there any you see here that stand out
> in your mind as having given you problems in the past?
>
> a an the it its it's this that what
> i i'm im my we me us you
> do be am is are was can has
> of for from with to in out on off at as if and but or not no have so
> http www
The only things here that've given me problems (or that I'd expect might
give others problems [apart from language differences]) are http and www.
And now that I look at it, I've got to ask: if I specified "http" as a
search parameter, would it in fact score every post made with the webserver
as having an instance, thanks to the X-Nntp-Gateway header? Same with www, I
suppose, except that all new webserver posts are from 'news.lugnet.com'
instead, but all older posts would show up? Or is header info ignored?
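To make the header question concrete, here is a small Python sketch of an indexer that only looks at the body plus a short list of chosen headers. I don't know what the real indexer does; `indexable_words` and the default of indexing only Subject are assumptions for the sake of the example.

```python
import re
from email import message_from_string


def indexable_words(raw_article, indexed_headers=("Subject",)):
    """Return the words an indexer would see if it skipped most headers.

    Hypothetical helper: only the headers named in indexed_headers and the
    article body are tokenized, so gateway or path headers never add hits.
    """
    msg = message_from_string(raw_article)
    parts = [msg.get(name, "") for name in indexed_headers]
    payload = msg.get_payload()
    if isinstance(payload, str):  # single-part article: payload is the body
        parts.append(payload)
    return re.findall(r"[a-z0-9']+", " ".join(parts).lower())
```

Under that scheme a gateway header containing a URL would never produce an "http" or "www" hit; only URLs typed into the body or subject would.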
> One thing I *really* hate about stopword lists is that they're so language-
> centric (i.e., a stopword in one language might be a darn-tootin' regular
> good word in another language).
It does present a rather unique problem (unless the search proves capable of
handling stopwords)... Fortunately the vast majority of posts seem to be in
one language for now, even though there has been a distinct increase in some
other languages in the last year and a half or so :)
> I've planned ahead here. For each group, there'll be a list of articles that
> comprise the heads-of-threads for those groups. That list can then be used to
> generate more compact views into the group, or it can also be fed into the
> query engine as an "include only these" filter. In memory, once loaded, the
> article filter lists are 1-bit flags -- 1 bit per article position -- so even
> a list of a quarter million articles consumes only 30 KB of memory for the
> fraction of a second that it's needed.
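For a sense of scale, the 1-bit-per-article filter described above can be sketched in Python like this. It is an illustration only, not the actual implementation; the class name `ArticleFilter` and the helper `apply_filter` are hypothetical.

```python
class ArticleFilter:
    """One bit per article position; a set bit means "include this article"."""

    def __init__(self, n_articles):
        self.n = n_articles
        # Round up to whole bytes: 8 article positions per byte.
        self.bits = bytearray((n_articles + 7) // 8)

    def add(self, pos):
        """Mark article position pos as included."""
        self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, pos):
        if not 0 <= pos < self.n:
            return False
        return bool(self.bits[pos >> 3] & (1 << (pos & 7)))


def apply_filter(hits, article_filter):
    """Keep only the search hits whose position is flagged in the filter."""
    return [pos for pos in hits if pos in article_filter]
```

A filter sized for 250,000 articles takes 250,000 / 8 = 31,250 bytes, which is where the "only 30 KB" figure comes from.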
Very cool! You mention later the prospect of having 'folders' for things
like 'cool sites' or 'space models' or 'great building ideas' for members
using the web-interface... Would this be something user-configurable? Could
I create as many folders as I would want, or would there be a set number of
X folders that would be pre-determined?
Either way (assuming people wouldn't make hundreds of folders for
themselves) it sounds like there's not a space concern (my worst estimates
of millions of posts and thousands of users weren't all that bad
considering). And as long as the interface isn't doing something dynamic
with the folders (making little action icons for each folder on article
listing pages, etc), sounds like there wouldn't be a real time strain
either... A very cool idea!
DaveE