Subject: Re: News search function reactivated (was: News search function temporarily disabled)
Newsgroups: lugnet.admin.general
Date: Wed, 3 Jan 2001 14:44:55 GMT
In lugnet.admin.general, Todd Lehman writes:
I'm not sure why I included "http" in the stopword list.  I apologize for
that.  (Probably because it would have generated zillions upon zillions of
word hits.  "com" is the most frequently indexed word here.)

"http" could certainly stand to go back in now that the query engine is so
much faster.  Maybe even other words like "it" and "that" and "the."

As an algorithmic guess, I think I'd probably attempt something a bit
different... If someone enters:

the

I'd probably want to ignore it. But if they entered:

the best design

I might want to consider the 'the'. Dunno. I'd probably test an algorithm
that ignored all stopwords in the initial search, but then for each result
of the initial search, score it with respect to any stopwords found (&
proximity, etc).
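
To make that idea concrete, here is a rough sketch of the two-pass approach: match on the non-stopwords first, then give each hit a small bonus when a stopword from the query sits close to one of the matched words. Everything in it is an assumption for illustration (the function names, the abbreviated stopword set, the 2-word proximity window, and the 0.5 bonus); it is not how the LUGNET query engine works.

```python
import re

# Abbreviated stopword set, for illustration only.
STOPWORDS = {"a", "an", "the", "it", "this", "that", "of", "for", "to", "in"}

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def search(query, articles, window=2, bonus=0.5):
    """Return article ids ranked by score, best first.

    Pass 1 requires every non-stopword of the query to appear.
    Pass 2 adds `bonus` per query stopword found within `window`
    word positions of any matched non-stopword.
    """
    terms = tokenize(query)
    content = [w for w in terms if w not in STOPWORDS]
    stops = [w for w in terms if w in STOPWORDS]
    if not content:
        return []  # a query made only of stopwords (e.g. "the") is ignored

    results = []
    for art_id, text in articles.items():
        words = tokenize(text)
        positions = {w: [i for i, t in enumerate(words) if t == w]
                     for w in set(terms)}
        # Pass 1: every content word must occur somewhere in the article.
        if not all(positions[w] for w in content):
            continue
        score = float(len(content))
        # Pass 2: reward stopwords that occur near a content word.
        for s in stops:
            if any(abs(i - j) <= window
                   for i in positions[s]
                   for w in content
                   for j in positions[w]):
                score += bonus
        results.append((score, art_id))

    results.sort(key=lambda r: (-r[0], r[1]))
    return [art_id for _, art_id in results]
```

With this, a query of just "the" comes back empty, while "the best design" still finds posts that only contain "best" and "design" but ranks a post containing the full phrase ahead of them.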

Here's a list of stopwords, BTW...are there any you see here that stand out
in your mind as having given you problems in the past?

  a an the it its it's this that what
  i i'm im my we me us you
  do be am is are was can has
  of for from with to in out on off at as if and but or not no have so
  http www

The only things here that've given me problems (or that I'd expect might
give others problems [apart from language differences]) are http and www.
And now that I look at it, I've got to ask, if I specified "http" as a
search parameter, would it in fact score every post made with the webserver
as having an instance thanks to the X-Nntp-Gateway header? Same with www, I
suppose, except that all new webserver posts are from 'news.lugnet.com'
instead, but would all the older posts show up? Or is the header info ignored?
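
For what it's worth, keeping headers out of the index is simple if the indexer only ever sees the body. A minimal sketch using Python's standard `email` module follows; the function name `body_tokens` is hypothetical, and whether the actual indexer skips headers this way isn't stated anywhere in this thread.

```python
import email
import re

def body_tokens(raw_article):
    """Return the word tokens of an article's body, ignoring all headers.

    Headers (everything before the first blank line) are parsed off by
    the email module, so a URL in a header line never reaches the index.
    """
    msg = email.message_from_string(raw_article)
    body = msg.get_payload()
    if not isinstance(body, str):
        # Multipart articles are simply skipped in this sketch.
        return []
    return re.findall(r"[a-z0-9']+", body.lower())
```

Under that scheme, "http" in a gateway header would not count as a hit; only an "http" typed into the message text itself would.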

One thing I *really* hate about stopword lists is that they're so language-
centric (i.e., a stopword in one language might be a darn-tootin' regular
good word in another language).

It does present a rather unique problem (unless the search proves capable of
handling stopwords)... Fortunately the vast majority of posts seem to be in
one language for now, even though there has been a distinct increase in some
other languages in the last year and a half or so :)

I've planned ahead here.  For each group, there'll be a list of articles that
comprise the heads-of-threads for those groups.  That list can then be used to
generate more compact views into the group, or it can also be fed into the
query engine as an "include only these" filter.  In memory, once loaded, the
article filter lists are 1-bit flags -- 1 bit per article position -- so even
a list of a quarter million articles consumes only 30 KB of memory for the
fraction of a second that it's needed.
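
A rough sketch of such a 1-bit-per-article filter, to show where the 30 KB figure comes from: 250,000 articles at 1 bit each is 250,000 / 8 = 31,250 bytes. The class and function names below are made up for illustration and are not taken from the LUGNET code.

```python
class ArticleFilter:
    """Bit-flag set over article positions: 1 bit per article."""

    def __init__(self, n_articles):
        self.n = n_articles
        # Round up to a whole number of bytes.
        self.bits = bytearray((n_articles + 7) // 8)

    def add(self, pos):
        """Flag the article at position `pos` as included."""
        if not 0 <= pos < self.n:
            raise IndexError(pos)
        self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, pos):
        return (0 <= pos < self.n
                and bool(self.bits[pos >> 3] & (1 << (pos & 7))))


def apply_filter(hits, article_filter):
    """Keep only the query hits whose position is flagged in the filter."""
    return [h for h in hits if h in article_filter]
```

For example, flagging positions 0, 8 and 249,999 as heads-of-threads and then filtering the hit list [0, 1, 8, 9, 249999] leaves [0, 8, 249999], with the whole filter occupying 31,250 bytes.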

Very cool! You mention later the prospect of having 'folders' for things
like 'cool sites' or 'space models' or 'great building ideas' for members
using the web-interface... Would this be something user-configurable? Could
I create as many folders as I would want, or would there be a set number of
X folders that would be pre-determined?

Either way (assuming people wouldn't make hundreds of folders for
themselves) it sounds like there's not a space concern (my worst estimates
of millions of posts and thousands of users weren't all that bad
considering). And as long as the interface isn't doing something dynamic
with the folders (making little action icons for each folder on article
listing pages, etc), sounds like there wouldn't be a real time strain
either... A very cool idea!

DaveE