Subject: Re: News search function reactivated (was: News search function temporarily disabled)
Newsgroups: lugnet.admin.general
Date: Thu, 4 Jan 2001 05:07:43 GMT
  
In lugnet.admin.general, David Eaton writes:
As an algorithmical guess, I think I'd probably attempt something a bit
different... If someone enters:
   the
I'd probably want to ignore it. But if they entered:
   the best design
I might want to consider the 'the'. Dunno. I'd probably test an algorithm
that ignored all stopwords in the initial search, but then for each result
of the initial search, score it with respect to any stopwords found (&
proximity, etc).
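
A rough sketch of that two-pass idea, in Python. The stopword list, the whitespace tokenizer, and the 0.5 bonus are all assumptions made up for illustration; this is not how the LUGNET search actually works.

```python
# Hypothetical stopword list, for illustration only.
STOPWORDS = {"the", "a", "an", "of", "and"}

def tokenize(text):
    """Naive tokenizer: lowercase and split on whitespace."""
    return text.lower().split()

def search(query, docs):
    """docs maps doc_id -> text. Returns matching doc ids, best first."""
    terms = tokenize(query)
    content = [t for t in terms if t not in STOPWORDS]
    stops = [t for t in terms if t in STOPWORDS]
    if not content:
        return []  # a query made only of stopwords (e.g. "the") is ignored

    scored = []
    for doc_id, text in docs.items():
        words = tokenize(text)
        # Pass 1: stopwords are ignored; every other term must be present.
        if not all(t in words for t in content):
            continue
        score = float(len(content))
        # Pass 2: small proximity bonus (arbitrary 0.5) whenever a queried
        # stopword sits directly in front of one of the content terms.
        for i, w in enumerate(words[:-1]):
            if w in stops and words[i + 1] in content:
                score += 0.5
        scored.append((score, doc_id))

    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]
```

With this, a query of just "the" returns nothing, while "the best design" ranks an article containing "the best design" above one that only contains "best" and "design" somewhere.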

Oh -- actually, what search engines typically do on queries (and I just
finally added this last week) is downvalue relatively common words and
upvalue relatively uncommon words -- what's called "term ranking" or "term
weighting."  For example, when you search for "lego duplo", there are
(currently) about 65,000 documents accounting for about 140,000 word-hits
of "lego" and only about 1500 documents accounting for only about 3300
word-hits of "duplo".  So the importance of the word "duplo" is very high
relative to the word "lego".  Similarly, there are tons of articles with
"david" but only about one fifth as many with "eaton" -- so a search for
"david eaton" has an easier time finding David Eaton among all the other
Davids.
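
To make the weighting concrete, here is a minimal sketch using inverse document frequency, which is one common way to do term weighting and not necessarily the exact formula used here. The document counts are the approximate ones quoted above; TOTAL_DOCS is a made-up figure, since the total number of indexed articles isn't stated.

```python
import math

def idf(total_docs, docs_with_term):
    """Inverse document frequency: rarer terms get larger weights."""
    return math.log(total_docs / docs_with_term)

TOTAL_DOCS = 200_000  # assumption: the real index size is not given
doc_freq = {"lego": 65_000, "duplo": 1_500}  # approximate counts from the post

weights = {term: idf(TOTAL_DOCS, df) for term, df in doc_freq.items()}

def score(query_terms, doc_terms):
    """Sum the weights of the query terms that appear in the document."""
    return sum(weights[t] for t in query_terms if t in doc_terms)
```

Under these assumptions "duplo" carries several times the weight of "lego", so an article that mentions only "duplo" outscores one that mentions only "lego".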


The only things here that've given me problems (or that I'd expect might
give others problems [apart from language differences]) are http and www.
And now that I look at it, I've got to ask, if I specified "http" as a
search parameter, would it in fact score every post made with the webserver
as having an instance thanks to the X-Nntp-Gateway?  Same with www, I
suppose, except that all new webserver posts are from 'news.lugnet.com'
instead -- but all older posts would show up? Or is various header info
ignored?

Right.  It only includes these headers in the indexed text:

   X-Real-Life-Name:
   Original-From:
   From:
   Subject:
   Keywords:
   Summary:

Other headers are ignored.  Quoted content is also ignored.  And it also does
its best to ignore lines like "on such and such a date, so and so wrote..."
and sigs.
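
As an illustration of that kind of filtering, here is a small Python sketch using the standard email module. The header whitelist is the one listed above; the regular expressions for quoted lines and attribution lines, and the "-- " signature delimiter check, are guesses at reasonable heuristics rather than the actual rules.

```python
import re
from email import message_from_string

# Headers whose values are indexed (taken from the list above).
INDEXED_HEADERS = ("X-Real-Life-Name", "Original-From", "From",
                   "Subject", "Keywords", "Summary")

# Heuristics (assumptions): quoted lines start with ">", and attribution
# lines end in "writes:" or "wrote:".
QUOTE_RE = re.compile(r"^\s*>")
ATTRIBUTION_RE = re.compile(r"(writes|wrote):\s*$", re.IGNORECASE)

def indexable_text(raw_article):
    """Return only the text of a plain-text article that should be indexed."""
    msg = message_from_string(raw_article)
    parts = [msg[h] for h in INDEXED_HEADERS if msg[h]]
    for line in msg.get_payload().splitlines():
        if line.rstrip() == "--":
            break  # conventional signature delimiter: drop the rest
        if QUOTE_RE.match(line) or ATTRIBUTION_RE.search(line):
            continue  # skip quoted content and "so and so wrote:" lines
        parts.append(line)
    return "\n".join(parts)
```

Because X-Nntp-Gateway isn't in the whitelist, a search for "http" wouldn't match an article merely because it was posted through the web interface.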


Very cool! You mention later the prospect of having 'folders' for things
like 'cool sites' or 'space models' or 'great building ideas' for members
using the web-interface... Would this be something user-configurable?

Yes.


Could I create as many folders as I would want, or would there be a set
number of X folders that would be pre-determined?

Unread, Read, Save, and Trash would probably be the only predefined folders.
All the default names would be renamable to whatever name you want.  (I don't
want them to have to be in English.)  You'll probably be able to set any
bitfields you want on any folder.  Bitfield options would be things like
"this is a trash folder" and "this is an incoming folder."  The Trash folder
would simply be a folder named "Trash" with the "this is a trash folder"
property set.  The Unread folder would simply be a folder named "Unread" that
was configured to receive new articles from some mix of groups you defined.
As soon as you read an article, it would hop to some other folder -- e.g.,
the Read folder or wherever.  The Save folder wouldn't be anything special
-- just a regular folder.  If you deleted something, it would go to the Trash
folder.  If you deleted something from the Trash or emptied the Trash, it
would go away for good (away from your personal lists, that is... it would
still be on the newsserver, naturally).
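
A minimal sketch of how folders with property bits might hang together, in Python. All the class and method names here (FolderFlag, Folder, Mailbox, deliver, mark_read, delete) are hypothetical, and the choice of "Read" as the place articles hop to is just one possible setting.

```python
from dataclasses import dataclass, field
from enum import IntFlag

class FolderFlag(IntFlag):
    NONE = 0
    TRASH = 1      # "this is a trash folder"
    INCOMING = 2   # "this is an incoming folder"

@dataclass
class Folder:
    name: str                        # freely renamable; behavior keys on flags
    flags: FolderFlag = FolderFlag.NONE
    article_ids: set = field(default_factory=set)

class Mailbox:
    def __init__(self):
        self.folders = [
            Folder("Unread", FolderFlag.INCOMING),
            Folder("Read"),
            Folder("Save"),
            Folder("Trash", FolderFlag.TRASH),
        ]
        self.read_target = self.folders[1]  # where read articles hop to

    def _with_flag(self, flag):
        return next(f for f in self.folders if f.flags & flag)

    def deliver(self, article_id):
        self._with_flag(FolderFlag.INCOMING).article_ids.add(article_id)

    def mark_read(self, article_id):
        incoming = self._with_flag(FolderFlag.INCOMING)
        if article_id in incoming.article_ids:
            incoming.article_ids.discard(article_id)
            self.read_target.article_ids.add(article_id)

    def delete(self, article_id):
        trash = self._with_flag(FolderFlag.TRASH)
        if article_id in trash.article_ids:
            # Deleting from the trash removes it from the personal lists only.
            trash.article_ids.discard(article_id)
            return
        for f in self.folders:
            f.article_ids.discard(article_id)
        trash.article_ids.add(article_id)
```

Since nothing looks at the folder's name, renaming "Trash" to "Papierkorb" changes nothing about how deletion behaves.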


Either way (assuming people wouldn't make hundreds of folders for
themselves) it sounds like there's not a space concern (my worst estimates
of millions of posts and thousands of users weren't all that bad
considering).

Ya, with the right data structures, space isn't an issue.  Sparse bitfields
and ID lists are the way to go.
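
For illustration, here is one way the two structures could look in Python. This is only a sketch under assumed names and a 64-bit chunk size, not the structures actually planned: a sorted ID list suits a folder holding a handful of articles, and a sparse bitfield stores bits only for the chunks of the article-number space that are actually in use.

```python
from bisect import bisect_left

class IdList:
    """Sorted list of article IDs; compact when a folder holds few articles."""
    def __init__(self):
        self.ids = []

    def add(self, article_id):
        i = bisect_left(self.ids, article_id)
        if i == len(self.ids) or self.ids[i] != article_id:
            self.ids.insert(i, article_id)

    def __contains__(self, article_id):
        i = bisect_left(self.ids, article_id)
        return i < len(self.ids) and self.ids[i] == article_id

class SparseBitfield:
    """One bit per article, stored only for chunks with at least one bit set."""
    CHUNK_BITS = 64  # assumed chunk size

    def __init__(self):
        self.chunks = {}  # chunk index -> integer used as a 64-bit mask

    def add(self, article_id):
        index, bit = divmod(article_id, self.CHUNK_BITS)
        self.chunks[index] = self.chunks.get(index, 0) | (1 << bit)

    def discard(self, article_id):
        index, bit = divmod(article_id, self.CHUNK_BITS)
        if index in self.chunks:
            value = self.chunks[index] & ~(1 << bit)
            if value:
                self.chunks[index] = value
            else:
                del self.chunks[index]  # drop empty chunks to stay sparse

    def __contains__(self, article_id):
        index, bit = divmod(article_id, self.CHUNK_BITS)
        return bool((self.chunks.get(index, 0) >> bit) & 1)
```

Either way, the storage grows with the number of articles a member has actually filed, not with the total number of articles on the server.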

--Todd


And as long as the interface isn't doing something dynamic
with the folders (making little action icons for each folder on article
listing pages, etc), sounds like there wouldn't be a real time strain
either... A very cool idea!

DaveE




©2005 LUGNET. All rights reserved. - hosted by steinbruch.info GbR