Subject: Re: News search function reactivated (was: News search function temporarily disabled)
Newsgroups: lugnet.admin.general
Date: Wed, 3 Jan 2001 02:48:21 GMT
  
In lugnet.admin.general, David Eaton writes:
In lugnet.admin.general, Todd Lehman writes:
Is that the sort of functionality you're looking for?

Pretty much... actually, I was thinking more along the lines of an advanced
search form though:

Search for: ______________________ (uses +'s and -'s as is)
Search for text in subject line [] (checkbox)
Posted by: _______________________ (uses +'s and -'s... or no symbols, too)
Search only for heads of a thread [] (checkbox)
Posted before: ___ /____ / ____
Posted after: ___ / ____/ ____

But throwing in symbols/wildcards on the command line instead of in a form
works for me too :)

Ya, something like that'd be good to slap on top after the base functionality.
:-)  Nobody wants to *have* to remember how all the squiggly and square brace
thingums in a search box work.  :-)


If you search for

  david eaton <10

then it'll show you things matching "david eaton" that were posted "about 10
days ago" (plus or minus 10 days -- with higher matches given to those that
are closer to the 10-days-ago-point).

So-- looks like if today is day 100, "david eaton <10" would search days
80-100, giving highest precedence to things closest to day 90...

Ya, precisely.  It was originally (summer of '99) a smooth bell-shaped curve

   y = exp(-.5 * x^2)

(x being amount of deviation from the target and y being the output function
giving the fitness value) but that was chewing up 10^-6 seconds in one of the
tight inner loops (i.e., wasting 0.07 CPU seconds on a word like 'lego' with
~70000 hits), so I threw that out and changed it to a linear 1-|x| shaped spike
curve

   y = max(0, 1-|x|)

instead.  That doubled the overall throughput and still gave decent results.

The main advantage of the bell curve y=exp(-.5*x^2) is that y>0 for all x,
but that can also be a disadvantage.  The sharp y=max(0,1-|x|) curve has a
nice sudden cutoff: y=0 for x<=-1 and x>=1.  :-)  In a way I'm kinda glad the
bell curve was so slow to compute.
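
The two curves are easy to compare side by side.  Here's a minimal sketch in
Python (not the original code).  The helper `deviation` and its parameter
names are hypothetical, and taking the width equal to the target for a query
like "<10" is an assumption read off the "plus or minus 10 days" example above.

```python
import math

def bell_fitness(x: float) -> float:
    """Smooth bell curve y = exp(-.5 * x^2); positive for every x."""
    return math.exp(-0.5 * x * x)

def spike_fitness(x: float) -> float:
    """Linear spike y = max(0, 1-|x|); exactly zero once |x| >= 1."""
    return max(0.0, 1.0 - abs(x))

def deviation(age_days: float, target_days: float, width_days: float) -> float:
    """Hypothetical helper: normalized distance of an article's age from the target."""
    return (age_days - target_days) / width_days

# "david eaton <10" on day 100: target is 10 days ago, assumed width 10 days.
today = 100
scores = {
    day: spike_fitness(deviation(today - day, 10, 10))
    for day in (80, 85, 90, 95, 100)
}
```

With the spike curve, day 90 scores highest and days 80 and 100 drop to zero,
whereas the bell curve would still give a small positive score to articles
well outside that window.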


Cool! I think that's actually really useful :)

I'm thinkin' this'll be quite useful for digging up stuff that's "about a
week ago" or "about 3 months ago" or "oh, about 2 years ago" -- when it's
tough to remember an exact date.

Poifect! That's something I had been wanting for a while since I'll know
(for example) that I posted something last spring, but I want to make sure
NOT to search for anything after, say, June, or before March... very cool
indeed :)

Ya, I can't count the number of times I've wanted to go back "about so many
days or weeks" to look for something.  I'll remember something and not know
exactly what date it was posted, but I'll remember roughly how long ago it
was.  So it's effectively doing a fuzzy date search with variable focus (wide,
narrow, etc.).


- Search for articles containing a URL
Not sure how to handle this yet.
Yeah, that's just one of those "Oh, if only" things-- mostly for when I'm
looking for someone who posted a link to their site... occasionally I've
tried to do this by putting "http" on the query string, although it doesn't
rule out posts that give their URLs as "www.foo.com/~blah/cool_page.html".

I'm not sure why I included "http" in the stopword list.  I apologize for
that.  (Probably because it would have generated zillions upon zillions of
word hits.  "com" is the most frequently indexed word here.)

"http" could certainly stand to go back in now that the query engine is so
much faster.  Maybe even other words like "it" and "that" and "the."

Here's a list of stopwords, BTW...are there any you see here that stand out
in your mind as having given you problems in the past?

   a an the it its it's this that what
   i i'm im my we me us you
   do be am is are was can has
   of for from with to in out on off at as if and but or not no have so
   http www

One thing I *really* hate about stopword lists is that they're so language-
centric (i.e., a stopword in one language might be a darn-tootin' regular
good word in another language).

Also, all single-letter words are ignored (i.e., "a" and "i" for English and
"y" for Spanish, etc.).

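To make the two rules above concrete (drop stopwords, drop single-letter
words), here's a rough sketch in Python.  The stopword set is copied from the
list above; the tokenizer regex and the name `index_terms` are assumptions for
illustration, not how the indexer here actually splits words.

```python
import re

# Stopwords copied from the list in this post.
STOPWORDS = frozenset("""
a an the it its it's this that what
i i'm im my we me us you
do be am is are was can has
of for from with to in out on off at as if and but or not no have so
http www
""".split())

def index_terms(text: str) -> list[str]:
    """Return the words that would be indexed or searched for."""
    # Assumed tokenizer: runs of letters, digits, and apostrophes, lowercased.
    words = re.findall(r"[a-z0-9']+", text.lower())
    # Drop single-letter words and anything on the stopword list.
    return [w for w in words if len(w) > 1 and w not in STOPWORDS]
```

Under these assumptions, `index_terms("the best design")` keeps only "best"
and "design", and a URL written without "http" or "www" still leaves its
host name words behind.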

- option to ONLY return the heads of threads
Ahh yes -- I'll put that on the "must do" list.  That won't be too hard once
the thread lists are generated internally for other purposes.
Cool!

I've planned ahead here.  For each group, there'll be a list of the articles
that are the heads-of-threads for that group.  That list can then be used to
generate more compact views into the group, or it can also be fed into the
query engine as an "include only these" filter.  In memory, once loaded, the
article filter lists are 1-bit flags -- 1 bit per article position -- so even
a list of a quarter million articles consumes only 30 KB of memory for the
fraction of a second that it's needed.
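
As a rough illustration of the 1-bit-per-article idea, here's a sketch in
Python.  The class name `ArticleFilter` and its methods are hypothetical, and
the real in-memory layout may differ; this only shows the arithmetic and the
"include only these" use.

```python
class ArticleFilter:
    """One bit per article position, packed eight to a byte."""

    def __init__(self, article_count: int) -> None:
        # Round up so a partial final byte is still allocated.
        self.bits = bytearray((article_count + 7) // 8)

    def include(self, position: int) -> None:
        self.bits[position >> 3] |= 1 << (position & 7)

    def __contains__(self, position: int) -> bool:
        return bool(self.bits[position >> 3] & (1 << (position & 7)))

# A quarter million articles fit in 31,250 bytes (roughly 30 KB).
heads = ArticleFilter(250_000)
for position in (0, 17, 249_999):
    heads.include(position)

# Feed the filter into a result list as an "include only these" pass.
hits = [3, 17, 42, 249_999]
thread_heads_only = [p for p in hits if p in heads]
```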

--Todd




©2005 LUGNET. All rights reserved. - hosted by steinbruch.info GbR