Subject: Re: News search function reactivated (was: News search function temporarily disabled)
Newsgroups: lugnet.admin.general
Date: Wed, 3 Jan 2001 02:48:21 GMT
Viewed: 1501 times

In lugnet.admin.general, David Eaton writes:
> In lugnet.admin.general, Todd Lehman writes:
> > Is that the sort of functionality you're looking for?
>
> Pretty much... actually, I was thinking more along the lines of an advanced
> search form though:
>
> Search for: ______________________ (uses +'s and -'s as is)
> Search for text in subject line [] (checkbox)
> Posted by: _______________________ (uses +'s and -'s... or no symbols, too)
> Search only for heads of a thread [] (checkbox)
> Posted before: ___ /____ / ____
> Posted after: ___ / ____/ ____
>
> But throwing in symbols/wildcards on the command line instead of in a form
> works for me too :)
Ya, something like that'd be good to slap on top after the base functionality.
:-) Nobody wants to *have* to remember how all the squiggly and square brace
thingums in a search box work. :-)
> > If you search for
> >
> > david eaton <10
> >
> > then it'll show you things matching "david eaton" that were posted "about 10
> > days ago" (plus or minus 10 days -- with higher matches given to those that
> > are closer to the 10-days-ago-point).
>
> So-- looks like if today is day 100, "david eaton <10" would search days
> 80-100, giving highest precedence to things closest to day 90...
Ya, precisely. It was originally (summer of '99) a smooth bell-shaped curve
y = exp(-.5 * x^2)
(x being amount of deviation from the target and y being the output function
giving the fitness value) but that was chewing up 10^-6 seconds per hit in one
of the tight inner loops (i.e., wasting 0.07 CPU seconds on a word like 'lego' with
~70000 hits), so I threw that out and changed it to a linear 1-|x| shaped spike
curve
y = max(0, 1-|x|)
instead. That doubled the overall throughput and still gave decent results.
The main advantage of the bell curve y=exp(-.5*x^2) is that y>0 for all x,
but that can also be a disadvantage. The sharp y=max(0,1-|x|) curve has a
nice sudden cutoff at x<-1 and x>1. :-) In a way I'm kinda glad the bell
curve was so slow to compute.
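
For comparison, here are the two curves side by side in Python. This is only a sketch: the function names bell and spike are made up for illustration, and it is not the code the search engine actually runs.

```python
import math

def bell(x: float) -> float:
    """Smooth bell curve y = exp(-0.5 * x^2); positive for every x."""
    return math.exp(-0.5 * x * x)

def spike(x: float) -> float:
    """Linear spike y = max(0, 1 - |x|); exactly zero once |x| >= 1."""
    return max(0.0, 1.0 - abs(x))

# Both peak at 1.0 when x = 0; only the spike reaches a hard zero.
for x in (0.0, 0.5, 1.0, 2.0):
    print(f"x={x:4.1f}  bell={bell(x):.3f}  spike={spike(x):.3f}")
```

The spike needs only a subtraction, an absolute value and a comparison per hit, whereas the bell curve calls exp() every time.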
> cool! I
> think that's actually really useful :)
>
> > I'm thinkin' this'll be quite useful for digging up stuff that's "about a
> > week ago" or "about 3 months ago" or "oh, about 2 years ago" -- when it's
> > tough to remember an exact date.
>
> Poifect! That's something I had been wanting for a while since I'll know
> (for example) that I posted something last spring, but I want to make sure
> NOT to search for anything after, say, June, or before March... very cool
> indeed :)
Ya, I can't count the number of times I've wanted to go back "about so many
days or weeks" to look for something. I'll remember something and not know
exactly what date it was posted, but I'll remember roughly how long ago it
was. So it's effectively doing a fuzzy date search with variable focus (wide,
narrow, etc.).
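
A fuzzy date score along those lines might be sketched as below. The normalization x = (age - target) / target is an assumption inferred from the "plus or minus 10 days" example, not something confirmed here, and date_fitness and the sample ages are hypothetical.

```python
def date_fitness(age_days: float, target_days: float) -> float:
    """Score in [0, 1]: 1.0 at the target age, falling linearly to 0.0
    at ages 0 and 2 * target_days (assumed normalization)."""
    x = (age_days - target_days) / target_days
    return max(0.0, 1.0 - abs(x))

# Hypothetical article ages in days, ranked for a query like "<10".
ages = [2, 8, 10, 15, 25]
ranked = sorted(ages, key=lambda a: date_fitness(a, 10), reverse=True)
print(ranked)

# Wider focus: a 5-day miss costs less at "about 90 days ago"
# than at "about 10 days ago".
print(date_fitness(15, 10), date_fitness(95, 90))
```

Under this assumption the focus widens automatically with the target: "<10" covers roughly ages 0 to 20 days, while "<90" covers roughly 0 to 180.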
> > > - Search for articles containing a URL
> > Not sure how to handle this yet.
> Yeah, that's just one of those "Oh, if only" things-- mostly for when I'm
> looking for someone who posted a link to their site... occasionally I've
> tried to do this by putting "http" on the query string, although it doesn't
> rule out posts that give their URLs as "www.foo.com/~blah/cool_page.html".
I'm not sure why I included "http" in the stopword list. I apologize for
that. (Probably because it would have generated zillions upon zillions of
word hits. "com" is the most frequently indexed word here.)
"http" could certainly stand to go back in now that the query engine is so
much faster. Maybe even other words like "it" and "that" and "the."
Here's a list of stopwords, BTW...are there any you see here that stand out
in your mind as having given you problems in the past?
a an the it its it's this that what
i i'm im my we me us you
do be am is are was can has
of for from with to in out on off at as if and but or not no have so
http www
One thing I *really* hate about stopword lists is that they're so language-
centric (i.e., a stopword in one language might be a darn-tootin' regular
good word in another language).
Also, all single-letter words are ignored (i.e., "a" and "i" for English and
"y" for Spanish, etc.).
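
The filtering described above could be sketched like this in Python. The stopword set is copied from the list above; splitting on whitespace and the name index_terms are assumptions for illustration, since the real tokenizer isn't described here.

```python
STOPWORDS = set("""
a an the it its it's this that what
i i'm im my we me us you
do be am is are was can has
of for from with to in out on off at as if and but or not no have so
http www
""".split())

def index_terms(text: str) -> list[str]:
    """Lowercase, split on whitespace, drop stopwords and one-letter words."""
    return [w for w in text.lower().split()
            if len(w) > 1 and w not in STOPWORDS]

print(index_terms("Y the link is at http www foo com"))
```

With "http" and "www" in the list, a query for either one reduces to nothing, which is why searching for "http" can't find posts containing URLs.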
> > > - option to ONLY return the heads of threads
> > Ahh yes -- I'll put that on the "must do" list. That won't be too hard once
> > the thread lists are generated internally for other purposes.
> Cool!
I've planned ahead here. For each group, there'll be a list of articles that
comprise the heads-of-threads for that group. That list can then be used to
generate more compact views into the group, or it can also be fed into the
query engine as an "include only these" filter. In memory, once loaded, the
article filter lists are 1-bit flags -- 1 bit per article position -- so even
a list of a quarter million articles consumes only 30 KB of memory for the
fraction of a second that it's needed.
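
A 1-bit-per-article filter of that kind can be sketched in a few lines of Python. The class name ArticleFilter, its methods, and the sample positions are all hypothetical; this illustrates the idea and is not the actual implementation.

```python
class ArticleFilter:
    """One bit per article position; a set bit means "include this article"."""

    def __init__(self, n_articles: int) -> None:
        # Round up to whole bytes: 8 article positions per byte.
        self.bits = bytearray((n_articles + 7) // 8)

    def include(self, pos: int) -> None:
        self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, pos: int) -> bool:
        return bool(self.bits[pos >> 3] & (1 << (pos & 7)))

heads = ArticleFilter(250_000)
for pos in (0, 17, 249_999):       # hypothetical thread-head positions
    heads.include(pos)

hits = [5, 17, 300, 249_999]       # hypothetical query matches
thread_heads_only = [p for p in hits if p in heads]
print(len(heads.bits), thread_heads_only)
```

A quarter million positions come to 31,250 bytes, which is the roughly 30 KB figure mentioned above.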
--Todd