Subject: Re: Web interface search results
Newsgroups: lugnet.admin.general
Date: Tue, 24 Aug 1999 10:22:31 GMT
This is an append to a thread started about 8 months ago...
In lugnet.admin.general, Todd Lehman writes:
> Jeremy Sproat <jsproat@geocities.com> writes:
> > Would it be feasable to institute some kind of search language, to
> > search for articles within a date range, within a certain newsgroup, or
> > from a certain poster?
>
> I'll need to make something that indexes all the articles by date/time
> stamp and by poster, but it can be done.
OK, Jeremy!
What I did recently, as a background task, was rebuild the news search
database from scratch, this time including article timestamps and an
increased emphasis on the poster's name.
> I assume you'd want to search
> by someone's real-life name rather than their news-posting name/address,
> right? -- because some people post from multiple addresses. OTOH, some
> people have the same name as other people; maybe being able to search
> both ways is needed.
If you go to the root homepage and type someone's name, you'll see that
the first (best) matches are typically articles posted by that person,
followed by lesser matches which only happen to contain the person's name
in the body of the article. It's a hybrid interleaved weighted matcher
which gives words a higher emphasis the closer they are to the beginning
of the document (sum of 1/(10+n), where n is the position of each hit) but
which also takes the age of the article into account.
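A minimal sketch of the position weighting described above, assuming n is the 0-based word position of each hit and that the document is a simple list of words. The function name and the tokenization are illustrative assumptions, not the actual LUGNET code, and the age-of-article part is left out here.

```python
def position_score(words, query_terms):
    """Sum 1/(10+n) over every word position n where a query term occurs.

    Assumption: n is the 0-based index of the word in the document.
    """
    terms = {t.lower() for t in query_terms}
    return sum(
        1.0 / (10 + n)
        for n, word in enumerate(words)
        if word.lower() in terms
    )


# A hit near the top (e.g. the poster's name) outweighs a later mention.
posted_by = "jeremy sproat writes about droids".split()
mentioned = "some article that much later mentions jeremy".split()
print(position_score(posted_by, ["jeremy"]))
print(position_score(mentioned, ["jeremy"]))
```

The constant 10 keeps the very first words from dominating completely: a hit at position 0 scores 1/10, and the weight falls off slowly from there.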
With the timestamp ordering, you'll notice that the topmost articles also
tend to be the most recent ones. This is because the "target" timeframe is
set to "very recent" by default and the ranker is told that the timeframe is
moderately important (important enough to bias the results toward newer-
first, but not so important as to make the relative term hits irrelevant).
The data structures produced by the indexer do contain enough information
to filter articles on hard date boundaries, but while date ranges are
popular at DejaNews, I always found them incredibly frustrating. So I set
the timestamp ranker up to be a _fuzzy_ match. In an advanced search,
you'll be able to say: find articles with "orange halloween buckets" that
were posted "about a year ago," and that will cause the ranker to favor
older articles over newer articles. If there were articles 12 months
old that matched, then great! -- but if not, then any matching articles 8
months old would still show up (but with a lower score).
Similarly, you might want to look up something from about 2 or 3 weeks
back that you only remember a couple of keywords from. If you can't
remember the exact date, you could tell the search engine to produce
results with a target of "about 3 weeks ago" and any matches in that
approximate time range would be ranked higher than older or newer matches.
To achieve a smooth matching function, the fuzzy timestamp ranking
function uses a bell-shaped curve
    y = exp(-.5 * (delta / sigma) ^ 2)
where 'delta' is the difference between the "target" time and a given
article's timestamp and 'sigma' is the "time uncertainty factor." So a
relatively large sigma produces wide time spreads around the target, and
a small sigma produces narrow ones. In the limit, both an infinitely large
and an infinitely small sigma cause the timestamps to be a non-factor.
The default target time is "right now" (i.e., right when you invoke the
search function) and the default sigma is 1 week, which gives extremely
fresh articles a score boost of 1.0, and articles 1 week old a boost of
.6065, articles 2 weeks old .1353, articles 3 weeks old .0111, etc. The
extra "umph" kicked in by the target-time-matching approaches zero
asymptotically as the time differential approaches infinity. It seems to
work pretty nicely with sigma set to 1 week by default, although I'd like
to make this configurable in an advanced search later, as well as possibly
an option for hard (non-fuzzy) date filtering.
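A minimal sketch of the bell-curve boost, assuming timestamps are plain seconds and sigma is one week. The names `time_boost` and `WEEK` are illustrative assumptions; how the boost is combined with the term score is not shown because the post does not spell it out.

```python
import math

WEEK = 7 * 24 * 3600  # one week in seconds (assumed time unit)


def time_boost(article_time, target_time, sigma=WEEK):
    """Bell-shaped boost: y = exp(-0.5 * (delta / sigma) ** 2)."""
    delta = article_time - target_time
    return math.exp(-0.5 * (delta / sigma) ** 2)


# Boost for articles 0 to 3 weeks older than a "right now" target.
now = 0
for weeks in range(4):
    print(weeks, round(time_boost(now - weeks * WEEK, now), 4))
```

Setting `target_time` to a year in the past and widening `sigma` would give the "about a year ago" behavior: articles near the target get the biggest boost, and ones some months off still show up with a lower one.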
> Searching within a certain newsgroup, however, is already there (it's been
> there from day 1 of the search function) -- check out:
>
> http://www.lugnet.com/news/search/
>
> For the "in Groups" field you can type things like
>
> lugnet.*
> market.*
> loc.uk.*
> robotics
>
> (Note that it automatically prepends "lugnet." onto the front for you if
> you omit it.)
The /news/search/ page went away with the new website reorg last month, but
it might be useful again if it came back as an advanced search page. Note
that searching within categories is now as easy as being in a category and
initiating a search right from there, rather than having to go to a special
search page and type in a gunky regex pattern. (And it was never a real full-
blown regex matcher anyway -- all it understood was a single '*' since its
only purpose in life was to help filter categories.) Of course this part
could always also come back for an advanced search, since the low-level
group filter uses a regex internally. It's just a matter of identifying
which options are priorities and which ones are merely feeping creaturitis.
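A minimal sketch of how a '*'-only group pattern could be turned into a regex for the low-level filter, including the automatic "lugnet." prefix. The function names are hypothetical, and it assumes that a pattern without '*' must match the group name exactly; the real filter may behave differently.

```python
import re


def group_regex(text, prefix="lugnet."):
    """Translate a '*'-only pattern into a compiled regex.

    Assumptions: the prefix is prepended when missing, '*' matches any
    run of characters, and everything else is matched literally.
    """
    if not text.startswith(prefix):
        text = prefix + text
    # Escape the whole pattern, then turn the escaped '*' into '.*'.
    return re.compile(re.escape(text).replace(r"\*", ".*"))


def group_matches(pattern_text, group):
    """True if the whole group name matches the pattern."""
    return group_regex(pattern_text).fullmatch(group) is not None


print(group_matches("loc.uk.*", "lugnet.loc.uk.london"))
print(group_matches("robotics", "lugnet.robotics"))
print(group_matches("market.*", "lugnet.general"))
```

Escaping first and un-escaping only the '*' keeps the dots in group names literal, so "loc.uk.*" cannot accidentally match something like "lugnet.locXuk.foo".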
A recent change in the group filters, BTW, was to fix the searching of
crossposted articles. Before, if something was crossposted as
Newsgroups: lugnet.loc.us.ca.sf,lugnet.general,lugnet.announce
then it would only show up in searches of lugnet.loc.us.ca.sf or higher --
not in lugnet.general or lugnet.announce (d'Oh!). This misfeature was the
result of a misguided attempt at saving space in the index database long before
the notion of restricting the searches to subgroups came up. The indexer now
uses a clever and efficient encoding (space, time, and code!) mechanism to
include crossposting information. :-)
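A minimal sketch of the corrected behavior only: a crossposted article becomes findable under every group in its Newsgroups header and under each group's parent categories. The function name is hypothetical, and this says nothing about the indexer's actual encoding, which isn't described here.

```python
def searchable_groups(newsgroups_header):
    """Return every listed group plus all of its ancestor categories."""
    found = set()
    for group in newsgroups_header.split(","):
        parts = group.strip().split(".")
        # "lugnet.loc.us" yields "lugnet", "lugnet.loc", "lugnet.loc.us".
        for depth in range(1, len(parts) + 1):
            found.add(".".join(parts[:depth]))
    return found


header = "lugnet.loc.us.ca.sf,lugnet.general,lugnet.announce"
for name in sorted(searchable_groups(header)):
    print(name)
```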
> > Also, the results page currently lists: something(?), the article
> > number, and words found on each line. I would also like to be able to
> > see the articles' subject lines and dates posted. You know, to make ego
> > surfing easier. :-,
>
> I didn't have a little library for news articles when I wrote the search
> function the first time, so that's why it's so bare-bones. :) But let
> me see what I can do now. The other filterings will take a more work,
> but this is probably pretty straightforward -- I'll allocate a 1/2 hour
> for it today -- seems like a pretty useful thing now that the searches
> are returning many more results than they used to.
I forget exactly what happened after the above, but whatever it was, it was
tossed out when the new "real" interface was moved into place a few weeks
ago (the one which encases articles in 250-word snippet/preview boxes).
BTW, you can also now use '+' and '-' like AltaVista. (I probably mentioned
this a few weeks ago already, but oh well, here goes again.)
If you enter "+foo" then it means that "foo must exist in the matches; don't
show me things that don't contain foo." If you enter "-foo" then it means
that "foo mustn't exist in the matches; don't show me anything that contains
foo." For example:
    jeremy sproat droideka      (1354 matches)
    jeremy sproat -droideka     (1346 matches)
    jeremy sproat +droideka     (8 matches)
    +droideka -jeremy -sproat   (2 matches)
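A minimal sketch of the '+' and '-' filtering, assuming plain terms are OR'ed together (which fits the counts above: 1346 + 8 = 1354). The function names and the whitespace tokenization are illustrative assumptions, and ranking is left out.

```python
def parse_query(query):
    """Split a query into required (+), excluded (-) and plain terms."""
    required, excluded, plain = set(), set(), set()
    for token in query.lower().split():
        if token.startswith("+") and len(token) > 1:
            required.add(token[1:])
        elif token.startswith("-") and len(token) > 1:
            excluded.add(token[1:])
        else:
            plain.add(token)
    return required, excluded, plain


def matches(words, query):
    """True if the article's words satisfy the query's +/- constraints."""
    required, excluded, plain = parse_query(query)
    present = {w.lower() for w in words}
    if not required <= present:   # every +term must appear
        return False
    if excluded & present:        # no -term may appear
        return False
    # With no +terms, at least one plain term has to hit.
    return bool(required) or bool(plain & present)


article = "jeremy sproat built a droideka".split()
print(matches(article, "jeremy sproat +droideka"))
print(matches(article, "jeremy sproat -droideka"))
```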
Welp, there you have it. Lemme know if it rocks or if it sucks.
--Todd