Subject: 
News search function temporarily disabled
Newsgroups: 
lugnet.admin.general, lugnet.off-topic.geek, lugnet.announce
Followup-To: 
lugnet.admin.general, lugnet.off-topic.geek
Date: 
Mon, 11 Dec 2000 17:34:24 GMT
Highlighted: 
! (details)
Viewed: 
3031 times
  
The system crashed again.  Either it's running out of memory and that's
causing a downward spiral, or it's running out of CPU cycles and starving
processes until enough of them build up to cause a meltdown.  Either way,
some tuning needs to be done.

This is going to be my top LUGNET priority until things stabilize -- they've
been shaky for the past couple of weeks.  I need to find the bottlenecks and
eliminate them.

As a first step, I've turned off the text-search function for news.  I
apologize for any inconvenience this may cause, but it will help me see
how much of a bottleneck it was.  Its algorithm for merging results
is grossly inadequate for some of the queries it processes.

As a second step, I'm going to install a monitoring log that records the
elapsed time and CPU time of every dynamically generated webpage.  From
those results, the bottlenecks will stand out like a fistful of sore thumbs.
(Fixing them is a different matter.)
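A monitoring hook of this sort can be sketched as follows.  This is a minimal
illustration in Python, not the actual script described above; the function
name, log format, and rendering callable are all hypothetical:

```python
# Hypothetical per-page timing wrapper (names and log format are illustrative).
import os
import time

def log_page_timing(logfile, page_name, render):
    """Run a page-rendering callable; append elapsed and CPU time to a log."""
    wall_start = time.time()
    cpu_start = sum(os.times()[:2])        # user + system CPU of this process
    output = render()                      # generate the page
    wall = time.time() - wall_start
    cpu = sum(os.times()[:2]) - cpu_start
    with open(logfile, "a") as f:
        f.write("%s\t%.3f\t%.3f\n" % (page_name, wall, cpu))
    return output
```

Sorting such a log by the CPU column, weighted by request counts, is what makes
the bottlenecks "stand out".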

I'm taking lunch at work right now; I'll be able to install the monitoring
script quickly and then I'll have a look at the stored results later tonight.
I probably won't be able to re-activate the text-search function of news
until after replacing it with something that has better worst-case
performance.  There's a chance I may have to disable other things, like the
Mosaic Maker (if it's being used and contributing to the bottleneck -- I
don't know yet whether it is).

Sometime before the end of December, I plan also to upgrade the RAM on the
system.  In Q1 of 2001, we'll add another physical box as well.

--Todd


Subject: 
Re: News search function temporarily disabled
Newsgroups: 
lugnet.admin.general, lugnet.off-topic.geek
Date: 
Wed, 13 Dec 2000 23:34:43 GMT
Viewed: 
642 times
  
<nod nod> Thanks for letting us know.

I'm no geek by any means, but first things first: I would limit the search.
Sometimes a single search brings up thousands of posts.  I suggest that if
the search is too broad, say so and make the searcher narrow it down a bit.

HTH, and keep up the good work,
-Shiri


Subject: 
Re: News search function temporarily disabled
Newsgroups: 
lugnet.admin.general, lugnet.off-topic.geek
Date: 
Sun, 17 Dec 2000 01:26:09 GMT
Viewed: 
684 times
  
In lugnet.admin.general, Todd Lehman writes:

As a first step, I've turned off the text-search function for news.  I
apologize for any inconvenience this may cause, but it will help me see
how much of a bottleneck it was.  Its algorithm for merging results
is grossly inadequate for some of the queries it processes.

Hi Todd,

Did you ever consider a Google search box?  Like this one:
www.sis.pitt.edu/~dist

I don't know for sure, but I think that the search is done on Google's server.

Toki


Subject: 
Re: News search function temporarily disabled
Newsgroups: 
lugnet.admin.general, lugnet.off-topic.geek
Date: 
Sun, 17 Dec 2000 01:36:12 GMT
Viewed: 
722 times
  
In lugnet.admin.general, Toki Barron writes:
Hi Todd,
Did you ever consider a google search box?  Like this one:
www.sis.pitt.edu/~dist

That's an intriguing thought.  It looks like it requires Google to be able
to crawl a site completely, though.  I wonder if they play well with
dynamically generated content...

I don't know for sure, but I think that the search is done on google's
server.

Yes, it is.

--Todd


Subject: 
Re: News search function temporarily disabled
Newsgroups: 
lugnet.admin.general, lugnet.off-topic.geek
Date: 
Sun, 17 Dec 2000 01:56:53 GMT
Viewed: 
758 times
  
Todd Lehman wrote:

In lugnet.admin.general, Toki Barron writes:
Hi Todd,
Did you ever consider a google search box?  Like this one:
www.sis.pitt.edu/~dist

That's an intriguing thought.  It looks like it requires Google to be able
to crawl a site completely, though.  I wonder if they play well with
dynamically generated content...

I've had nothing but problems with www.hort.net/gallery/, which is all
dynamically generated.  Every month at about the same time we disappear
entirely out of Google's search engine, which they claim is due to updates.
Although it usually only lasts a couple of days, the last one has lasted
about six weeks.  They can't seem to explain why it happens, either.

Anyhow, I guess my point is that I don't find them to be particularly
reliable.

Chris


Subject: 
Re: News search function temporarily disabled
Newsgroups: 
lugnet.admin.general
Date: 
Sun, 17 Dec 2000 17:34:18 GMT
Viewed: 
776 times
  
In lugnet.admin.general, Todd Lehman writes:
[...]
As a second step, I'm going to install a monitoring log that records the
elapsed time and CPU time of every dynamically generated webpage.  From
those results, the bottlenecks will stand out like a fistful of sore thumbs.
(Fixing them is a different matter.)
[...]

Certain bottlenecks are indeed standing out like sore thumbs.  One big one is
the dynamic generation of the /shop/ pages on guide.lugnet.com, and another
is the dynamic generation of member-specific set lists.  An even bigger one
is (was) the dynamic generation of member listings (people) such as:

   http://www.lugnet.com/people/members/byfirstname.cgi

That was taking several seconds of CPU and in some cases 200+ seconds of
transmission time to people on slow modems.  I can't do anything about the
transmission time without reducing the page size, but for now I changed it
to cache the content for faster page generation.  Holding a large chunk of
data in memory for 200 elapsed-time seconds while generating a page is bad,
even if only a fraction of a CPU second is used.
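The caching change described here can be sketched roughly as below.  The file
path, staleness window, and function names are assumptions for illustration,
not LUGNET's actual setup:

```python
# Illustrative sketch: serve a pre-rendered copy of an expensive page and
# regenerate it only when the cached copy is older than max_age_seconds.
import os
import time

def cached_page(cache_path, max_age_seconds, render):
    """Return cached content if fresh; otherwise regenerate and store it."""
    try:
        if time.time() - os.path.getmtime(cache_path) < max_age_seconds:
            with open(cache_path) as f:
                return f.read()
    except OSError:
        pass                               # no cached copy yet
    content = render()                     # expensive generation step
    with open(cache_path, "w") as f:
        f.write(content)
    return content
```

The win is that the expensive generation happens once per staleness window
instead of once per request, regardless of how slowly clients download.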

Another bottleneck is that there seems to be a strong correlation between the
CPU time to generate a page for a LEGO set and the number of people who've
given comments or supplied personal data about the set.  For example, this
page:

   http://guide.lugnet.com/set/7130

consumes 3 times as many CPU cycles as this page:

   http://guide.lugnet.com/set/1462

or about 50 times as many CPU cycles above and beyond the regular overhead
for a set page.

A quick gnuplot graph with number of people on the x-axis and CPU time on
the y-axis showed a nearly straight line correlation.  That's good info for
me to know.  It ought to be a gradual logarithmic-looking curve rather
than a straight line.
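The straight-line check can be reproduced with an ordinary least-squares fit.
The data points below are invented purely to illustrate the method; the post
gives no raw numbers:

```python
# Illustrative least-squares fit, mirroring the gnuplot check described above.
def fit_line(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical (people, CPU-seconds) samples lying on cpu = 0.02 * p + 0.1:
people = [1, 5, 10, 20, 50]
cpu = [0.02 * p + 0.1 for p in people]
slope, intercept = fit_line(people, cpu)
```

A near-constant slope across the whole range is exactly what a "nearly straight
line" on the plot means: cost grows linearly with the number of contributors.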

This isn't anyone's fault[1] and it doesn't mean that people need to stop
entering data -- it just means there's an identifiable inefficiency which
I need to find a way to eliminate.  A bottleneck like this won't get better
with more RAM or a faster CPU -- it's a coding issue in need of algorithmic
adjustments rather than quantitative system tuning.

Some of these bottlenecks amount to a significant portion of the CPU-cycles
pie.  For example, the member listings consume(d) significant CPU cycles but
aren't accessed anywhere near as often as the set guide pages; the bottleneck
in the set guide pages is small, but adds up to a lot of wasted CPU cycles
overall.  Across all dynamically generated pages served, the single-set-view
guide pages together account for 23% and multi-set-view guide pages together
account for 18%.  I estimate that one to two hours per day of CPU cycles are
currently being wasted due to this bottleneck alone (based on a desired
page-generation time of 0.2 seconds (average) per page).
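The arithmetic behind an estimate like this is simple to reproduce.  The
request count and average CPU time below are made-up placeholders chosen only
to land in the stated range, since the post doesn't give the real figures:

```python
# Back-of-envelope check of the "one to two hours per day" estimate.
# guide_views_per_day and avg_cpu_per_page are ASSUMED values for illustration;
# only the 0.2-second target comes from the post.
guide_views_per_day = 24_000            # assumed daily request count
avg_cpu_per_page = 0.4                  # assumed average CPU seconds per page
target_cpu_per_page = 0.2               # stated target in the post

wasted_seconds = (avg_cpu_per_page - target_cpu_per_page) * guide_views_per_day
wasted_hours = wasted_seconds / 3600
```

With these placeholder figures, the excess comes to 4800 CPU-seconds, about
1.3 hours per day, consistent with the "one to two hours" range above.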

The news homepage <http://news.lugnet.com/> is also a bottleneck, averaging
a very embarrassing 0.537 seconds of CPU cycles per invocation, but in the
end that amounts to only about 1000 seconds of CPU time per day, so it's
not a true bottleneck at its current rate of requests (~2000 page views per
day).  It *could* turn out to be a bottleneck someday if usage during peak
hours went up significantly, but for now it's not a real issue.

The main LUGNET homepage <http://www.lugnet.com/> sees two to three times
as much traffic as the LUGNET News homepage but is also three times more
efficient in terms of memory and CPU cycles, so I'm not even remotely
worried about it yet.  Everything else falls into smaller groups of page types.

The biggies for now are the set guide pages and, of course, bringing back
the news search functionality.  I think I can live with the inefficiencies
of the set guide pages for a few more weeks, because their worst-case
performance is still deterministic.  The worst-case performance of the
news-search function, however, was non-deterministic.

I think I know of a way to make the search much more memory-efficient and
later more time-efficient without radically ripping things apart.  If a
miracle occurs, the news search function might be back sometime this week.

--Todd

[1] except mine, of course


Subject: 
Re: News search function temporarily disabled
Newsgroups: 
lugnet.admin.general
Date: 
Sun, 17 Dec 2000 23:21:23 GMT
Viewed: 
759 times
  
Todd Lehman wrote:
Certain bottlenecks are indeed standing out like sore thumbs.  One big one is
the dynamic generation of the /shop/ pages on guide.lugnet.com, and another
is the dynamic generation of member-specific set lists.  An even bigger one
is (was) the dynamic generation of member listings (people) such as:

   http://www.lugnet.com/people/members/byfirstname.cgi

That was taking several seconds of CPU and in some cases 200+ seconds of
transmission time to people on slow modems.  I can't do anything about the
transmission time without reducing the page size, but for now I changed it
to cache the content for faster page generation.  Holding a large chunk of
data in memory for 200 elapsed-time seconds while generating a page is bad,
even if only a fraction of a CPU second is used.

How about breaking down the member pages into pages split by first
letter of first or last name, and by country/state? I'm not sure if your
"caching" means keeping the indexes by each sort but it would seem this
need not be true dynamic content. You don't get that many updates of
that content.

Another bottleneck is that there seems to be a strong correlation between the
CPU time to generate a page for a LEGO set and the number of people who've
given comments or supplied personal data about the set.  For example, this
page:

   http://guide.lugnet.com/set/7130

consumes 3 times as many CPU cycles as this page:

   http://guide.lugnet.com/set/1462

or about 50 times as many CPU cycles above and beyond the regular overhead
for a set page.

A quick gnuplot graph with number of people on the x-axis and CPU time on
the y-axis showed a nearly straight line correlation.  That's good info for
me to know.  It ought to be a gradual logarithmic-looking curve rather
than a straight line.

This isn't anyone's fault[1] and it doesn't mean that people need to stop
entering data -- it just means there's an identifiable inefficiency which
I need to find a way to eliminate.  A bottleneck like this won't get better
with more RAM or a faster CPU -- it's a coding issue in need of algorithmic
adjustments rather than quantitative system tuning.

How about making it optional if the personal data is included? Most
times I don't need it if my purpose for looking up the set is just to
see what the heck is that set. You can have a button on the bottom of
the page which says "show comments and personal data".

The biggies for now are the set guide pages and, of course, bringing back
the news search functionality.  I think I can live with the inefficiencies
of the set guide pages for a few more weeks, because their worst-case
performance is still deterministic.  The worst-case performance of the
news-search function, however, was non-deterministic.

I think I know of a way to make the search much more memory-efficient and
later more time-efficient without radically ripping things apart.  If a
miracle occurs, the news search function might be back sometime this week.

Could you conveniently reduce the cost of the news search by offering an
option to "limit search to the previous X posts" (where you pick one or
perhaps a handful of values for X, preferably defined by time span rather
than by post count; i.e., "limit search to the past week" would be
preferable to "limit search to the past 2000 posts")?  Such search
constraints could certainly be useful and might constrain the search
enough to provide a serious performance improvement.

Frank


Subject: 
Re: News search function temporarily disabled
Newsgroups: 
lugnet.admin.general
Date: 
Mon, 18 Dec 2000 00:37:38 GMT
Viewed: 
856 times
  
On Sun, Dec 17, 2000 at 11:21:23PM +0000, Frank Filz wrote:
Todd Lehman wrote:
Certain bottlenecks are indeed standing out like sore thumbs.  One big one is
the dynamic generation of the /shop/ pages on guide.lugnet.com, and another
is the dynamic generation of member-specific set lists.  An even bigger one
is (was) the dynamic generation of member listings (people) such as:

   http://www.lugnet.com/people/members/byfirstname.cgi

That was taking several seconds of CPU and in some cases 200+ seconds of
transmission time to people on slow modems.  I can't do anything about the
transmission time without reducing the page size, but for now I changed it
to cache the content for faster page generation.  Holding a large chunk of
data in memory for 200 elapsed-time seconds while generating a page is bad,
even if only a fraction of a CPU second is used.

How about breaking down the member pages into pages split by first
letter of first or last name, and by country/state? I'm not sure if your
"caching" means keeping the indexes by each sort but it would seem this
need not be true dynamic content. You don't get that many updates of
that content.

I think it's more of a problem of having to have the page in memory (all
300k of it) while a slow client downloads it.  Nothing much to do about
that except make the page smaller - perhaps move all the FONT FACE entries
to an all-encompassing FONT tag - that should reduce the page size by about
10% or so...

Another bottleneck is that there seems to be a strong correlation between the
CPU time to generate a page for a LEGO set and the number of people who've
given comments or supplied personal data about the set.  For example, this
page:

   http://guide.lugnet.com/set/7130

consumes 3 times as many CPU cycles as this page:

   http://guide.lugnet.com/set/1462

or about 50 times as many CPU cycles above and beyond the regular overhead
for a set page.

A quick gnuplot graph with number of people on the x-axis and CPU time on
the y-axis showed a nearly straight line correlation.  That's good info for
me to know.  It ought to be a gradual logarithmic-looking curve rather
than a straight line.

This isn't anyone's fault[1] and it doesn't mean that people need to stop
entering data -- it just means there's an identifiable inefficiency which
I need to find a way to eliminate.  A bottleneck like this won't get better
with more RAM or a faster CPU -- it's a coding issue in need of algorithmic
adjustments rather than quantitative system tuning.

How about making it optional if the personal data is included? Most
times I don't need it if my purpose for looking up the set is just to
see what the heck is that set. You can have a button on the bottom of
the page which says "show comments and personal data".

Also, this data isn't that dynamic anyway - couldn't a static page be
autogenerated every time a member updates their entry?

The biggies for now are the set guide pages and, of course, bringing back
the news search functionality.  I think I can live with the inefficiencies
of the set guide pages for a few more weeks, because their worst-case
performance is still deterministic.  The worst-case performance of the
news-search function, however, was non-deterministic.

I think I know of a way to make the search much more memory-efficient and
later more time-efficient without radically ripping things apart.  If a
miracle occurs, the news search function might be back sometime this week.

Could you conveniently reduce the cost of the news search by offering an
option to "limit search to the previous X posts" (where you pick one or
perhaps a handful of values for X, preferably defined by time span rather
than by post count; i.e., "limit search to the past week" would be
preferable to "limit search to the past 2000 posts")?  Such search
constraints could certainly be useful and might constrain the search
enough to provide a serious performance improvement.

I agree that having a "search X days back" option is a good function...  Also,
I don't know how the search is run anyway - is it indexed beforehand, or
run on the full data every time?

--
Dan Boger / dan@peeron.com / www.peeron.com / ICQ: 1130750
<set:9265_1>:  LEGO DACTA Roof Tiles (DACTA/SYSTEM/Supplemental), '98, 250 pcs


Subject: 
Re: News search function temporarily disabled
Newsgroups: 
lugnet.admin.general
Date: 
Mon, 18 Dec 2000 00:43:46 GMT
Viewed: 
803 times
  
In lugnet.admin.general, Dan Boger writes:
I agree that having a "search X days back" option is a good function...  Also,
I don't know how the search is run anyway - is it indexed beforehand, or
run on the full data every time?

It's (currently) a half-gigabyte index that gets indexed in realtime (once per
minute).
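An index maintained this way might look, in miniature, like the sketch below.
The dictionary structure and tokenization are assumptions for illustration;
the actual half-gigabyte on-disk format isn't described here:

```python
# Minimal sketch of incremental inverted indexing: new articles are folded
# into an existing word -> sorted-article-id index once per minute.
import re

def index_articles(index, articles):
    """Fold new (article_id, text) pairs into an inverted index, in place."""
    for art_id, text in articles:
        # index each distinct word in the article once
        for word in set(re.findall(r"[a-z0-9]+", text.lower())):
            index.setdefault(word, []).append(art_id)
    return index
```

Because only new articles are touched, the per-minute update cost stays
proportional to new traffic rather than to the size of the whole corpus.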

--Todd


Subject: 
Re: News search function temporarily disabled
Newsgroups: 
lugnet.admin.general
Date: 
Mon, 18 Dec 2000 01:08:18 GMT
Viewed: 
779 times
  
Dan Boger wrote:

On Sun, Dec 17, 2000 at 11:21:23PM +0000, Frank Filz wrote:
Todd Lehman wrote:
Certain bottlenecks are indeed standing out like sore thumbs.  One big one is
the dynamic generation of the /shop/ pages on guide.lugnet.com, and another
is the dynamic generation of member-specific set lists.  An even bigger one
is (was) the dynamic generation of member listings (people) such as:

   http://www.lugnet.com/people/members/byfirstname.cgi

That was taking several seconds of CPU and in some cases 200+ seconds of
transmission time to people on slow modems.  I can't do anything about the
transmission time without reducing the page size, but for now I changed it
to cache the content for faster page generation.  Holding a large chunk of
data in memory for 200 elapsed-time seconds while generating a page is bad,
even if only a fraction of a CPU second is used.

How about breaking down the member pages into pages split by first
letter of first or last name, and by country/state? I'm not sure if your
"caching" means keeping the indexes by each sort but it would seem this
need not be true dynamic content. You don't get that many updates of
that content.

I think it's more of a problem of having to have the page in memory (all
300k of it) while a slow client downloads it.  Nothing much to do about
that except make the page smaller - perhaps move all the FONT FACE entries
to an all encompassing FONT tag - that should reduce the page size by about
10% or so...

My idea was that splitting the list into a page for each letter (or perhaps
groups of 2-4 letters) or by country/state would cut the page size down
dramatically.  Almost every time I've gone to look at the list, I'm either
looking for a specific person or looking for which members are from a
specific state.

Frank


Subject: 
Re: News search function temporarily disabled
Newsgroups: 
lugnet.admin.general
Date: 
Mon, 18 Dec 2000 01:11:23 GMT
Viewed: 
860 times
  
Todd Lehman wrote:

In lugnet.admin.general, Dan Boger writes:
I agree that having a "search X days back" option is a good function...  Also,
I don't know how the search is run anyway - is it indexed beforehand, or
run on the full data every time?

It's (currently) a half-gigabyte index that gets indexed in realtime (once per
minute).

How long does it take to generate the index? Can you subdivide the index
at all (and do something like backups, where you keep incremental indices
and regenerate the complete index once a day or once an hour)? If some
sort of incremental index was used, could you gain anything by not
running it until a search is done? Just some thoughts, but probably not
all applicable, depending on how it's done.

Frank


Subject: 
Re: News search function temporarily disabled
Newsgroups: 
lugnet.admin.general
Date: 
Mon, 18 Dec 2000 01:19:42 GMT
Viewed: 
841 times
  
In lugnet.admin.general, Dan Boger writes:

I think it's more of a problem of having to have the page in memory (all
300k of it) while a slow client downloads it.  Nothing much to do about
that except make the page smaller - perhaps move all the FONT FACE entries
to an all-encompassing FONT tag - that should reduce the page size by about
10% or so...

What about using a style sheet? Would that help? I admit that I am not sure
exactly which versions of the big two browsers correctly support style
sheets (I ought to know this, since I have been using them a lot in recent
web development projects)...

++Lar


Subject: 
Re: News search function temporarily disabled
Newsgroups: 
lugnet.admin.general
Date: 
Mon, 18 Dec 2000 01:57:04 GMT
Viewed: 
1073 times
  
In lugnet.admin.general, Frank Filz writes:
How long does it take to generate the index?

Zero time.  It's done continuously as a background process.  Once per minute,
any new article is added into the mix.


Can you subdivide the index
at all (and do something like backups, where you keep incremental indices
and regenerate the complete index once a day or once an hour)? If some
sort of incremental index was used, could you gain anything by not
running it until a search is done? Just some thoughts, but probably not
all applicable, depending on how it's done.

This kind of index is more efficient to do as soon as something new appears,
as opposed to a kind of index where content changes periodically and
reindexing is a painful hit.  It's not the indexing that's currently the
bottleneck -- it's the way the query interpreter uses the data from the index
that's the bottleneck.
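For context on what "merging results" involves: an AND query typically
intersects the sorted postings lists of its terms, and walking the lists in
lockstep keeps the cost linear in their combined length, whereas a naive
approach that materializes and re-sorts everything can blow up on worst-case
queries.  This is a generic sketch, not LUGNET's actual query interpreter:

```python
# Intersect two ascending lists of article ids in O(len(a) + len(b)).
def intersect_postings(a, b):
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])          # id present in both lists
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1                    # advance the list that is behind
        else:
            j += 1
    return out
```

Multi-term queries repeat this pairwise, smallest lists first, so the working
set shrinks at each step.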

--Todd


Subject: 
Re: News search function temporarily disabled
Newsgroups: 
lugnet.admin.general
Date: 
Mon, 18 Dec 2000 02:44:26 GMT
Viewed: 
896 times
  
In lugnet.admin.general, Larry Pieniazek writes:
I think it's more of a problem of having to have the page in memory (all
300k of it) while a slow client downloads it.  Nothing much to do about
that except make the page smaller - perhaps move all the FONT FACE entries
to an all-encompassing FONT tag - that should reduce the page size by about
10% or so...

What about using a style sheet? Would that help? I admit that I am not sure
exactly which versions of the big two browsers correctly support style
sheets (I ought to know this, since I have been using them a lot in recent
web development projects)...

Yikes!

Holy speculation, Batman!

Thanks, but no thanks!  Guys, this is nice and all, but you're guessing way
off in left field.  On the pages I was talking about, the actual webpage is
not held in memory -- ever -- it's written directly to a buffered, blocking
I/O socket as soon as it's generated.  It's not stored anywhere on the server
-- not on disk, not in memory, even temporarily.  What I was referring to as
being held in memory for 200 seconds was some non-HTML data structures.
Their size isn't insane, but it is uncomfortable and somewhat wasteful.
Breaking that bottleneck has about as much to do with style sheets as
a floormat has to do with salad dressing.

--Todd


Subject: 
Re: News search function temporarily disabled
Newsgroups: 
lugnet.admin.general
Date: 
Mon, 18 Dec 2000 03:00:42 GMT
Viewed: 
917 times
  
In lugnet.admin.general, Todd Lehman writes:
Breaking that bottleneck has about as much to do with style sheets as
a floormat has to do with salad dressing.

No need to get snippy. We're just trying to help.

We don't have access to the source or even the design documentation so all
we can do is speculate based on what you say, and if our speculation is wide
of the mark, so be it. It's not our fault. (whether it's *appropriate* for
us to have access to the source is an entirely different question, I
continue to lean in the "no it isn't" direction...)

I bet style sheets would still cut down the volume of data transmitted (a
bit) and the computational complexity of page generation (a fair bit),
though.

++Lar


Subject: 
Re: News search function temporarily disabled
Newsgroups: 
lugnet.admin.general
Date: 
Mon, 18 Dec 2000 03:44:47 GMT
Viewed: 
963 times
  
In lugnet.admin.general, Larry Pieniazek writes:
In lugnet.admin.general, Todd Lehman writes:
Breaking that bottleneck has about as much to do with style sheets as
a floormat has to do with salad dressing.

No need to get snippy.

I'm sorry.  I know.  You're right.

It just always happens.  Someone says something vague, someone else guesses
something about it, someone else adds another guess, someone else adds another
guess, and before you know it, everything's way off track.  I hate it when
that happens and I don't have time to jump in and respond until it's too late.


We're just trying to help.

Welp...don't take this the wrong way, but my post earlier was simply to give
an update.  It lacked detail because I wasn't asking for help.  If someone
was here FTF and we had a whiteboard and lots of time to talk and stuff, that
would totally rock, but a news discussion filled with speculations is just a
distraction for me.  (Please don't take that the wrong way.  I appreciate the
kind thoughts and attempts at helping, but they don't always help.)

I feel like I've got a solid handle on most of the bottleneck issues.  They
mostly are SMOPs at this point.  I may need to begin using mod_perl or
FastCGI, or restructure portions using a set of custom back-end special-
purpose HTTP servers proxied on the localhost, to really eke out the last
drops, but big gains can be had with other more immediate simplifications.


We don't have access to the source or even the design documentation so all
we can do is speculate based on what you say, and if our speculation is wide
of the mark, so be it.  It's not our fault.

No, it isn't your fault.  I just got disgruntled because the discussion
was getting into issues that are great for the .publish group but no longer
have anything to do with actual CPU efficiency (which was the issue).

--Todd


Subject: 
Re: News search function temporarily disabled
Newsgroups: 
lugnet.admin.general
Date: 
Mon, 18 Dec 2000 05:10:25 GMT
Viewed: 
958 times
  
Todd Lehman wrote:
It just always happens.  Someone says something vague, someone else guesses
something about it, someone else adds another guess, someone else adds another
guess, and before you know it, everything's way off track.  I hate it when
that happens and I don't have time to jump in and respond until it's too late.

Sorry for having contributed to wasted bandwidth, though I do hope you
will give some thought to the ideas of splitting up the members list as
I had suggested, and to some kind of "recentness" limit on the search.
I realize that the underlying data structures may not let those operations
save CPU time (and they could possibly be more time consuming, depending
on implementation); I just feel they would be useful features for us, the
users.

Frank


Subject: 
Re: News search function temporarily disabled
Newsgroups: 
lugnet.admin.general
Date: 
Mon, 18 Dec 2000 06:08:07 GMT
Viewed: 
986 times
  
In lugnet.admin.general, Frank Filz writes:
Sorry for having contributed to wasted bandwidth, though I do hope you
will give some thought to the ideas of splitting up the members list as
I had suggested,

Yes, definitely!  And the flags can link to country pages that list just the
people in the country.

and to some kind of "recentness" limit on the search.  I realize that the
underlying data structures may not let those operations save CPU time (and
they could possibly be more time consuming, depending on implementation);
I just feel they would be useful features for us, the users.

It won't incur extra overhead (amazingly, it will actually reduce
overhead) to restrict things to hard or soft time ranges, so I'm going ahead
with this:

http://news.lugnet.com/admin/general/?n=2701

because it's an excellent opportunity to take that step.  The guts have been
there since August, 1999 -- it just needs to be passed a pair of time range
parameters, which means making an advanced search page with some widgets for
selecting times.
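A hard time-range restriction of the kind described can be sketched
generically.  The names and data shapes below are assumptions; the real
engine presumably applies the range inside the index rather than as a
post-filter:

```python
# Keep only search hits whose article timestamp falls inside [t_start, t_end].
# 'timestamps' maps article id -> posting time (e.g., seconds since the epoch).
def restrict_to_range(hits, timestamps, t_start, t_end):
    return [h for h in hits if t_start <= timestamps[h] <= t_end]
```

Applied early, a restriction like this shrinks the candidate set before any
expensive ranking work, which is why it can reduce rather than add overhead.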

There may also be an implicit improvement in word proximity matching (it will
automatically favor close-proximity multiword matches over disparate-proximity
multiword matches).  Although I don't forsee adding anything anytime soon as
rigorous as a "this exact string" sort of thing, proximity matching comes
pretty darn close to that in pratice.  (LUGNET's news search engine had this
up until August of 1999.  Unfortunately, it used a lot of memory, and I
removed it while applying other improvements at that time.  I believe I'll be
able to add it back in without incurring a large memory or CPU time penalty if
I interleave the multiword results carefully.)
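One common way to score proximity, shown purely as an illustration (the post
doesn't specify the actual algorithm), is to find the smallest window of word
positions that covers one occurrence of each query word; smaller windows rank
higher:

```python
# Smallest window (in word positions) covering one position from each list.
# pos_lists holds, per query word, the ascending positions where it occurs
# in one document.  Interleaves the lists with a heap, as alluded to above.
import heapq

def min_span(pos_lists):
    # seed the heap with the first occurrence of each word
    heads = [(lst[0], k, 0) for k, lst in enumerate(pos_lists)]
    heapq.heapify(heads)
    hi = max(p for p, _, _ in heads)
    best = hi - min(p for p, _, _ in heads)
    while True:
        lo, k, idx = heapq.heappop(heads)   # word occurrence furthest left
        best = min(best, hi - lo)
        if idx + 1 == len(pos_lists[k]):
            return best                     # that word has no later occurrence
        nxt = pos_lists[k][idx + 1]
        hi = max(hi, nxt)
        heapq.heappush(heads, (nxt, k, idx + 1))
```

The cost is linear-logarithmic in the total number of occurrences, which fits
the goal of adding proximity back without a large memory or CPU penalty.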

--Todd


Subject: 
News search function reactivated (was: News search function temporarily disabled)
Newsgroups: 
lugnet.admin.general, lugnet.off-topic.geek, lugnet.announce
Followup-To: 
lugnet.admin.general, lugnet.off-topic.geek
Date: 
Tue, 2 Jan 2001 07:46:46 GMT
Highlighted: 
!! (details)
Viewed: 
5916 times
  
The LUGNET News search function is now re-enabled.  I completely revamped
the index data structures and list-merge algorithm and rewrote the core
query engine in C.  It's a much more solid implementation.

Everyone's patience during the outage is much appreciated!


Functional improvements:

* Word proximity sensitivity -- two or more words closer together match better
  than the same words far apart.

* Word-order sensitivity -- words in a specific order match better than the
  same words out of order.  For example, "new lego" returns different matches
  than "lego new" (try it!).

* And you can still prefix words with + or - to require inclusion or
  exclusion, respectively.
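The +/- convention can be parsed along these lines.  This is a minimal
sketch; the real parser's behavior on edge cases (bare "+", quoting, etc.)
is not described here:

```python
# Split a query string into required (+word), excluded (-word), and
# optional terms, per the prefix convention described above.
def parse_query(q):
    required, excluded, optional = [], [], []
    for tok in q.split():
        if tok.startswith("+") and len(tok) > 1:
            required.append(tok[1:])
        elif tok.startswith("-") and len(tok) > 1:
            excluded.append(tok[1:])
        else:
            optional.append(tok)
    return required, excluded, optional
```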


Cosmetic improvements:

* Graphical horizontal bar showing match rankings.  (Close matches appear
  with wider bars than lesser matches.)

* More streamlined and easy-to-read results header.


Internal improvements:

* Approximately 100 times faster (once the query engine receives the
  request).  Typical CPU utilization is less than 0.1 seconds even for
  queries that generate tens of thousands of word hits.  (Actual times
  may vary depending on disk activity as word-hit lists are accessed.)

* The query engine can take any arbitrary list of news articles as a search
  filter (include or exclude).  This is how subgroup-searches are handled
  now and will pave the way for cooler things later.


To do soon:

* Implement date range restrictions on searches.  Currently the search covers
  the entire corpus of documents and assigns equal date-weight to all documents
  regardless of age.  This is actually working now at the inner levels and
  at the URL level, but there is not yet a forms-based "advanced search"
  user interface for specifying a target date and proximity.


To do someday:

* Facilitate searching within specific threads.  This is a low-level data
  list issue.

* Facilitate searching within arbitrary collections of groups (as opposed to
  a single group or group hierarchy).  This is mostly a user-interface issue.

* Facilitate searching within search results (i.e., "search only within these
  results below").

* Rework the text indexer so that it doesn't throw out "funny characters" in
  words and then reindex the entire document corpus from scratch.  (Currently,
  words like "won't" are converted to "wont" and words like "S@H" are ignored
  entirely.)

* Filter out canceled articles from returned results.
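The "funny characters" indexer item above describes behavior that can be
approximated like this.  It is a rough sketch of the described behavior
(apostrophes stripped, other odd characters causing the word to be dropped),
not the actual indexer code.

```python
def tokenize_current(text):
    """Approximate the current indexer behavior: apostrophes are stripped
    inside words ("won't" -> "wont"), and tokens still containing other
    non-alphanumeric characters ("S@H") are ignored entirely."""
    tokens = []
    for raw in text.lower().split():
        word = raw.replace("'", "")
        if word.isalnum():
            tokens.append(word)
        # anything with leftover "funny characters" is thrown out
    return tokens

print(tokenize_current("I won't buy S@H sets"))
```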

--Todd


p.s.  The article index database currently contains more than 10,000,000
word-hits.


Subject: 
Re: News search function reactivated (was: News search function temporarily disabled)
Newsgroups: 
lugnet.admin.general, lugnet.off-topic.geek
Date: 
Tue, 2 Jan 2001 11:04:49 GMT
Viewed: 
882 times
  
In lugnet.admin.general, Todd Lehman writes:
The LUGNET News search function is now re-enabled.  I completely revamped
the index data structures and list-merge algorithm and rewrote the core
query engine in C.  It's a much more solid implementation.

Very good!  Although the new search doesn't return most recent articles first
like it used to.  Is that how it should work?  Now I can't see most recent
posts that contain the keyword I want to search for, which makes the search
pretty much useless if it returns thousands of results.

- Dan


Subject: 
Re: News search function reactivated (was: News search function temporarily disabled)
Newsgroups: 
lugnet.admin.general
Date: 
Tue, 2 Jan 2001 14:01:42 GMT
Viewed: 
1081 times
  
In lugnet.admin.general, Todd Lehman writes:
To do someday:

* Facilitate searching within specific threads.  This is a low-level data
list issue.

* Facilitate searching within arbitrary collections of groups (as opposed to
a single group or group hierarchy).  This is mostly a user-interface issue.

* Facilitate searching within search results (i.e., "search only within these
results below").

* Rework the text indexer so that it doesn't throw out "funny characters" in
words and then reindex the entire document corpus from scratch.  (Currently,
words like "won't" are converted to "wont" and words like "S@H" are ignored
entirely.)

* Filter out canceled articles from returned results.

Could I suggest some amendments to the "To do someday" list (for an
'advanced search' only)?

- Search by author
- Search by subject line contents
- Search by date range (or open-ended-- i.e. after date X or before date X)
- Search for articles containing a URL
- option to ONLY return the heads of threads
- And of course search by quoted string (i.e. +quack +"foo bar")

Just some wishful thinking...

DaveE


Subject: 
Re: News search function reactivated (was: News search function temporarily disabled)
Newsgroups: 
lugnet.admin.general, lugnet.off-topic.geek
Date: 
Tue, 2 Jan 2001 17:31:23 GMT
Viewed: 
1052 times
  
Todd Lehman wrote:
* Implement date range restrictions on searches.  Currently searches entire
  corpus of documents and assigns equal date-weight to all documents
  regardless of age.  This is actually working now at the inner levels and
  at the URL level, but there is not yet a forms-based "advanced search"
  user interface for specifying a target date and proximity.

Could you tell us the URL syntax for those of us willing to modify URLs?

--
Frank Filz

-----------------------------
Work: mailto:ffilz@us.ibm.com (business only please)
Home: mailto:ffilz@mindspring.com


Subject: 
Re: News search function reactivated (was: News search function temporarily disabled)
Newsgroups: 
lugnet.admin.general, lugnet.off-topic.geek
Date: 
Tue, 2 Jan 2001 17:38:23 GMT
Viewed: 
4069 times
  
In lugnet.admin.general, Dan Jezek writes:
In lugnet.admin.general, Todd Lehman writes:
The LUGNET News search function is now re-enabled.  I completely revamped
the index data structures and list-merge algorithm and rewrote the core
query engine in C.  It's a much more solid implementation.

Very good!  Although the new search doesn't return most recent articles
first like it used to.  Is that how it should work?

Not yet, no.  See "to-do soon" section in previous post.


Now I can't see most recent
posts that contain the keyword I want to search for,

For now, if you don't mind URL mucking, you can manually append

   &qs=<number>

to the URL and it will use that number (in seconds) as a time delta.  For
example, to limit posts to the last 10 days, use

   &qs=864000

Younger matches will tend toward the top and older matches will tend toward
the bottom.  Anything older than 10 days (in the example above) will be
excluded.
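The described `&qs=` behavior amounts to a cutoff plus an age-based ordering,
roughly as follows.  This is an illustrative sketch: `apply_qs_filter` and
the `(article_id, posted_epoch)` shape are assumptions, not the real
engine's interface.

```python
import time

def apply_qs_filter(results, qs_seconds, now=None):
    """Sketch of &qs=<seconds>: exclude anything older than the time delta
    and sort the survivors youngest-first."""
    now = time.time() if now is None else now
    cutoff = now - qs_seconds
    kept = [(aid, ts) for aid, ts in results if ts >= cutoff]
    return sorted(kept, key=lambda item: item[1], reverse=True)

now = 1_000_000
results = [(1, now - 950_000), (2, now - 100), (3, now - 500_000)]
# &qs=864000 is 10 days' worth of seconds, so article 1 falls outside it
print(apply_qs_filter(results, 864000, now=now))
```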


which makes the
search pretty much useless if it returns thousands of results.

It's actually in the nature of search engines to generate thousands of
results.  What's more important is the first page returned -- i.e., the
ranking.  Typically one doesn't dig down past the first few, so you rarely
actually go visit all the thousands.  If the results are ranked according
to some time criteria, there'll still be thousands of results, except for
super-restrictive time criteria.

--Todd


Subject: 
Re: News search function reactivated (was: News search function temporarily disabled)
Newsgroups: 
lugnet.admin.general, lugnet.off-topic.geek
Date: 
Tue, 2 Jan 2001 18:27:44 GMT
Viewed: 
1751 times
  
In lugnet.admin.general, Todd Lehman writes:
The LUGNET News search function is now re-enabled...

  <snip>
To do soon:
* Implement date range restrictions on searches.

In lugnet.admin.general, Dan Jezek writes:
...which makes the search pretty much useless...

I would have to concur.

For now, if you don't mind URL mucking, you can manually append
  &qs=<number> to the URL and it will use that number (in seconds)
as a time delta.  For example, to limit posts to the last 10 days,
use   &qs=864000

< goes away and tries it... >

Umm - Would it not make sense to simply include the appropriate
qualifier on the system side?  (I tried it and got two year old
results for "qs", but I'm probably doing something wrong.)

  <snip>
To do someday:
* Facilitate searching within search results

This would be a very useful feature - just like large search "engines"

SRC


Subject: 
Re: News search function reactivated (was: News search function temporarily disabled)
Newsgroups: 
lugnet.admin.general
Date: 
Tue, 2 Jan 2001 18:29:14 GMT
Viewed: 
1010 times
  
In lugnet.admin.general, David Eaton writes:
Could I suggest some amendments to the "To do someday" list (for an
'advanced search' only)?

Sure!


- Search by author
- Search by subject line contents

These are mostly covered already due to the way the indexing works -- words
closer to the beginning of a document are given higher weights than words
occurring later in a document.  When the indexer chews on a news article,
it first grabs the 'From:' line and then the 'Subject:' line, and finally
the body.  Somewhere in there, it also grabs the 'Keywords:' and 'Summary:'
headers, if present.

I know what you mean, though, about being able to restrict a search to
_specifically_ some exact subject or author.  I'll think about how I might
be able to handle this in the future -- it would be a separate index database
for each of the two fields.

Maybe you'll later be able to make search queries like this...

   [medieval brikwars] (david eaton) tan baseplates

...which might mean "among articles with subjects matching 'medieval
brikwars', find articles posted by names matching 'david eaton', and
within those, find articles matching 'tan baseplates' in the body."

Is that the sort of functionality you're looking for?


- And of course search by quoted string (i.e. +quack +"foo bar")

This is also largely covered already by word-proximity sensitivity and
word-order sensitivity -- although I definitely agree that it would be
nice for super-refined searches.  I'll think more about it.  I think I
know now of a way to do the "" thing efficiently.  (I didn't months ago.)


- Search by date range (or open-ended-- i.e. after date X or before date X)

OK, I just put this in during lunch...try it out...  It's not a specific
date, but relative dates...

If you search for

   david eaton <10

then it'll show you things matching "david eaton" that were posted "about 10
days ago" (plus or minus 10 days -- with higher matches given to those that
are closer to the 10-days-ago-point).

Alternatively, if you search for

   david eaton <10 >1

then it'll show you things that were posted about 10 days ago, plus or minus
1 day.

If you want to go back farther, you can give days/months/years, for example

   david eaton <15/6/1

would look for (favor) things that were posted 15 days, 6 months, and 1 year
ago.  To limit the range to that time period, plus or minus one week, do

   david eaton <15/6/1 >7

I'm thinkin' this'll be quite useful for digging up stuff that's "about a
week ago" or "about 3 months ago" or "oh, about 2 years ago" -- when it's
tough to remember an exact date.

   fractals are amazing <7
   building a bicycle wheel <0/3
   sensors and methods for mobile robot positioning <//2
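Parsing the `<days/months/years` form above might look roughly like this.
It is a sketch with assumed 30-day months and 365-day years; the actual
engine's conversion may well differ.

```python
def parse_relative_date(term):
    """Parse a '<days/months/years' term into an approximate number of
    days ago.  Empty fields are allowed, as in '<//2' for 'about 2 years
    ago'."""
    assert term.startswith('<')
    parts = term[1:].split('/')
    days = int(parts[0]) if len(parts) > 0 and parts[0] else 0
    months = int(parts[1]) if len(parts) > 1 and parts[1] else 0
    years = int(parts[2]) if len(parts) > 2 and parts[2] else 0
    return days + 30 * months + 365 * years

print(parse_relative_date('<10'))      # about 10 days ago
print(parse_relative_date('<15/6/1'))  # 15 days, 6 months, and 1 year ago
print(parse_relative_date('<//2'))     # about 2 years ago
```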


- Search for articles containing a URL

Not sure how to handle this yet.


- option to ONLY return the heads of threads

Ahh yes -- I'll put that on the "must do" list.  That won't be too hard once
the thread lists are generated internally for other purposes.

--Todd


Subject: 
Re: News search function reactivated (was: News search function temporarily disabled)
Newsgroups: 
lugnet.admin.general
Date: 
Tue, 2 Jan 2001 19:00:58 GMT
Viewed: 
1250 times
  
In lugnet.admin.general, Todd Lehman writes:
Is that the sort of functionality you're looking for?

Pretty much... actually, I was thinking more along the lines of an advanced
search form though:

Search for: ______________________ (uses +'s and -'s as is)
Search for text in subject line [] (checkbox)
Posted by: _______________________ (uses +'s and -'s... or no symbols, too)
Search only for heads of a thread [] (checkbox)
Posted before: ___ /____ / ____
Posted after: ___ / ____/ ____

But throwing in symbols/wildcards on the command line instead of in a form
works for me too :)

If you search for

  david eaton <10

then it'll show you things matching "david eaton" that were posted "about 10
days ago" (plus or minus 10 days -- with higher matches given to those that
are closer to the 10-days-ago-point).

So-- looks like if today is day 100, "david eaton <10" would search days
80-100, giving highest precedence to things closest to day 90... cool! I
think that's actually really useful :)

I'm thinkin' this'll be quite useful for digging up stuff that's "about a
week ago" or "about 3 months ago" or "oh, about 2 years ago" -- when it's
tough to remember an exact date.

Poifect! That's something I had been wanting for a while since I'll know
(for example) that I posted something last spring, but I want to make sure
NOT to search for anything after, say, June, or before March... very cool
indeed :)

- Search for articles containing a URL

Not sure how to handle this yet.

Yeah, that's just one of those "Oh, if only" things-- mostly for when I'm
looking for someone who posted a link to their site... occasionally I've
tried to do this by putting "http" on the query string, although it doesn't
rule out posts that give their URLs as "www.foo.com/~blah/cool_page.html".

- option to ONLY return the heads of threads

Ahh yes -- I'll put that on the "must do" list.  That won't be too hard once
the thread lists are generated internally for other purposes.

Cool!

DaveE


Subject: 
Re: News search function reactivated (was: News search function temporarily disabled)
Newsgroups: 
lugnet.admin.general
Date: 
Tue, 2 Jan 2001 19:04:19 GMT
Viewed: 
971 times
  
Todd Lehman wrote:
- Search for articles containing a URL

Not sure how to handle this yet.

One thing which would generally make it pretty easy to find URLs is to
index "http". When dealing with special characters, definitely treat "/"
and "\" as word separators. Probably ":" also.

--
Frank Filz

-----------------------------
Work: mailto:ffilz@us.ibm.com (business only please)
Home: mailto:ffilz@mindspring.com


Subject: 
Re: News search function reactivated (was: News search function temporarily disabled)
Newsgroups: 
lugnet.admin.general
Date: 
Tue, 2 Jan 2001 19:07:34 GMT
Viewed: 
1003 times
  
Todd Lehman wrote:
I know what you mean, though, about being able to restrict a search to
_specifically_ some exact subject or author.  I'll think about how I might
be able to handle this in the future -- it would be a separate index database
for each of the two fields.

Do you index the name in "X-real-life-name"?

One thought, index the special strings "from:" and "subject:". The the
search:

   from: ffilz

Should rank my posts highly due to proximity. Of course it would be
better to index the real life name as if it was preceded by from: also
so that you could search:

   from: filz

and find my posts.


--
Frank Filz

-----------------------------
Work: mailto:ffilz@us.ibm.com (business only please)
Home: mailto:ffilz@mindspring.com


Subject: 
RE: News search function reactivated (was: News search function temporarily disabled)
Newsgroups: 
lugnet.admin.general
Date: 
Tue, 2 Jan 2001 23:03:02 GMT
Viewed: 
1005 times
  
David Eaton writes:
In lugnet.admin.general, Todd Lehman writes:
Is that the sort of functionality you're looking for?

Pretty much... actually, I was thinking more along the lines of
an advanced search form though:

Search for: ______________________ (uses +'s and -'s as is)
Search for text in subject line [] (checkbox)
Posted by: _______________________ (uses +'s and -'s... or no
symbols, too)
Search only for heads of a thread [] (checkbox)
Posted before: ___ /____ / ____
Posted after: ___ / ____/ ____

But throwing in symbols/wildcards on the command line instead of in a form
works for me too :)

I think a multi-field advanced search is the way to go...much easier to use,
IMO--something like the DejaNews power search:
http://www.deja.com/home_ps.shtml
The actual search keywords are in one field, and then there are numerous
ways to limit the search.
--Bram


Bram Lambrecht
bram@cwru.edu
http://home.cwru.edu/~bxl34/


Subject: 
Re: News search function reactivated (was: News search function temporarily disabled)
Newsgroups: 
lugnet.admin.general
Date: 
Wed, 3 Jan 2001 01:51:23 GMT
Viewed: 
1040 times
  
In lugnet.admin.general, Frank Filz writes:
Todd Lehman wrote:
I know what you mean, though, about being able to restrict a search to
_specifically_ some exact subject or author.  I'll think about how I might
be able to handle this in the future -- it would be a separate index
database for each of the two fields.

Do you index the name in "X-real-life-name"?

Ya, let's see...as it assembles the text to index, first it grabs
X-Real-Life-Name:, then it grabs either Original-From: or From:, then
Subject:, then Keywords:, then Summary:, and then finally the non-quoted
and non-sig parts of the body.

So, for example, on your post that I'm replying to, it would generate

   frank filz frank filz re news search function reactivated was news search
   function temporarily disabled do you index the name in x real life name
   one thought index the special strings from and subject the the search
   etc., etc.

And then it would remove a few stopwords and then feed that to the actual
indexer.
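The assembly order described above can be sketched as follows.  The function
name `assemble_index_text` and the dict-of-headers shape are illustrative
assumptions, not the real indexer's interface.

```python
def assemble_index_text(article):
    """Build the text to index in the order described: real-life name,
    then From (or Original-From), Subject, Keywords, Summary, and finally
    the body with quoted lines and the signature removed."""
    parts = [
        article.get('X-Real-Life-Name', ''),
        article.get('Original-From') or article.get('From', ''),
        article.get('Subject', ''),
        article.get('Keywords', ''),
        article.get('Summary', ''),
    ]
    body_lines = [
        line for line in article.get('body', '').splitlines()
        if not line.startswith('>')        # drop quoted text
    ]
    if '-- ' in body_lines:                # drop everything after a sig separator
        body_lines = body_lines[:body_lines.index('-- ')]
    parts.append(' '.join(body_lines))
    return ' '.join(p for p in parts if p).lower()

art = {'X-Real-Life-Name': 'Frank Filz', 'From': 'ffilz@mindspring.com',
       'Subject': 'Re: News search', 'body': '> quoted\nDo you index it?\n-- \nsig'}
print(assemble_index_text(art))
```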


One thought, index the special strings "from:" and "subject:". The the
search:

   from: ffilz

Should rank my posts highly due to proximity. Of course it would be
better to index the real life name as if it was preceded by from: also
so that you could search:

   from: filz

and find my posts.

Ah.  That's a neat trick!  It's a little English-centric, though, but it's
still a very simple and elegant partial solution...one of those "80% of the
benefits for only 20% of the work" types of things.  Unfortunately, it would
mean reindexing the entire news corpus from scratch, because they'd be
insertions in the numeric word-order lists -- so it's probably something to
do along with other additions the next time the index is rebuilt from scratch.
The last time I rebuilt it, I think it took a whole day, and that was with
about 1/4 as many articles in the system.  (The indexer is optimized for fast
incremental indexing rather than fast one-time building.)

--Todd


Subject: 
Re: News search function reactivated (was: News search function temporarily disabled)
Newsgroups: 
lugnet.admin.general
Date: 
Wed, 3 Jan 2001 02:48:21 GMT
Viewed: 
1306 times
  
In lugnet.admin.general, David Eaton writes:
In lugnet.admin.general, Todd Lehman writes:
Is that the sort of functionality you're looking for?

Pretty much... actually, I was thinking more along the lines of an advanced
search form though:

Search for: ______________________ (uses +'s and -'s as is)
Search for text in subject line [] (checkbox)
Posted by: _______________________ (uses +'s and -'s... or no symbols, too)
Search only for heads of a thread [] (checkbox)
Posted before: ___ /____ / ____
Posted after: ___ / ____/ ____

But throwing in symbols/wildcards on the command line instead of in a form
works for me too :)

Ya, something like that'd be good to slap on top after the base functionality.
:-)  Nobody wants to *have* to remember how all the squiggly and square brace
thingums in a search box work.  :-)


If you search for

  david eaton <10

then it'll show you things matching "david eaton" that were posted "about 10
days ago" (plus or minus 10 days -- with higher matches given to those that
are closer to the 10-days-ago-point).

So-- looks like if today is day 100, "david eaton <10" would search days
80-100, giving highest precedence to things closest to day 90...

Ya, precisely.  It was originally (summer of '99) a smooth bell-shaped curve

   y = exp(-.5 * x^2)

(x being amount of deviation from the target and y being the output function
giving the fitness value) but that was chewing up 10^-6 seconds in one of the
tight inner loops (i.e., wasting 0.07 CPU seconds on a word like 'lego' with
~70000 hits), so I threw that out and changed it to a linear 1-|x| shaped spike
curve

   y = max(0, 1-|x|)

instead.  That doubled the overall throughput and still gave decent results.

The main advantage of the bell curve y=exp(-.5*x^2) is that y>0 for all x,
but that can also be a disadvantage.  The sharp y=max(0,1-|x|) curve has a
nice sudden cutoff at x<-1 and x>1.  :-)  In a way I'm kinda glad the bell
curve was so slow to compute.
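The two fitness curves compared above, side by side, transcribed directly
from the formulas in the text:

```python
import math

def bell(x):
    """Original fitness curve: smooth, and y > 0 for all x."""
    return math.exp(-0.5 * x * x)

def spike(x):
    """Replacement: linear spike with a hard cutoff outside |x| < 1."""
    return max(0.0, 1.0 - abs(x))

# Both peak at the target date (x = 0)...
assert bell(0) == spike(0) == 1.0
# ...but only the spike cuts off entirely beyond one unit of deviation.
print(bell(2.0))   # small but nonzero
print(spike(2.0))  # exactly 0.0
```

The spike avoids the `exp` call in the inner loop, which is where the
claimed doubling of throughput comes from.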


cool! I
think that's actually really useful :)

I'm thinkin' this'll be quite useful for digging up stuff that's "about a
week ago" or "about 3 months ago" or "oh, about 2 years ago" -- when it's
tough to remember an exact date.

Poifect! That's something I had been wanting for a while since I'll know
(for example) that I posted something last spring, but I want to make sure
NOT to search for anything after, say, June, or before March... very cool
indeed :)

Ya, I can't count the number of times I've wanted to go about "about so many
days or weeks" to look for something.  I'll remember something and not know
exactly what date it was posted, but I'll remember roughly how long ago it
was.  So it's effectively doing a fuzzy date search with variable focus (wide,
narrow, etc.).


- Search for articles containing a URL
Not sure how to handle this yet.
Yeah, that's just one of those "Oh, if only" things-- mostly for when I'm
looking for someone who posted a link to their site... occasionally I've
tried to do this by putting "http" on the query string, although it doesn't
rule out posts that give their URLs as "www.foo.com/~blah/cool_page.html".

I'm not sure why I included "http" in the stopword list.  I apologize for
that.  (Probably because it would have generated zillions upon zillions of
word hits.  "com" is the most frequently indexed word here.)

"http" could certainly stand to go back in now that the query engine is so
much faster.  Maybe even other words like "it" and "that" and "the."

Here's a list of stopwords, BTW...are there any you see here that stand out
in your mind as having given you problems in the past?

   a an the it its it's this that what
   i i'm im my we me us you
   do be am is are was can has
   of for from with to in out on off at as if and but or not no have so
   http www

One thing I *really* hate about stopword lists is that they're so language-
centric (i.e., a stopword in one language might be a darn-tootin' regular
good word in another language).

Also, all single-letter words are ignored (i.e., "a" and "i" for English and
"y" for Spanish, etc.).
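Applying the stopword list and the single-letter rule above might look like
this (a sketch only; the real indexer's handling may differ in detail):

```python
# The stopword list quoted in the post above.
STOPWORDS = set("""a an the it its it's this that what
i i'm im my we me us you
do be am is are was can has
of for from with to in out on off at as if and but or not no have so
http www""".split())

def remove_stopwords(words):
    """Drop stopwords and all single-letter words, per the rules above."""
    return [w for w in words
            if len(w) > 1 and w.lower() not in STOPWORDS]

print(remove_stopwords(['i', 'posted', 'it', 'on', 'www', 'lugnet', 'y']))
```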


- option to ONLY return the heads of threads
Ahh yes -- I'll put that on the "must do" list.  That won't be too hard once
the thread lists are generated internally for other purposes.
Cool!

I've planned ahead here.  For each group, there'll be a list of articles that
comprise the heads-of-threads for those groups.  That list can then be used to
generate more compact views into the group, or it can be fed into the
query engine as an "include only these" filter.  In memory, once loaded, the
article filter lists are 1-bit flags -- 1 bit per article position -- so even
a list of a quarter million articles consumes only 30 KB of memory for the
fraction of a second that it's needed.
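The 1-bit-per-article arithmetic above checks out: a quarter million bits is
a little over 30 KB.  A minimal bitset sketch (illustrative only, not the
actual filter structure):

```python
class ArticleFlags:
    """One bit per article position, packed into a bytearray."""

    def __init__(self, max_articles):
        self.bits = bytearray((max_articles + 7) // 8)

    def set(self, n):
        self.bits[n >> 3] |= 1 << (n & 7)

    def test(self, n):
        return bool(self.bits[n >> 3] & (1 << (n & 7)))

flags = ArticleFlags(250_000)
print(len(flags.bits))        # 31250 bytes, i.e. about 30 KB
flags.set(8613)
print(flags.test(8613), flags.test(8614))
```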

--Todd


Subject: 
Article bit-flags (was: Re: News search function reactivated)
Newsgroups: 
lugnet.admin.general
Date: 
Wed, 3 Jan 2001 03:57:06 GMT
Viewed: 
1105 times
  
In lugnet.admin.general, Todd Lehman writes:
[...]
I've planned ahead here.  For each group, there'll be a list of articles that
comprise the heads-of-threads for those groups.  That list can then be used to
generate more compact views into the group, or it can be fed into the
query engine as an "include only these" filter.  In memory, once loaded, the
article filter lists are 1-bit flags -- 1 bit per article position -- so even
a list of a quarter million articles consumes only 30 KB of memory for the
fraction of a second that it's needed.

Oh, one other thing...planning ahead:

Another potential application of article bit-flags is read/unread lists on a
person-by-person basis via the web interface.  I know this is something that
people have been asking for for a long time.  When updating a flag, it would
be as simple as a single disk read/write to change a single bit.  And when
mass-article filtering, it would simply load into memory as a memory-mapped
file and would be extremely efficient in terms of both space and speed.  It
would even be easy to make customized message folders and move messages
between them, delete messages, mark unread, etc.
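The single-bit disk update described above could be sketched like this.
The name `set_read_flag` and the flat flag-file layout are hypothetical; the
point is that marking one article read touches only one byte on disk.

```python
import os
import tempfile

def set_read_flag(path, article_no, value=True):
    """Seek to the byte holding this article's flag, flip one bit, and
    write the byte back -- a single read and a single write."""
    byte_index, bit = article_no >> 3, 1 << (article_no & 7)
    with open(path, 'r+b') as f:
        f.seek(byte_index)
        current = f.read(1)[0]
        updated = (current | bit) if value else (current & ~bit)
        f.seek(byte_index)
        f.write(bytes([updated]))

# Demo with a small pre-sized flag file (room for 8192 articles).
path = os.path.join(tempfile.mkdtemp(), 'read-flags')
with open(path, 'wb') as f:
    f.write(bytes(1024))
set_read_flag(path, 4451)
with open(path, 'rb') as f:
    data = f.read()
print(bool(data[4451 >> 3] & (1 << (4451 & 7))))
```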

Would you like to be able to say, "search for 'blah blah blah' within my
'cool models' folder"?

--Todd


Subject: 
Re: Article bit-flags (was: Re: News search function reactivated)
Newsgroups: 
lugnet.admin.general
Date: 
Wed, 3 Jan 2001 04:03:46 GMT
Viewed: 
1086 times
  
Todd Lehman wrote:
Oh, one other thing...planning ahead:

Another potential application of article bit-flags is read/unread lists on a
person-by-person basis via the web interface.  I know this is something that
people have been asking for for a long time.  When updating a flag, it would
be as simple as a single disk read/write to change a single bit.  And when
mass-article filtering, it would simply load into memory as a memory-mapped
file and would be extremely efficient in terms of both space and speed.  It
would even be easy to make customized message folders and move messages
between them, delete messages, mark unread, etc.

Would you like to be able to say, "search for 'blah blah blah' within my
'cool models' folder"?

Ooh ooh ooh... One thing I really really want is to be able to put
messages into folders (if anyone knows of a decent newsreader which
allows such - please let me know - it would be preferable that it do so
without requiring me to store the message).

Frank


Subject: 
Re: News search function reactivated (was: News search function temporarily disabled)
Newsgroups: 
lugnet.admin.general, lugnet.off-topic.geek
Date: 
Wed, 3 Jan 2001 04:17:03 GMT
Viewed: 
1720 times
  
In lugnet.admin.general, Todd Lehman writes:
In lugnet.admin.general, Dan Jezek writes:
In lugnet.admin.general, Todd Lehman writes:
The LUGNET News search function is now re-enabled.  I completely revamped
the index data structures and list-merge algorithm and rewrote the core
query engine in C.  It's a much more solid implementation.

Very good!  Although the new search doesn't return most recent articles
first like it used to.  Is that how it should work?
For now, if you don't mind URL mucking, you can manually append

  &qs=<number>

to the URL and it will use that number (in seconds) as a time delta.  For
example, to limit posts to the last 10 days, use

  &qs=864000

It works great! ... but the &qs doesn't carry over to the next page of
results.  So if I want to see more pages, I have to edit the querystring on
each page.  Since you already have the inner workings of this in place, it
would be really easy to just add a textbox named "qs" and add the &qs= to
the bottom "5 more, 10 more"... links.  With a little more effort, you could
include radio buttons to have the user select how many days, months or years
they want to go back and have your search engine convert it to seconds
depending on what the user selects.

It's actually in the nature of search engines to generate thousands of
results.

If given thousands of results, most search engines have some advanced
options like sorting.

What's more important is the first page returned -- i.e., the ranking.
Typically one doesn't dig down past the first few, so you rarely
actually go visit all the thousands.

I'd be interested in seeing some statistics on how far the average user goes
when given back let's say 10, 100 and 1,000 pages of results.  It would help
in the design of an effective search engine.


Subject: 
Re: News search function reactivated (was: News search function temporarily disabled)
Newsgroups: 
lugnet.admin.general, lugnet.off-topic.geek
Date: 
Wed, 3 Jan 2001 05:16:07 GMT
Viewed: 
9466 times
  
In lugnet.admin.general, Dan Jezek writes:
[...]
example, to limit posts to the last 10 days, use

   &qs=864000

It works great! ... but the &qs doesn't carry over to the next page of
results.  So if I want to see more pages, I have to edit the querystring on
each page.

oops, doy!  I didn't put in the propagation of that URL term.  I don't consider
it 100% "documented" yet (it's still subject to change without notice), but I
still shouldn't have missed that.  Thanks.  I'll fix that.

The reason it's subject to change is partially because the letter 's' in 'qs'
is named after the word (or greek letter, rather) 'sigma' -- sigma being 1
standard deviation in the bell curve function f(x) = exp(-x^2/2) -- and that
formula isn't being used anymore in the searches, and partially because 'qs'
might better someday be used for "query subject."  Anyway, it's still not 100%
in stone.  But it'll work until it breaks.


Since you already have the inner workings of this in place, it
would be really easy to just add a textbox named "qs" and add the &qs= to
the bottom "5 more, 10 more"... links.  With a little more effort, you could
include radio buttons to have the user select how many days, months or years
they want to go back and have your search engine convert it to seconds
depending on what the user selects.

Yup, that's the idea!!!  Say, where is that old article about sigma and
advanced options...ah! so easy to find now!  :-)

   http://news.lugnet.com/?q=url+query+qs+qt+sigma+%3C//1.5

(See topmost result and related thread for more background.)


It's actually in the nature of search engines to generate thousands of
results.

If given thousands of results, most search engines have some advanced
options like sorting.

Well, they -are- sorted.  They're always sorted -- always highest probability
of relevance first, lowest last.  Usually, the metric for relevance is a
combination of non-temporal factors such as word frequencies, word proximities,
and word orderings.  I don't know of any search engine that doesn't sort (on
some criteria) the matches it finds.  But anyway, I think you meant sorting
by time?

I wonder if a little link at the top to re-deploy the search taking recentness
into account (or conversely, turning it off if it's on) would be useful?


What's more important is the first page returned -- i.e., the ranking.
Typically one doesn't dig down past the first few, so you rarely
actually go visit all the thousands.

I'd be interested in seeing some statistics on how far the average user goes
when given back let's say 10, 100 and 1,000 pages of results.  It would help
in the design of an effective search engine.

Me too.  I'd expect a f(x)=1/x type of curve, but it would be fun to see actual
numbers.  :-)

--Todd


Subject: 
Re: News search function reactivated (was: News search function temporarily disabled)
Newsgroups: 
lugnet.admin.general, lugnet.off-topic.geek
Date: 
Wed, 3 Jan 2001 06:16:20 GMT
Viewed: 
4369 times
  
In lugnet.admin.general, Todd Lehman writes:
oops, doy!  I didn't put in the propagation of that URL term.  I don't consider
it 100% "documented" yet (it's still subject to change without notice), but I
still shouldn't have missed that.  Thanks.  I'll fix that.
The reason it's subject to change is partially because the letter 's' in 'qs'
is named after the word (or greek letter, rather) 'sigma' -- sigma being 1
standard deviation in the bell curve function f(x) = exp(-x^2/2) -- and that
formula isn't being used anymore in the searches, and partially because 'qs'
might better someday be used for "query subject."  Anyway, it's still not 100%
in stone.  But it'll work until it breaks.

Wow!  So you have terms for the ampersand options in a URL?  My standpoint
on this would be to put everything in a form and kill 2 birds with 1 stone -
not having to think of how to name URL terms (unless you enjoy doing that)
and having the search more user-friendly (not everyone will remember the
options or find it easy to edit the URL).

If given thousands of results, most search engines have some advanced
options like sorting.

Well, they -are- sorted.  They're always sorted -- always highest probability
of relevance first, lowest last.  Usually, the metric for relevance is a
combination of non-temporal factors such as word frequencies, word proximities,
and word orderings.  I don't know of any search engine that doesn't sort (on
some criteria) the matches it finds.  But anyway, I think you meant sorting
by time?

No, I meant having the option to pick between what I want the results to be
sorting on.  Dejanews has a great power search:

http://www.deja.com/home_ps.shtml

which includes the option to sort by relevance, subject, forum, author and
date.  That's how I would like to see the sort options here.  But knowing
that you most likely don't have the resources that dejanews has and how
flawlessly Lugnet runs on the current setup, I'm satisfied with editing the
URL for now :-)

I'd be interested in seeing some statistics on how far the average user goes
when given back let's say 10, 100 and 1,000 pages of results.  It would help
in the design of an effective search engine.

Me too.  I'd expect an f(x)=1/x type of curve, but it would be fun to see actual numbers.  :-)

It could be done.  Include another version of jump.cgi into the 5 more, 10
more... on the search results page and log the number of results returned,
the IP address and the query subject.  Then run an average, min, max query
grouped by all 3 fields.  Sounds complicated; it depends on how badly you want
to see the results.  I wouldn't want to go through the process of
implementing that but would really like to see the results :-)


Subject: 
Re: News search function reactivated (was: News search function temporarily disabled)
Newsgroups: 
lugnet.admin.general, lugnet.off-topic.geek
Date: 
Wed, 3 Jan 2001 14:11:53 GMT
Viewed: 
7075 times
  
In lugnet.admin.general, Dan Jezek writes:
Wow!  So you have terms for the ampersand options in a URL?  My standpoint
on this would be to put everything in a form and kill 2 birds with 1 stone -
not having to think of how to name URL terms (unless you enjoy doing that)
and having the search more user-friendly (not everyone will remember the
options or find it easy to edit the URL).

Ya, exactly -- first name the URL components carefully and then put a user-
friendly level on top of it.  Best of both worlds.


No, I meant having the option to pick between what I want the results to be
sorting on.  Dejanews has a great power search:

http://www.deja.com/home_ps.shtml

which includes the option to sort by relevance, subject, forum, author and
date.  That's how I would like to see the sort options here.

Ah, I see.  Yeah, that could be helpful in certain cases, if you're scouring
tons of results!  I've needed to look things up on Deja.com, so I know what
you mean.


But knowing
that you most likely don't have the resources that dejanews has and how
flawlessly Lugnet runs on the current setup, I'm satisfied with editing the
URL for now :-)

There's an alternate form that avoids the &qs= thingie, so you don't have to
edit the URLs:

http://news.lugnet.com/admin/general/?n=8613


It could be done.  Include another version of jump.cgi into the 5 more, 10
more... on the search results page and log the number of results returned,
the IP address and the query subject.

These don't actually run through jump.cgi.  But they're already logged by
httpd anyway.  (That's how the jump.cgi logging is implemented as well.)


Then run an average, min, max query
grouped by all 3 fields.  Sounds complicated, depends on how badly you want
to see the results.  I wouldn't want to go through the process of
implementing that but would really like to see the results :-)

Hmm, it's all there now, except for logging the number of results produced.
I guess it could be as simple as open for append, flock, print, and close on
a filehandle inside of the search page...lemme think about it.  Analyzing the
results and making a graph would be a snap with gnuplot.

I think it would be especially fun to compare the graph now to the way it was
(would have been) before the change...but alas, that data was never captured
for the old query engine and it's too late now.

--Todd


Subject: 
Re: News search function reactivated (was: News search function temporarily disabled)
Newsgroups: 
lugnet.admin.general
Date: 
Wed, 3 Jan 2001 14:44:55 GMT
Viewed: 
1115 times
  
In lugnet.admin.general, Todd Lehman writes:
I'm not sure why I included "http" in the stopword list.  I apologize for
that.  (Probably because it would have generated zillions upon zillions of
word hits.  "com" is the most frequently indexed word here.)

"http" could certainly stand to go back in now that the query engine is so
much faster.  Maybe even other words like "it" and "that" and "the."

As an algorithmic guess, I think I'd probably attempt something a bit
different... If someone enters:

the

I'd probably want to ignore it. But if they entered:

the best design

I might want to consider the 'the'. Dunno. I'd probably test an algorithm
that ignored all stopwords in the initial search, but then for each result
of the initial search, score it with respect to any stopwords found (&
proximity, etc).

Here's a list of stopwords, BTW...are there any you see here that stand out
in your mind as having given you problems in the past?

  a an the it its it's this that what
  i i'm im my we me us you
  do be am is are was can has
  of for from with to in out on off at as if and but or not no have so
  http www

The only things here that've given me problems (or that I'd expect might
give others problems [apart from language differences]) are http and www.
And now that I look at it, I've got to ask: if I specified "http" as a
search parameter, would it in fact score every post made with the webserver
as having an instance, thanks to the X-Nntp-Gateway?  Same with www, I
suppose, except that all new webserver posts are from 'news.lugnet.com'
instead -- but all older posts would show up?  Or is various header info ignored?

One thing I *really* hate about stopword lists is that they're so language-
centric (i.e., a stopword in one language might be a darn-tootin' regular
good word in another language).

It does present a rather unique problem (unless the search proves capable of
handling stopwords)... Fortunately the vast majority of posts seem to be in
one language for now, even though there has been a distinct increase in some
other languages in the last year and a half or so :)

I've planned ahead here.  For each group, there'll be a list of articles that
comprise the heads-of-threads for those groups.  That list can then be used to
generate more compact views into the group, or it can also be fed into the
query engine as an "include only these" filter.  In memory, once loaded, the
article filter lists are 1-bit flags -- 1 bit per article position -- so even
a list of a quarter million articles consumes only 30 KB of memory for the
fraction of a second that it's needed.

Very cool! You mention later the prospect of having 'folders' for things
like 'cool sites' or 'space models' or 'great building ideas' for members
using the web-interface... Would this be something user-configurable? Could
I create as many folders as I would want, or would there be a set number of
X folders that would be pre-determined?

Either way (assuming people wouldn't make hundreds of folders for
themselves) it sounds like there's not a space concern (my worst estimates
of millions of posts and thousands of users weren't all that bad
considering). And as long as the interface isn't doing something dynamic
with the folders (making little action icons for each folder on article
listing pages, etc), sounds like there wouldn't be a real time strain
either... A very cool idea!

DaveE


Subject: 
Re: News search function reactivated (was: News search function temporarily disabled)
Newsgroups: 
lugnet.off-topic.geek
Date: 
Wed, 3 Jan 2001 16:34:22 GMT
Viewed: 
776 times
  
In lugnet.admin.general, Todd Lehman writes:
The LUGNET News search function is now re-enabled.  I completely revamped
the index data structures and list-merge algorithm and rewrote the core
query engine in C.  It's a much more solid implementation.

All geeks capitulate sooner or later on perl vs C.  Of course
Larry (and many others I work with) would tell you to write that
stuff in Java but that would be a step backwards.

Congratulations!  Of course its stability depends greatly on your
diligence in policing your pointers.

KL


Subject: 
Re: News search function reactivated (was: News search function temporarily disabled)
Newsgroups: 
lugnet.off-topic.geek
Date: 
Wed, 3 Jan 2001 17:21:47 GMT
Viewed: 
815 times
  
In lugnet.off-topic.geek, Kevin Loch writes:
In lugnet.admin.general, Todd Lehman writes:
The LUGNET News search function is now re-enabled.  I completely revamped
the index data structures and list-merge algorithm and rewrote the core
query engine in C.  It's a much more solid implementation.
All geeks capitulate sooner or later on perl vs C.  Of course
Larry (and many others I work with) would tell you to write that
stuff in Java but that would be a step backwards.

I like Java.  But this really needed to be close to the metal and generate
code that would fit in the L1 cache for the non-memory-bus-bound portions
of the loops.  The GNU C compiler is incredible.


Congratulations!  Of course its stability depends greatly on your
diligence in policing your pointers.

It's the best C code I've written in 12 years.  I think being away from C for
3 1/2 years has helped.

--Todd


Subject: 
Re: News search function reactivated (was: News search function temporarily disabled)
Newsgroups: 
lugnet.off-topic.geek
Date: 
Wed, 3 Jan 2001 20:41:42 GMT
Viewed: 
822 times
  
In lugnet.off-topic.geek, Todd Lehman writes:
In lugnet.off-topic.geek, Kevin Loch writes:
In lugnet.admin.general, Todd Lehman writes:
The LUGNET News search function is now re-enabled.  I completely revamped
the index data structures and list-merge algorithm and rewrote the core
query engine in C.  It's a much more solid implementation.
All geeks capitulate sooner or later on perl vs C.  Of course
Larry (and many others I work with) would tell you to write that
stuff in Java but that would be a step backwards.

I like Java.  But this really needed to be close to the metal and generate
code that would fit in the L1 cache for the non-memory-bus-bound portions
of the loops.  The GNU C compiler is incredible.


Aah, you definitely couldn't have done that in Java.  Of course the ability
to declare a couple register variables helps too.

KL


Subject: 
Re: News search function reactivated (was: News search function temporarily disabled)
Newsgroups: 
lugnet.off-topic.geek
Date: 
Wed, 3 Jan 2001 22:39:52 GMT
Viewed: 
835 times
  
Kevin Loch wrote:

In lugnet.off-topic.geek, Todd Lehman writes:
In lugnet.off-topic.geek, Kevin Loch writes:
In lugnet.admin.general, Todd Lehman writes:
The LUGNET News search function is now re-enabled.  I completely revamped
the index data structures and list-merge algorithm and rewrote the core
query engine in C.  It's a much more solid implementation.
All geeks capitulate sooner or later on perl vs C.  Of course
Larry (and many others I work with) would tell you to write that
stuff in Java but that would be a step backwards.

I like Java.  But this really needed to be close to the metal and generate
code that would fit in the L1 cache for the non-memory-bus-bound portions
of the loops.  The GNU C compiler is incredible.


Aah, you definitely couldn't have done that in Java.  Of course the ability
to declare a couple register variables helps too.

Of course, I've heard of situations where an interpreter outdid hand-
crafted assembler.  This can occur if the portion of the interpreter
needed to run your code fits in the code cache and the byte codes fit
in the data cache, while the hand-crafted assembler code wouldn't fit in
the code cache.  If the code processes a large quantity of data without
revisiting it, the data cache may be irrelevant for the processed data
(even a few back-references are fine if they stay relatively local --
for example, some kind of vector computation where res[i] = f(data[i],
data[i+1])).

--
Frank Filz

-----------------------------
Work: mailto:ffilz@us.ibm.com (business only please)
Home: mailto:ffilz@mindspring.com


Subject: 
Re: News search function reactivated (was: News search function temporarily disabled)
Newsgroups: 
lugnet.off-topic.geek
Date: 
Wed, 3 Jan 2001 23:18:06 GMT
Viewed: 
832 times
  
In lugnet.off-topic.geek, Kevin Loch writes:
I like Java.  But this really needed to be close to the metal and generate
code that would fit in the L1 cache for the non-memory-bus-bound portions
of the loops.  The GNU C compiler is incredible.
Aah, you definately couldn't have done that in Java.

I probably couldn't, no, but a very experienced Java programmer and a good
JVM machine could conceivably do better than C.  (It's not unheard of for
Java to be faster than C for certain types of things.)  The big hits would
probably be the JVM environment startup and the array boundary checking.
Anyway, ya, I wouldn't wanna try this sort of thing in Java...I'm much more
comfortable with C when it comes to this sort of metal grinding.  :-)


Of course the ability to declare a couple register variables helps too.

Well, any good C compiler these days actually ignores the 'register' keyword.
(Right?)  The compiler does a better job of register allocation than a human
does (when it comes to assigning C variables to registers), especially on
modern superscalar pipelined CPU architectures.

--Todd


Subject: 
Re: News search function reactivated (was: News search function temporarily disabled)
Newsgroups: 
lugnet.admin.general
Date: 
Thu, 4 Jan 2001 05:07:43 GMT
Viewed: 
1215 times
  
In lugnet.admin.general, David Eaton writes:
As an algorithmic guess, I think I'd probably attempt something a bit
different... If someone enters:
   the
I'd probably want to ignore it. But if they entered:
   the best design
I might want to consider the 'the'. Dunno. I'd probably test an algorithm
that ignored all stopwords in the initial search, but then for each result
of the initial search, score it with respect to any stopwords found (&
proximity, etc).

Oh -- actually, what search engines typically do on queries (and I just
finally added this last week) is downvalue relatively common words and
upvalue relatively uncommon words -- what's called "term ranking" or "term
weighting."  For example, when you search for "lego duplo", there are
(currently) about 65,000 documents accounting for about 140,000 word-hits
of "lego" and only about 1500 documents accounting for only about 3300
word-hits of "duplo".  So the importance of the word "duplo" is very high
relative to the word "lego".  Similarly, there are tons of articles with
"david" but only about one fifth as many with "eaton" -- so a search for
"david eaton" has an easier time finding David Eaton among all the other
Davids.


The only things here that've given me problems (or that I'd expect might
give others problems [apart from language differences]) are http and www.
And now that I look at it, I've got to ask: if I specified "http" as a
search parameter, would it in fact score every post made with the webserver
as having an instance, thanks to the X-Nntp-Gateway?  Same with www, I
suppose, except that all new webserver posts are from 'news.lugnet.com'
instead -- but all older posts would show up?  Or is various header info
ignored?

Right.  It only includes these headers in the indexed text:

   X-Real-Life-Name:
   Original-From:
   From:
   Subject:
   Keywords:
   Summary:

Other headers are ignored.  Quoted content is also ignored.  And it also does
its best to ignore lines like "on such and such a date, so and so wrote..."
and sigs.


Very cool! You mention later the prospect of having 'folders' for things
like 'cool sites' or 'space models' or 'great building ideas' for members
using the web-interface... Would this be something user-configurable?

Yes.


Could I create as many folders as I would want, or would there be a set
number of X folders that would be pre-determined?

Unread, Read, Save, and Trash would probably be the only predefined folders.
All the default names would be renamable to whatever name you want.  (I don't
want them to have to be in English.)  You'll prolly be able to set any
bitfields you want on any folder.  Bitfield options would be things like
"this is the trash folder" and "this is an incoming folder."  The Trash folder
would simply be a folder named "Trash" with the "this is a trash folder"
property set.  The Unread folder would simply be a folder named "Unread" that
was configured to receive new articles from some mix of groups you defined.
As soon as you read an article, it would hop to some other folder -- e.g.,
the Read folder or wherever.  The Save folder wouldn't be anything special
-- just a regular folder.  If you deleted something, it would go to the Trash
folder.  If you delete something from the Trash or empty the Trash, it goes
away for good (away from your personal lists, that is...it would still be on
the newsserver, naturally).


Either way (assuming people wouldn't make hundreds of folders for
themselves) it sounds like there's not a space concern (my worst estimates
of millions of posts and thousands of users weren't all that bad
considering).

Ya, with the right data structures, space isn't an issue.  Sparse bitfields
and ID lists are the way to go.

--Todd


And as long as the interface isn't doing something dynamic
with the folders (making little action icons for each folder on article
listing pages, etc), sounds like there wouldn't be a real time strain
either... A very cool idea!

DaveE


Subject: 
Re: News search function reactivated (was: News search function temporarily disabled)
Newsgroups: 
lugnet.admin.general
Date: 
Mon, 5 Feb 2001 01:41:53 GMT
Viewed: 
1303 times
  
In lugnet.admin.general, Todd Lehman writes:
It's actually in the nature of search engines to generate thousands of
results.

If given thousands of results, most search engines have some advanced
options like sorting.

Well, they -are- sorted.  They're always sorted -- always highest probability
of relevance first, lowest last.  Usually, the metric for relevance is a
combination of non-temporal factors such as word frequencies, word
proximities, and word orderings.  I don't know of any search engine that
doesn't sort (on some criteria) the matches it finds.  But anyway, I think
you meant sorting by time?

I wonder if a little link at the top to re-deploy the search taking recentness
into account (or conversely, turning it off if it's on) would be useful?

Todd, this would be really useful. I'll often search for a recent post, only
remembering the poster's name and maybe one or two key-words, and that the
post was in the past few days. I don't need two year old messages nearly as
frequently. Could you change the display so that when results have the same
score more recent posts are displayed first? I think a score weighting for
recentness would be even more useful as part of the default setting --
obviously that would be up to you.

--DaveL

