Subject: 
News search function temporarily disabled
Newsgroups: 
lugnet.admin.general, lugnet.off-topic.geek, lugnet.announce
Followup-To: 
lugnet.admin.general, lugnet.off-topic.geek
Date: 
Mon, 11 Dec 2000 17:34:24 GMT
Highlighted: 
! (details)
Viewed: 
3031 times
  
The system crashed again.  Either it's running out of memory and that's
causing a downward spiral, or it's running out of CPU cycles and starving
processes until enough of them build up to cause a meltdown.  Either way,
some tuning needs to be done.

This is going to be my top LUGNET priority until things stabilize -- they've
been shaky for the past couple of weeks.  I need to find the bottlenecks and
eliminate them.

As a first step, I've turned off the text-search function for news.  I
apologize for any inconvenience this may cause, but it will help me see
how much of a bottleneck it was.  Its algorithm for merging results
is grossly inadequate for some of the queries it processes.

As a second step, I'm going to install a monitoring log that records the
elapsed time and CPU time of every dynamically generated webpage.  From
those results, the bottlenecks will stand out like a fistful of sore thumbs.
(Fixing them is a different matter.)
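A monitoring hook of this sort can be sketched as follows.  This is a minimal
illustration in Python, not the actual script described above; the function
name, log format, and rendering callable are all hypothetical:

```python
# Hypothetical per-page timing wrapper (names and log format are illustrative).
import os
import time

def log_page_timing(logfile, page_name, render):
    """Run a page-rendering callable; append elapsed and CPU time to a log."""
    wall_start = time.time()
    cpu_start = sum(os.times()[:2])        # user + system CPU of this process
    output = render()                      # generate the page
    wall = time.time() - wall_start
    cpu = sum(os.times()[:2]) - cpu_start
    with open(logfile, "a") as f:
        f.write("%s\t%.3f\t%.3f\n" % (page_name, wall, cpu))
    return output
```

Sorting such a log by the CPU column, weighted by request counts, is what makes
the bottlenecks "stand out".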

I'm taking lunch at work right now; I'll be able to install the monitoring
script quickly and then I'll have a look at the stored results later tonight.
I probably won't be able to re-activate the text-search function of news
until after replacing it with something that has better worst-case
performance.  There's a chance I may have to disable other things, like the
Mosaic Maker (if it's being used and contributing to the bottleneck -- I
don't know yet whether it is).

Sometime before the end of December, I plan also to upgrade the RAM on the
system.  In Q1 of 2001, we'll add another physical box as well.

--Todd


Subject: 
Re: News search function temporarily disabled
Newsgroups: 
lugnet.admin.general, lugnet.off-topic.geek
Date: 
Wed, 13 Dec 2000 23:34:43 GMT
Viewed: 
642 times
  
<nod nod> Thanks for letting us know.

I'm no geek by any means, but first things first: I would limit the search.
Sometimes a single search brings up thousands of posts.  I suggest that if
the search is too broad, say so and make the searcher narrow it down a bit.

HTH, and keep up the good work,
-Shiri


Subject: 
Re: News search function temporarily disabled
Newsgroups: 
lugnet.admin.general, lugnet.off-topic.geek
Date: 
Sun, 17 Dec 2000 01:26:09 GMT
Viewed: 
684 times
  
In lugnet.admin.general, Todd Lehman writes:

As a first step, I've turned off the text-search function for news.  I
apologize for any inconvenience this may cause, but it will help me see
how much of a bottleneck it was.  Its algorithm for merging results
is grossly inadequate for some of the queries it processes.

Hi Todd,

Did you ever consider a Google search box?  Like this one:
www.sis.pitt.edu/~dist

I don't know for sure, but I think that the search is done on Google's server.

Toki


Subject: 
Re: News search function temporarily disabled
Newsgroups: 
lugnet.admin.general, lugnet.off-topic.geek
Date: 
Sun, 17 Dec 2000 01:36:12 GMT
Viewed: 
722 times
  
In lugnet.admin.general, Toki Barron writes:
Hi Todd,
Did you ever consider a google search box?  Like this one:
www.sis.pitt.edu/~dist

That's an intriguing thought.  It looks like it requires Google to be able
to crawl a site completely, though.  I wonder if they play well with
dynamically generated content...

I don't know for sure, but I think that the search is done on google's
server.

Yes, it is.

--Todd


Subject: 
Re: News search function temporarily disabled
Newsgroups: 
lugnet.admin.general, lugnet.off-topic.geek
Date: 
Sun, 17 Dec 2000 01:56:53 GMT
Viewed: 
758 times
  
Todd Lehman wrote:

In lugnet.admin.general, Toki Barron writes:
Hi Todd,
Did you ever consider a google search box?  Like this one:
www.sis.pitt.edu/~dist

That's an intriguing thought.  It looks like it requires Google to be able
to crawl a site completely, though.  I wonder if they play well with
dynamically generated content...

I've had nothing but problems with www.hort.net/gallery/, which is all
dynamically generated.  Every month at about the same time we disappear
entirely out of Google's search engine, which they claim is due to updates.
Although it usually only lasts a couple of days, the last one has lasted
about six weeks.  They can't seem to explain why it happens, either.

Anyhow, I guess my point is that I don't find them to be particularly
reliable.

Chris


Subject: 
Re: News search function temporarily disabled
Newsgroups: 
lugnet.admin.general
Date: 
Sun, 17 Dec 2000 17:34:18 GMT
Viewed: 
776 times
  
In lugnet.admin.general, Todd Lehman writes:
[...]
As a second step, I'm going to install a monitoring log that records the
elapsed time and CPU time of every dynamically generated webpage.  From
those results, the bottlenecks will stand out like a fistful of sore thumbs.
(Fixing them is a different matter.)
[...]

Certain bottlenecks are indeed standing out like sore thumbs.  One big one is
the dynamic generation of the /shop/ pages on guide.lugnet.com, and another
is the dynamic generation of member-specific set lists.  An even bigger one
is (was) the dynamic generation of member listings (people) such as:

   http://www.lugnet.com/people/members/byfirstname.cgi

That was taking several seconds of CPU and in some cases 200+ seconds of
transmission time to people on slow modems.  I can't do anything about the
transmission time without reducing the page size, but for now I changed it
to cache the content for faster page generation.  Holding a large chunk of
data in memory for 200 elapsed-time seconds while generating a page is bad,
even if only a fraction of a CPU second is used.
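The caching change described here can be sketched roughly as below.  The file
path, staleness window, and function names are assumptions for illustration,
not LUGNET's actual setup:

```python
# Illustrative sketch: serve a pre-rendered copy of an expensive page and
# regenerate it only when the cached copy is older than max_age_seconds.
import os
import time

def cached_page(cache_path, max_age_seconds, render):
    """Return cached content if fresh; otherwise regenerate and store it."""
    try:
        if time.time() - os.path.getmtime(cache_path) < max_age_seconds:
            with open(cache_path) as f:
                return f.read()
    except OSError:
        pass                               # no cached copy yet
    content = render()                     # expensive generation step
    with open(cache_path, "w") as f:
        f.write(content)
    return content
```

The win is that the expensive generation happens once per staleness window
instead of once per request, regardless of how slowly clients download.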

Another bottleneck is that there seems to be a strong correlation between the
CPU time to generate a page for a LEGO set and the number of people who've
given comments or supplied personal data about the set.  For example, this
page:

   http://guide.lugnet.com/set/7130

consumes 3 times as many CPU cycles as this page:

   http://guide.lugnet.com/set/1462

or about 50 times as many CPU cycles above and beyond the regular overhead
for a set page.

A quick gnuplot graph with number of people on the x-axis and CPU time on
the y-axis showed a nearly straight line correlation.  That's good info for
me to know.  It ought to be a gradual logarithmic-looking curve rather
than a straight line.
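The straight-line check can be reproduced with an ordinary least-squares fit.
The data points below are invented purely to illustrate the method; the post
gives no raw numbers:

```python
# Illustrative least-squares fit, mirroring the gnuplot check described above.
def fit_line(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical (people, CPU-seconds) samples lying on cpu = 0.02 * p + 0.1:
people = [1, 5, 10, 20, 50]
cpu = [0.02 * p + 0.1 for p in people]
slope, intercept = fit_line(people, cpu)
```

A near-constant slope across the whole range is exactly what a "nearly straight
line" on the plot means: cost grows linearly with the number of contributors.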

This isn't anyone's fault[1] and it doesn't mean that people need to stop
entering data -- it just means there's an identifiable inefficiency which
I need to find a way to eliminate.  A bottleneck like this won't get better
with more RAM or a faster CPU -- it's a coding issue in need of algorithmic
adjustments rather than quantitative system tuning.

Some of these bottlenecks amount to a significant portion of the CPU-cycles
pie.  For example, the member listings consume(d) significant CPU cycles but
aren't accessed anywhere near as often as the set guide pages; the bottleneck
in the set guide pages is small, but adds up to a lot of wasted CPU cycles
overall.  Across all dynamically generated pages served, the single-set-view
guide pages together account for 23% and multi-set-view guide pages together
account for 18%.  I estimate that one to two hours per day of CPU cycles are
currently being wasted due to this bottleneck alone (based on a desired
page-generation time of 0.2 seconds (average) per page).
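The arithmetic behind an estimate like this is simple to reproduce.  The
request count and average CPU time below are made-up placeholders chosen only
to land in the stated range, since the post doesn't give the real figures:

```python
# Back-of-envelope check of the "one to two hours per day" estimate.
# guide_views_per_day and avg_cpu_per_page are ASSUMED values for illustration;
# only the 0.2-second target comes from the post.
guide_views_per_day = 24_000            # assumed daily request count
avg_cpu_per_page = 0.4                  # assumed average CPU seconds per page
target_cpu_per_page = 0.2               # stated target in the post

wasted_seconds = (avg_cpu_per_page - target_cpu_per_page) * guide_views_per_day
wasted_hours = wasted_seconds / 3600
```

With these placeholder figures, the excess comes to 4800 CPU-seconds, about
1.3 hours per day, consistent with the "one to two hours" range above.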

The news homepage <http://news.lugnet.com/> is also a bottleneck, averaging
a very embarrassing 0.537 seconds of CPU cycles per invocation, but in the
end that amounts to only about 1000 seconds of CPU time per day, so it's
not a true bottleneck at its current rate of requests (~2000 page views per
day).  It *could* turn out to be a bottleneck someday if usage during peak
hours went up significantly, but for now it's not a real issue.

The main LUGNET homepage <http://www.lugnet.com/> sees two to three times
as much traffic as the LUGNET News homepage but is also three times more
efficient in terms of memory and CPU cycles, so I'm not even remotely
worried about it yet.  Everything else falls into smaller groups of page types.

The biggies for now are the set guide pages and, of course, bringing back
the news search functionality.  I think I can live with the inefficiencies
of the set guide pages for a few more weeks, because their worst-case
performance is still deterministic.  The worst-case performance of the
news-search function, however, was non-deterministic.

I think I know of a way to make the search much more memory-efficient and
later more time-efficient without radically ripping things apart.  If a
miracle occurs, the news search function might be back sometime this week.

--Todd

[1] except mine, of course


Subject: 
Re: News search function temporarily disabled
Newsgroups: 
lugnet.admin.general
Date: 
Sun, 17 Dec 2000 23:21:23 GMT
Viewed: 
759 times
  
Todd Lehman wrote:
Certain bottlenecks are indeed standing out like sore thumbs.  One big one is
the dynamic generation of the /shop/ pages on guide.lugnet.com, and another
is the dynamic generation of member-specific set lists.  An even bigger one
is (was) the dynamic generation of member listings (people) such as:

   http://www.lugnet.com/people/members/byfirstname.cgi

That was taking several seconds of CPU and in some cases 200+ seconds of
transmission time to people on slow modems.  I can't do anything about the
transmission time without reducing the page size, but for now I changed it
to cache the content for faster page generation.  Holding a large chunk of
data in memory for 200 elapsed-time seconds while generating a page is bad,
even if only a fraction of a CPU second is used.

How about breaking down the member pages into pages split by first
letter of first or last name, and by country/state? I'm not sure if your
"caching" means keeping the indexes by each sort but it would seem this
need not be true dynamic content. You don't get that many updates of
that content.

Another bottleneck is that there seems to be a strong correlation between the
CPU time to generate a page for a LEGO set and the number of people who've
given comments or supplied personal data about the set.  For example, this
page:

   http://guide.lugnet.com/set/7130

consumes 3 times as many CPU cycles as this page:

   http://guide.lugnet.com/set/1462

or about 50 times as many CPU cycles above and beyond the regular overhead
for a set page.

A quick gnuplot graph with number of people on the x-axis and CPU time on
the y-axis showed a nearly straight line correlation.  That's good info for
me to know.  It ought to be a gradual logarithmic-looking curve rather
than a straight line.

This isn't anyone's fault[1] and it doesn't mean that people need to stop
entering data -- it just means there's an identifiable inefficiency which
I need to find a way to eliminate.  A bottleneck like this won't get better
with more RAM or a faster CPU -- it's a coding issue in need of algorithmic
adjustments rather than quantitative system tuning.

How about making it optional if the personal data is included? Most
times I don't need it if my purpose for looking up the set is just to
see what the heck is that set. You can have a button on the bottom of
the page which says "show comments and personal data".

The biggies for now are the set guide pages and, of course, bringing back
the news search functionality.  I think I can live with the inefficiencies
of the set guide pages for a few more weeks, because their worst-case
performance is still deterministic.  The worst-case performance of the
news-search function, however, was non-deterministic.

I think I know of a way to make the search much more memory-efficient and
later more time-efficient without radically ripping things apart.  If a
miracle occurs, the news search function might be back sometime this week.

Could you conveniently reduce the cost of the news search by offering an
option to "limit search to the previous X posts" (where you pick one or
perhaps a handful of values for X, preferably defined by time span rather
than by post count; i.e., "limit search to the past week" would be
preferable to "limit search to the past 2000 posts")?  Such search
constraints could certainly be useful and might constrain the search
enough to provide a serious performance improvement.

Frank


Subject: 
Re: News search function temporarily disabled
Newsgroups: 
lugnet.admin.general
Date: 
Mon, 18 Dec 2000 00:37:38 GMT
Viewed: 
856 times
  
On Sun, Dec 17, 2000 at 11:21:23PM +0000, Frank Filz wrote:
Todd Lehman wrote:
Certain bottlenecks are indeed standing out like sore thumbs.  One big one is
the dynamic generation of the /shop/ pages on guide.lugnet.com, and another
is the dynamic generation of member-specific set lists.  An even bigger one
is (was) the dynamic generation of member listings (people) such as:

   http://www.lugnet.com/people/members/byfirstname.cgi

That was taking several seconds of CPU and in some cases 200+ seconds of
transmission time to people on slow modems.  I can't do anything about the
transmission time without reducing the page size, but for now I changed it
to cache the content for faster page generation.  Holding a large chunk of
data in memory for 200 elapsed-time seconds while generating a page is bad,
even if only a fraction of a CPU second is used.

How about breaking down the member pages into pages split by first
letter of first or last name, and by country/state? I'm not sure if your
"caching" means keeping the indexes by each sort but it would seem this
need not be true dynamic content. You don't get that many updates of
that content.

I think it's more of a problem of having to have the page in memory (all
300k of it) while a slow client downloads it.  Nothing much to do about
that except make the page smaller - perhaps move all the FONT FACE entries
to an all-encompassing FONT tag - that should reduce the page size by about
10% or so...

Another bottleneck is that there seems to be a strong correlation between the
CPU time to generate a page for a LEGO set and the number of people who've
given comments or supplied personal data about the set.  For example, this
page:

   http://guide.lugnet.com/set/7130

consumes 3 times as many CPU cycles as this page:

   http://guide.lugnet.com/set/1462

or about 50 times as many CPU cycles above and beyond the regular overhead
for a set page.

A quick gnuplot graph with number of people on the x-axis and CPU time on
the y-axis showed a nearly straight line correlation.  That's good info for
me to know.  It ought to be a gradual logarithmic-looking curve rather
than a straight line.

This isn't anyone's fault[1] and it doesn't mean that people need to stop
entering data -- it just means there's an identifiable inefficiency which
I need to find a way to eliminate.  A bottleneck like this won't get better
with more RAM or a faster CPU -- it's a coding issue in need of algorithmic
adjustments rather than quantitative system tuning.

How about making it optional if the personal data is included? Most
times I don't need it if my purpose for looking up the set is just to
see what the heck is that set. You can have a button on the bottom of
the page which says "show comments and personal data".

Also, this data isn't that dynamic anyway - couldn't a static page be
autogenerated every time a member updates their entry?

The biggies for now are the set guide pages and, of course, bringing back
the news search functionality.  I think I can live with the inefficiencies
of the set guide pages for a few more weeks, because their worst-case
performance is still deterministic.  The worst-case performance of the
news-search function, however, was non-deterministic.

I think I know of a way to make the search much more memory-efficient and
later more time-efficient without radically ripping things apart.  If a
miracle occurs, the news search function might be back sometime this week.

Could you conveniently reduce the cost of the news search by offering an
option to "limit search to the previous X posts" (where you pick one or
perhaps a handful of values for X, preferably defined by time span rather
than by post count; i.e., "limit search to the past week" would be
preferable to "limit search to the past 2000 posts")?  Such search
constraints could certainly be useful and might constrain the search
enough to provide a serious performance improvement.

I agree that having a "search X days back" option is a good function...  Also,
I don't know how the search is run anyway - is it indexed beforehand, or
run on the full data every time?

--
Dan Boger / dan@peeron.com / www.peeron.com / ICQ: 1130750
<set:9265_1>:  LEGO DACTA Roof Tiles (DACTA/SYSTEM/Supplemental), '98, 250 pcs


Subject: 
Re: News search function temporarily disabled
Newsgroups: 
lugnet.admin.general
Date: 
Mon, 18 Dec 2000 00:43:46 GMT
Viewed: 
803 times
  
In lugnet.admin.general, Dan Boger writes:
I agree that having a "search X days back" option is a good function...  Also,
I don't know how the search is run anyway - is it indexed beforehand, or
run on the full data every time?

It's (currently) a half-gigabyte index that gets indexed in realtime (once per
minute).
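An index maintained this way might look, in miniature, like the sketch below.
The dictionary structure and tokenization are assumptions for illustration;
the actual half-gigabyte on-disk format isn't described here:

```python
# Minimal sketch of incremental inverted indexing: new articles are folded
# into an existing word -> sorted-article-id index once per minute.
import re

def index_articles(index, articles):
    """Fold new (article_id, text) pairs into an inverted index, in place."""
    for art_id, text in articles:
        # index each distinct word in the article once
        for word in set(re.findall(r"[a-z0-9]+", text.lower())):
            index.setdefault(word, []).append(art_id)
    return index
```

Because only new articles are touched, the per-minute update cost stays
proportional to new traffic rather than to the size of the whole corpus.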

--Todd


Subject: 
Re: News search function temporarily disabled
Newsgroups: 
lugnet.admin.general
Date: 
Mon, 18 Dec 2000 01:08:18 GMT
Viewed: 
779 times
  
Dan Boger wrote:

On Sun, Dec 17, 2000 at 11:21:23PM +0000, Frank Filz wrote:
Todd Lehman wrote:
Certain bottlenecks are indeed standing out like sore thumbs.  One big one is
the dynamic generation of the /shop/ pages on guide.lugnet.com, and another
is the dynamic generation of member-specific set lists.  An even bigger one
is (was) the dynamic generation of member listings (people) such as:

   http://www.lugnet.com/people/members/byfirstname.cgi

That was taking several seconds of CPU and in some cases 200+ seconds of
transmission time to people on slow modems.  I can't do anything about the
transmission time without reducing the page size, but for now I changed it
to cache the content for faster page generation.  Holding a large chunk of
data in memory for 200 elapsed-time seconds while generating a page is bad,
even if only a fraction of a CPU second is used.

How about breaking down the member pages into pages split by first
letter of first or last name, and by country/state? I'm not sure if your
"caching" means keeping the indexes by each sort but it would seem this
need not be true dynamic content. You don't get that many updates of
that content.

I think it's more of a problem of having to have the page in memory (all
300k of it) while a slow client downloads it.  Nothing much to do about
that except make the page smaller - perhaps move all the FONT FACE entries
to an all encompassing FONT tag - that should reduce the page size by about
10% or so...

My idea was that splitting the list into a page for each letter (or perhaps
groups of 2-4 letters) or by country/state would cut the page size down
dramatically.  Almost every time I've gone to look at the list, I'm either
looking for a specific person or looking for which members are from a
specific state.

Frank


Subject: 
Re: News search function temporarily disabled
Newsgroups: 
lugnet.admin.general
Date: 
Mon, 18 Dec 2000 01:11:23 GMT
Viewed: 
860 times
  
Todd Lehman wrote:

In lugnet.admin.general, Dan Boger writes:
I agree that having a "search X days back" option is a good function...  Also,
I don't know how the search is run anyway - is it indexed beforehand, or
run on the full data every time?

It's (currently) a half-gigabyte index that gets indexed in realtime (once per
minute).

How long does it take to generate the index? Can you subdivide the index
at all (and do something like backups, where you keep incremental indices
and regenerate the complete index once a day or once an hour)? If some
sort of incremental index was used, could you gain anything by not
running it until a search is done? Just some thoughts, but probably not
all applicable, depending on how it's done.

Frank


Subject: 
Re: News search function temporarily disabled
Newsgroups: 
lugnet.admin.general
Date: 
Mon, 18 Dec 2000 01:19:42 GMT
Viewed: 
841 times
  
In lugnet.admin.general, Dan Boger writes:

I think it's more of a problem of having to have the page in memory (all
300k of it) while a slow client downloads it.  Nothing much to do about
that except make the page smaller - perhaps move all the FONT FACE entries
to an all-encompassing FONT tag - that should reduce the page size by about
10% or so...

What about using a style sheet? Would that help? I admit that I am not sure
exactly which versions of the big two browsers correctly support style
sheets (I ought to know this, since I have been using them a lot in recent
web development projects)...

++Lar


Subject: 
Re: News search function temporarily disabled
Newsgroups: 
lugnet.admin.general
Date: 
Mon, 18 Dec 2000 01:57:04 GMT
Viewed: 
1073 times
  
In lugnet.admin.general, Frank Filz writes:
How long does it take to generate the index?

Zero time.  It's done continuously as a background process.  Once per minute,
any new article is added into the mix.


Can you subdivide the index
at all (and do something like backups, where you keep incremental indices
and regenerate the complete index once a day or once an hour)? If some
sort of incremental index was used, could you gain anything by not
running it until a search is done? Just some thoughts, but probably not
all applicable, depending on how it's done.

This kind of index is more efficient to do as soon as something new appears,
as opposed to a kind of index where content changes periodically and
reindexing is a painful hit.  It's not the indexing that's currently the
bottleneck -- it's the way the query interpreter uses the data from the index
that's the bottleneck.
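For context on what "merging results" involves: an AND query typically
intersects the sorted postings lists of its terms, and walking the lists in
lockstep keeps the cost linear in their combined length, whereas a naive
approach that materializes and re-sorts everything can blow up on worst-case
queries.  This is a generic sketch, not LUGNET's actual query interpreter:

```python
# Intersect two ascending lists of article ids in O(len(a) + len(b)).
def intersect_postings(a, b):
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])          # id present in both lists
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1                    # advance the list that is behind
        else:
            j += 1
    return out
```

Multi-term queries repeat this pairwise, smallest lists first, so the working
set shrinks at each step.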

--Todd


Subject: 
Re: News search function temporarily disabled
Newsgroups: 
lugnet.admin.general
Date: 
Mon, 18 Dec 2000 02:44:26 GMT
Viewed: 
896 times
  
In lugnet.admin.general, Larry Pieniazek writes:
I think it's more of a problem of having to have the page in memory (all
300k of it) while a slow client downloads it.  Nothing much to do about
that except make the page smaller - perhaps move all the FONT FACE entries
to an all-encompassing FONT tag - that should reduce the page size by about
10% or so...

What about using a style sheet? Would that help? I admit that I am not sure
exactly which versions of the big two browsers correctly support style
sheets (I ought to know this, since I have been using them a lot in recent
web development projects)...

Yikes!

Holy speculation, Batman!

Thanks, but no thanks!  Guys, this is nice and all, but you're guessing way
off in left field.  On the pages I was talking about, the actual webpage is
not held in memory -- ever -- it's written directly to a buffered, blocking
I/O socket as soon as it's generated.  It's not stored anywhere on the server
-- not on disk, not in memory, even temporarily.  What I was referring to as
being held in memory for 200 seconds was some non-HTML data structures.
Their size isn't insane, but it is uncomfortable and somewhat wasteful.
Breaking that bottleneck has about as much to do with style sheets as
a floormat has to do with salad dressing.

--Todd


Subject: 
Re: News search function temporarily disabled
Newsgroups: 
lugnet.admin.general
Date: 
Mon, 18 Dec 2000 03:00:42 GMT
Viewed: 
917 times
  
In lugnet.admin.general, Todd Lehman writes:
Breaking that bottleneck has about as much to do with style sheets as
a floormat has to do with salad dressing.

No need to get snippy. We're just trying to help.

We don't have access to the source or even the design documentation so all
we can do is speculate based on what you say, and if our speculation is wide
of the mark, so be it. It's not our fault. (whether it's *appropriate* for
us to have access to the source is an entirely different question, I
continue to lean in the "no it isn't" direction...)

I bet style sheets would still cut down the volume of data transmitted (a
bit) and the computational complexity of page generation (a fair bit),
though.

++Lar


Subject: 
Re: News search function temporarily disabled
Newsgroups: 
lugnet.admin.general
Date: 
Mon, 18 Dec 2000 03:44:47 GMT
Viewed: 
963 times
  
In lugnet.admin.general, Larry Pieniazek writes:
In lugnet.admin.general, Todd Lehman writes:
Breaking that bottleneck has about as much to do with style sheets as
a floormat has to do with salad dressing.

No need to get snippy.

I'm sorry.  I know.  You're right.

It just always happens.  Someone says something vague, someone else guesses
something about it, someone else adds another guess, someone else adds another
guess, and before you know it, everything's way off track.  I hate it when
that happens and I don't have time to jump in and respond until it's too late.


We're just trying to help.

Welp...don't take this the wrong way, but my post earlier was simply to give
an update.  It lacked detail because I wasn't asking for help.  If someone
was here FTF and we had a whiteboard and lots of time to talk and stuff, that
would totally rock, but a news discussion filled with speculations is just a
distraction for me.  (Please don't take that the wrong way.  I appreciate the
kind thoughts and attempts at helping, but they don't always help.)

I feel like I've got a solid handle on most of the bottleneck issues.  They
mostly are SMOPs at this point.  I may need to begin using mod_perl or
FastCGI, or restructure portions using a set of custom back-end special-
purpose HTTP servers proxied on the localhost, to really eke out the last
drops, but big gains can be had with other more immediate simplifications.


We don't have access to the source or even the design documentation so all
we can do is speculate based on what you say, and if our speculation is wide
of the mark, so be it.  It's not our fault.

No, it isn't your fault.  I just got disgruntled because the discussion
was getting into issues that are great for the .publish group but no longer
have anything to do with actual CPU efficiency (which was the issue).

--Todd


Subject: 
Re: News search function temporarily disabled
Newsgroups: 
lugnet.admin.general
Date: 
Mon, 18 Dec 2000 05:10:25 GMT
Viewed: 
958 times
  
Todd Lehman wrote:
It just always happens.  Someone says something vague, someone else guesses
something about it, someone else adds another guess, someone else adds another
guess, and before you know it, everything's way off track.  I hate it when
that happens and I don't have time to jump in and respond until it's too late.

Sorry for having contributed to wasted bandwidth, though I do hope you
will give some thought to the ideas of splitting up the members list as
I had suggested, and to some kind of "recentness" limit on the search.
I realize that the underlying data structures may not let those operations
save CPU time (and they could possibly be more time consuming, depending
on implementation); I just feel they would be useful features for us, the
users.

Frank


Subject: 
Re: News search function temporarily disabled
Newsgroups: 
lugnet.admin.general
Date: 
Mon, 18 Dec 2000 06:08:07 GMT
Viewed: 
986 times
  
In lugnet.admin.general, Frank Filz writes:
Sorry for having contributed to wasted bandwidth, though I do hope you
will give some thought to the ideas of splitting up the members list as
I had suggested,

Yes, definitely!  And the flags can link to country pages that list just the
people in the country.

and to some kind of "recentness" limit on the search.  I realize that the
underlying data structures may not let those operations save CPU time (and
they could possibly be more time consuming, depending on implementation);
I just feel they would be useful features for us, the users.

It won't incur extra overhead (amazingly, it will actually reduce
overhead) to restrict things to hard or soft time ranges, so I'm going ahead
with this:

http://news.lugnet.com/admin/general/?n=2701

because it's an excellent opportunity to take that step.  The guts have been
there since August, 1999 -- it just needs to be passed a pair of time range
parameters, which means making an advanced search page with some widgets for
selecting times.
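A hard time-range restriction of the kind described can be sketched
generically.  The names and data shapes below are assumptions; the real
engine presumably applies the range inside the index rather than as a
post-filter:

```python
# Keep only search hits whose article timestamp falls inside [t_start, t_end].
# 'timestamps' maps article id -> posting time (e.g., seconds since the epoch).
def restrict_to_range(hits, timestamps, t_start, t_end):
    return [h for h in hits if t_start <= timestamps[h] <= t_end]
```

Applied early, a restriction like this shrinks the candidate set before any
expensive ranking work, which is why it can reduce rather than add overhead.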

There may also be an implicit improvement in word proximity matching (it will
automatically favor close-proximity multiword matches over disparate-proximity
multiword matches).  Although I don't forsee adding anything anytime soon as
rigorous as a "this exact string" sort of thing, proximity matching comes
pretty darn close to that in pratice.  (LUGNET's news search engine had this
up until August of 1999.  Unfortunately, it used a lot of memory, and I
removed it while applying other improvements at that time.  I believe I'll be
able to add it back in without incurring a large memory or CPU time penalty if
I interleave the multiword results carefully.)
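One common way to score proximity, shown purely as an illustration (the post
doesn't specify the actual algorithm), is to find the smallest window of word
positions that covers one occurrence of each query word; smaller windows rank
higher:

```python
# Smallest window (in word positions) covering one position from each list.
# pos_lists holds, per query word, the ascending positions where it occurs
# in one document.  Interleaves the lists with a heap, as alluded to above.
import heapq

def min_span(pos_lists):
    # seed the heap with the first occurrence of each word
    heads = [(lst[0], k, 0) for k, lst in enumerate(pos_lists)]
    heapq.heapify(heads)
    hi = max(p for p, _, _ in heads)
    best = hi - min(p for p, _, _ in heads)
    while True:
        lo, k, idx = heapq.heappop(heads)   # word occurrence furthest left
        best = min(best, hi - lo)
        if idx + 1 == len(pos_lists[k]):
            return best                     # that word has no later occurrence
        nxt = pos_lists[k][idx + 1]
        hi = max(hi, nxt)
        heapq.heappush(heads, (nxt, k, idx + 1))
```

The cost is linear-logarithmic in the total number of occurrences, which fits
the goal of adding proximity back without a large memory or CPU penalty.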

--Todd


Subject: 
News search function reactivated (was: News search function temporarily disabled)
Newsgroups: 
lugnet.admin.general, lugnet.off-topic.geek, lugnet.announce
Followup-To: 
lugnet.admin.general, lugnet.off-topic.geek
Date: 
Tue, 2 Jan 2001 07:46:46 GMT
Highlighted: 
!! (details)
Viewed: 
5916 times
  
The LUGNET News search function is now re-enabled.  I completely revamped
the index data structures and list-merge algorithm and rewrote the core
query engine in C.  It's a much more solid implementation.

Everyone's patience during the outage is much appreciated!


Functional improvements:

* Word proximity sensitivity -- two or more words closer together match better
  than the same words far apart.

* Word-order sensitivity -- words in a specific order match better than the
  same words out of order.  For example, "new lego" returns different matches
  than "lego new" (try it!).

* And you can still prefix words with + or - to require inclusion or
  exclusion, respectively.
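The +/- convention can be parsed along these lines.  This is a minimal
sketch; the real parser's behavior on edge cases (bare "+", quoting, etc.)
is not described here:

```python
# Split a query string into required (+word), excluded (-word), and
# optional terms, per the prefix convention described above.
def parse_query(q):
    required, excluded, optional = [], [], []
    for tok in q.split():
        if tok.startswith("+") and len(tok) > 1:
            required.append(tok[1:])
        elif tok.startswith("-") and len(tok) > 1:
            excluded.append(tok[1:])
        else:
            optional.append(tok)
    return required, excluded, optional
```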


Cosmetic improvements:

* Graphical horizontal bar showing match rankings.  (Close matches appear
  with wider bars than lesser matches.)

* More streamlined and easy-to-read results header.


Internal improvements:

* Approximately 100 times faster (once the query engine receives the
  request).  Typical CPU utilization is less than 0.1 seconds even for
  queries that generate tens of thousands of word hits.  (Actual times
  may vary depending on disk activity as word-hit lists are accessed.)

* The query engine can take any arbitrary list of news articles as a search
  filter (include or exclude).  This is how subgroup-searches are handled
  now and will pave the way for cooler things later.


To do soon:

* Implement date range restrictions on searches.  Currently the search covers
  the entire corpus of documents and assigns equal date-weight to all documents
  regardless of age.  This is actually working now at the inner levels and
  at the URL level, but there is not yet a forms-based "advanced search"
  user interface for specifying a target date and proximity.


To do someday:

* Facilitate searching within specific threads.  This is a low-level data
  list issue.

* Facilitate searching within arbitrary collections of groups (as opposed to
  a single group or group hierarchy).  This is mostly a user-interface issue.

* Facilitate searching within search results (i.e., "search only within these
  results below").

* Rework the text indexer so that it doesn't throw out "funny characters" in
  words and then reindex the entire document corpus from scratch.  (Currently,
  words like "won't" are converted to "wont" and words like "S@H" are ignored
  entirely.)

* Filter out canceled articles from returned results.
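The "funny characters" indexer item above describes behavior that can be
approximated like this.  It is a rough sketch of the described behavior
(apostrophes stripped, other odd characters causing the word to be dropped),
not the actual indexer code.

```python
def tokenize_current(text):
    """Approximate the current indexer behavior: apostrophes are stripped
    inside words ("won't" -> "wont"), and tokens still containing other
    non-alphanumeric characters ("S@H") are ignored entirely."""
    tokens = []
    for raw in text.lower().split():
        word = raw.replace("'", "")
        if word.isalnum():
            tokens.append(word)
        # anything with leftover "funny characters" is thrown out
    return tokens

print(tokenize_current("I won't buy S@H sets"))
```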

--Todd


p.s.  The article index database currently contains more than 10,000,000
word-hits.


Subject: 
Re: News search function reactivated (was: News search function temporarily disabled)
Newsgroups: 
lugnet.admin.general, lugnet.off-topic.geek
Date: 
Tue, 2 Jan 2001 11:04:49 GMT
Viewed: 
882 times
  
In lugnet.admin.general, Todd Lehman writes:
The LUGNET News search function is now re-enabled.  I completely revamped
the index data structures and list-merge algorithm and rewrote the core
query engine in C.  It's a much more solid implementation.

Very good!  Although the new search doesn't return most recent articles first
like it used to.  Is that how it should work?  Now I can't see most recent
posts that contain the keyword I want to search for, which makes the search
pretty much useless if it returns thousands of results.

- Dan


Subject: 
Re: News search function reactivated (was: News search function temporarily disabled)
Newsgroups: 
lugnet.admin.general
Date: 
Tue, 2 Jan 2001 14:01:42 GMT
Viewed: 
1081 times
  
In lugnet.admin.general, Todd Lehman writes:
To do someday:

* Facilitate searching within specific threads.  This is a low-level data
list issue.

* Facilitate searching within arbitrary collections of groups (as opposed to
a single group or group hierarchy).  This is mostly a user-interface issue.

* Facilitate searching within search results (i.e., "search only within these
results below").

* Rework the text indexer so that it doesn't throw out "funny characters" in
words and then reindex the entire document corpus from scratch.  (Currently,
words like "won't" are converted to "wont" and words like "S@H" are ignored
entirely.)

* Filter out canceled articles from returned results.

Could I suggest some amendments to the "To do someday" list (for an
'advanced search' only)?

- Search by author
- Search by subject line contents
- Search by date range (or open-ended-- i.e. after date X or before date X)
- Search for articles containing a URL
- option to ONLY return the heads of threads
- And of course search by quoted string (i.e. +quack +"foo bar")

Just some wishful thinking...

DaveE


Subject: 
Re: News search function reactivated (was: News search function temporarily disabled)
Newsgroups: 
lugnet.admin.general, lugnet.off-topic.geek
Date: 
Tue, 2 Jan 2001 17:31:23 GMT
Viewed: 
1052 times
  
Todd Lehman wrote:
* Implement date range restrictions on searches.  Currently searches entire
  corpus of documents and assigns equal date-weight to all documents
  regardless of age.  This is actually working now at the inner levels and
  at the URL level, but there is not yet a forms-based "advanced search"
  user interface for specifying a target date and proximity.

Could you tell us the URL syntax for those of us willing to modify URLs?

--
Frank Filz

-----------------------------
Work: mailto:ffilz@us.ibm.com (business only please)
Home: mailto:ffilz@mindspring.com


Subject: 
Re: News search function reactivated (was: News search function temporarily disabled)
Newsgroups: 
lugnet.admin.general, lugnet.off-topic.geek
Date: 
Tue, 2 Jan 2001 17:38:23 GMT
Viewed: 
4069 times
  
In lugnet.admin.general, Dan Jezek writes:
In lugnet.admin.general, Todd Lehman writes:
The LUGNET News search function is now re-enabled.  I completely revamped
the index data structures and list-merge algorithm and rewrote the core
query engine in C.  It's a much more solid implementation.

Very good!  Although the new search doesn't return most recent articles
first like it used to.  Is that how it should work?

Not yet, no.  See "to-do soon" section in previous post.


Now I can't see most recent
posts that contain the keyword I want to search for,

For now, if you don't mind URL mucking, you can manually append

   &qs=<number>

to the URL and it will use that number (in seconds) as a time delta.  For
example, to limit posts to the last 10 days, use

   &qs=864000

Younger matches will tend toward the top and older matches will tend toward
the bottom.  Anything older than 10 days (in the example above) will be
excluded.
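The described `&qs=` behavior amounts to a cutoff plus an age-based ordering,
roughly as follows.  This is an illustrative sketch: `apply_qs_filter` and
the `(article_id, posted_epoch)` shape are assumptions, not the real
engine's interface.

```python
import time

def apply_qs_filter(results, qs_seconds, now=None):
    """Sketch of &qs=<seconds>: exclude anything older than the time delta
    and sort the survivors youngest-first."""
    now = time.time() if now is None else now
    cutoff = now - qs_seconds
    kept = [(aid, ts) for aid, ts in results if ts >= cutoff]
    return sorted(kept, key=lambda item: item[1], reverse=True)

now = 1_000_000
results = [(1, now - 950_000), (2, now - 100), (3, now - 500_000)]
# &qs=864000 is 10 days' worth of seconds, so article 1 falls outside it
print(apply_qs_filter(results, 864000, now=now))
```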


which makes the
search pretty much useless if it returns thousands of results.

It's actually in the nature of search engines to generate thousands of
results.  What's more important is the first page returned -- i.e., the
ranking.  Typically one doesn't dig down past the first few, so you rarely
actually go visit all the thousands.  If the results are ranked according
to some time criteria, there'll still be thousands of results, except for
super-restrictive time criteria.

--Todd


Subject: 
Re: News search function reactivated (was: News search function temporarily disabled)
Newsgroups: 
lugnet.admin.general, lugnet.off-topic.geek
Date: 
Tue, 2 Jan 2001 18:27:44 GMT
Viewed: 
1751 times
  
In lugnet.admin.general, Todd Lehman writes:
The LUGNET News search function is now re-enabled...

  <snip>
To do soon:
* Implement date range restrictions on searches.

In lugnet.admin.general, Dan Jezek writes:
...which makes the search pretty much useless...

I would have to concur.

For now, if you don't mind URL mucking, you can manually append
  &qs=<number> to the URL and it will use that number (in seconds)
as a time delta.  For example, to limit posts to the last 10 days,
use   &qs=864000

< goes away and tries it... >

Umm - Would it not make sense to simply include the appropriate
qualifier on the system side?  (I tried it and got two year old
results for "qs", but I'm probably doing something wrong.)

  <snip>
To do someday:
* Facilitate searching within search results

This would be a very useful feature - just like large search "engines"

SRC


Subject: 
Re: News search function reactivated (was: News search function temporarily disabled)
Newsgroups: 
lugnet.admin.general
Date: 
Tue, 2 Jan 2001 18:29:14 GMT
Viewed: 
1010 times
  
In lugnet.admin.general, David Eaton writes:
Could I suggest some amendments to the "To do someday" list (for an
'advanced search' only)?

Sure!


- Search by author
- Search by subject line contents

These are mostly covered already due to the way the indexing works -- words
closer to the beginning of a document are given higher weights than words
occurring later in a document.  When the indexer chews on a news article,
it first grabs the 'From:' line and then the 'Subject:' line, and finally
the body.  Somewhere in there, it also grabs the 'Keywords:' and 'Summary:'
headers, if present.

I know what you mean, though, about being able to restrict a search to
_specifically_ some exact subject or author.  I'll think about how I might
be able to handle this in the future -- it would be a separate index database
for each of the two fields.

Maybe you'll later be able to make search queries like this...

   [medieval brikwars] (david eaton) tan baseplates

...which might mean "among articles with subjects matching 'medieval
brikwars', find articles posted by names matching 'david eaton', and
within those, find articles matching 'tan baseplates' in the body."

Is that the sort of functionality you're looking for?


- And of course search by quoted string (i.e. +quack +"foo bar")

This is also largely covered already by word-proximity sensitivity and
word-order sensitivity -- although I definitely agree that it would be
nice for super-refined searches.  I'll think more about it.  I think I
know now of a way to do the "" thing efficiently.  (I didn't months ago.)


- Search by date range (or open-ended-- i.e. after date X or before date X)

OK, I just put this in during lunch...try it out...  It's not a specific
date, but relative dates...

If you search for

   david eaton <10

then it'll show you things matching "david eaton" that were posted "about 10
days ago" (plus or minus 10 days -- with higher matches given to those that
are closer to the 10-days-ago-point).

Alternatively, if you search for

   david eaton <10 >1

then it'll show you things that were posted about 10 days ago, plus or minus
1 day.

If you want to go back farther, you can give days/months/years, for example

   david eaton <15/6/1

would look for (favor) things that were posted 15 days, 6 months, and 1 year
ago.  To limit the range to that time period, plus or minus one week, do

   david eaton <15/6/1 >7

I'm thinkin' this'll be quite useful for digging up stuff that's "about a
week ago" or "about 3 months ago" or "oh, about 2 years ago" -- when it's
tough to remember an exact date.

   fractals are amazing <7
   building a bicycle wheel <0/3
   sensors and methods for mobile robot positioning <//2
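Parsing the `<days/months/years` form above might look roughly like this.
It is a sketch with assumed 30-day months and 365-day years; the actual
engine's conversion may well differ.

```python
def parse_relative_date(term):
    """Parse a '<days/months/years' term into an approximate number of
    days ago.  Empty fields are allowed, as in '<//2' for 'about 2 years
    ago'."""
    assert term.startswith('<')
    parts = term[1:].split('/')
    days = int(parts[0]) if len(parts) > 0 and parts[0] else 0
    months = int(parts[1]) if len(parts) > 1 and parts[1] else 0
    years = int(parts[2]) if len(parts) > 2 and parts[2] else 0
    return days + 30 * months + 365 * years

print(parse_relative_date('<10'))      # about 10 days ago
print(parse_relative_date('<15/6/1'))  # 15 days, 6 months, and 1 year ago
print(parse_relative_date('<//2'))     # about 2 years ago
```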


- Search for articles containing a URL

Not sure how to handle this yet.


- option to ONLY return the heads of threads

Ahh yes -- I'll put that on the "must do" list.  That won't be too hard once
the thread lists are generated internally for other purposes.

--Todd


Subject: 
Re: News search function reactivated (was: News search function temporarily disabled)
Newsgroups: 
lugnet.admin.general
Date: 
Tue, 2 Jan 2001 19:00:58 GMT
Viewed: 
1250 times
  
In lugnet.admin.general, Todd Lehman writes:
Is that the sort of functionality you're looking for?

Pretty much... actually, I was thinking more along the lines of an advanced
search form though:

Search for: ______________________ (uses +'s and -'s as is)
Search for text in subject line [] (checkbox)
Posted by: _______________________ (uses +'s and -'s... or no symbols, too)
Search only for heads of a thread [] (checkbox)
Posted before: ___ /____ / ____
Posted after: ___ / ____/ ____

But throwing in symbols/wildcards on the command line instead of in a form
works for me too :)

If you search for

  david eaton <10

then it'll show you things matching "david eaton" that were posted "about 10
days ago" (plus or minus 10 days -- with higher matches given to those that
are closer to the 10-days-ago-point).

So-- looks like if today is day 100, "david eaton <10" would search days
80-100, giving highest precedence to things closest to day 90... cool! I
think that's actually really useful :)

I'm thinkin' this'll be quite useful for digging up stuff that's "about a
week ago" or "about 3 months ago" or "oh, about 2 years ago" -- when it's
tough to remember an exact date.

Poifect! That's something I had been wanting for a while since I'll know
(for example) that I posted something last spring, but I want to make sure
NOT to search for anything after, say, June, or before March... very cool
indeed :)

- Search for articles containing a URL

Not sure how to handle this yet.

Yeah, that's just one of those "Oh, if only" things-- mostly for when I'm
looking for someone who posted a link to their site... occasionally I've
tried to do this by putting "http" on the query string, although it doesn't
rule out posts that give their URLs as "www.foo.com/~blah/cool_page.html".

- option to ONLY return the heads of threads

Ahh yes -- I'll put that on the "must do" list.  That won't be too hard once
the thread lists are generated internally for other purposes.

Cool!

DaveE


Subject: 
Re: News search function reactivated (was: News search function temporarily disabled)
Newsgroups: 
lugnet.admin.general
Date: 
Tue, 2 Jan 2001 19:04:19 GMT
Viewed: 
971 times
  
Todd Lehman wrote:
- Search for articles containing a URL

Not sure how to handle this yet.

One thing which would generally make it pretty easy to find URLs is to
index "http". When dealing with special characters, definitely treat "/"
and "\" as word separators. Probably ":" also.

--
Frank Filz

-----------------------------
Work: mailto:ffilz@us.ibm.com (business only please)
Home: mailto:ffilz@mindspring.com


Subject: 
Re: News search function reactivated (was: News search function temporarily disabled)
Newsgroups: 
lugnet.admin.general
Date: 
Tue, 2 Jan 2001 19:07:34 GMT
Viewed: 
1003 times
  
Todd Lehman wrote:
I know what you mean, though, about being able to restrict a search to
_specifically_ some exact subject or author.  I'll think about how I might
be able to handle this in the future -- it would be a separate index database
for each of the two fields.

Do you index the name in "X-real-life-name"?

One thought, index the special strings "from:" and "subject:". The the
search:

   from: ffilz

Should rank my posts highly due to proximity. Of course it would be
better to index the real life name as if it was preceded by from: also
so that you could search:

   from: filz

and find my posts.


--
Frank Filz

-----------------------------
Work: mailto:ffilz@us.ibm.com (business only please)
Home: mailto:ffilz@mindspring.com


Subject: 
RE: News search function reactivated (was: News search function temporarily disabled)
Newsgroups: 
lugnet.admin.general
Date: 
Tue, 2 Jan 2001 23:03:02 GMT
Viewed: 
1005 times
  
David Eaton writes:
In lugnet.admin.general, Todd Lehman writes:
Is that the sort of functionality you're looking for?

Pretty much... actually, I was thinking more along the lines of
an advanced search form though:

Search for: ______________________ (uses +'s and -'s as is)
Search for text in subject line [] (checkbox)
Posted by: _______________________ (uses +'s and -'s... or no
symbols, too)
Search only for heads of a thread [] (checkbox)
Posted before: ___ /____ / ____
Posted after: ___ / ____/ ____

But throwing in symbols/wildcards on the command line instead of in a form
works for me too :)

I think a multi-field advanced search is the way to go...much easier to use,
IMO--something like the DejaNews power search:
http://www.deja.com/home_ps.shtml
The actual search keywords are in one field, and then there are numerous
ways to limit the search.
--Bram


Bram Lambrecht
bram@cwru.edu
http://home.cwru.edu/~bxl34/


Subject: 
Re: News search function reactivated (was: News search function temporarily disabled)
Newsgroups: 
lugnet.admin.general
Date: 
Wed, 3 Jan 2001 01:51:23 GMT
Viewed: 
1040 times
  
In lugnet.admin.general, Frank Filz writes:
Todd Lehman wrote:
I know what you mean, though, about being able to restrict a search to
_specifically_ some exact subject or author.  I'll think about how I might
be able to handle this in the future -- it would be a separate index
database for each of the two fields.

Do you index the name in "X-real-life-name"?

Ya, let's see...as it assembles the text to index, first it grabs
X-Real-Life-Name:, then it grabs either Original-From: or From:, then
Subject:, then Keywords:, then Summary:, and then finally the non-quoted
and non-sig parts of the body.

So, for example, on your post that I'm replying to, it would generate

   frank filz frank filz re news search function reactivated was news search
   function temporarily disabled do you index the name in x real life name
   one thought index the special strings from and subject the the search
   etc., etc.

And then it would remove a few stopwords and then feed that to the actual
indexer.
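The assembly order described above can be sketched as follows.  The function
name `assemble_index_text` and the dict-of-headers shape are illustrative
assumptions, not the real indexer's interface.

```python
def assemble_index_text(article):
    """Build the text to index in the order described: real-life name,
    then From (or Original-From), Subject, Keywords, Summary, and finally
    the body with quoted lines and the signature removed."""
    parts = [
        article.get('X-Real-Life-Name', ''),
        article.get('Original-From') or article.get('From', ''),
        article.get('Subject', ''),
        article.get('Keywords', ''),
        article.get('Summary', ''),
    ]
    body_lines = [
        line for line in article.get('body', '').splitlines()
        if not line.startswith('>')        # drop quoted text
    ]
    if '-- ' in body_lines:                # drop everything after a sig separator
        body_lines = body_lines[:body_lines.index('-- ')]
    parts.append(' '.join(body_lines))
    return ' '.join(p for p in parts if p).lower()

art = {'X-Real-Life-Name': 'Frank Filz', 'From': 'ffilz@mindspring.com',
       'Subject': 'Re: News search', 'body': '> quoted\nDo you index it?\n-- \nsig'}
print(assemble_index_text(art))
```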


One thought, index the special strings "from:" and "subject:". The the
search:

   from: ffilz

Should rank my posts highly due to proximity. Of course it would be
better to index the real life name as if it was preceded by from: also
so that you could search:

   from: filz

and find my posts.

Ah.  That's a neat trick!  It's a little English-centric, though, but it's
still a very simple and elegant partial solution...one of those "80% of the
benefits for only 20% of the work" types of things.  Unfortunately, it would
mean reindexing the entire news corpus from scratch, because they'd be
insertions in the numeric word-order lists -- so it's probably something to
do along with other additions the next time the index is rebuilt from scratch.
The last time I rebuilt it, I think it took a whole day, and that was with
about 1/4 as many articles in the system.  (The indexer is optimized for fast
incremental indexing rather than fast one-time building.)

--Todd


Subject: 
Re: News search function reactivated (was: News search function temporarily disabled)
Newsgroups: 
lugnet.admin.general
Date: 
Wed, 3 Jan 2001 02:48:21 GMT
Viewed: 
1306 times
  
In lugnet.admin.general, David Eaton writes:
In lugnet.admin.general, Todd Lehman writes:
Is that the sort of functionality you're looking for?

Pretty much... actually, I was thinking more along the lines of an advanced
search form though:

Search for: ______________________ (uses +'s and -'s as is)
Search for text in subject line [] (checkbox)
Posted by: _______________________ (uses +'s and -'s... or no symbols, too)
Search only for heads of a thread [] (checkbox)
Posted before: ___ /____ / ____
Posted after: ___ / ____/ ____

But throwing in symbols/wildcards on the command line instead of in a form
works for me too :)

Ya, something like that'd be good to slap on top after the base functionality.
:-)  Nobody wants to *have* to remember how all the squiggly and square brace
thingums in a search box work.  :-)


If you search for

  david eaton <10

then it'll show you things matching "david eaton" that were posted "about 10
days ago" (plus or minus 10 days -- with higher matches given to those that
are closer to the 10-days-ago-point).

So-- looks like if today is day 100, "david eaton <10" would search days
80-100, giving highest precedence to things closest to day 90...

Ya, precisely.  It was originally (summer of '99) a smooth bell-shaped curve

   y = exp(-.5 * x^2)

(x being amount of deviation from the target and y being the output function
giving the fitness value) but that was chewing up 10^-6 seconds in one of the
tight inner loops (i.e., wasting 0.07 CPU seconds on a word like 'lego' with
~70000 hits), so I threw that out and changed it to a linear 1-|x| shaped spike
curve

   y = max(0, 1-|x|)

instead.  That doubled the overall throughput and still gave decent results.

The main advantage of the bell curve y=exp(-.5*x^2) is that y>0 for all x,
but that can also be a disadvantage.  The sharp y=max(0,1-|x|) curve has a
nice sudden cutoff at x<-1 and x>1.  :-)  In a way I'm kinda glad the bell
curve was so slow to compute.
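The two fitness curves compared above, side by side, transcribed directly
from the formulas in the text:

```python
import math

def bell(x):
    """Original fitness curve: smooth, and y > 0 for all x."""
    return math.exp(-0.5 * x * x)

def spike(x):
    """Replacement: linear spike with a hard cutoff outside |x| < 1."""
    return max(0.0, 1.0 - abs(x))

# Both peak at the target date (x = 0)...
assert bell(0) == spike(0) == 1.0
# ...but only the spike cuts off entirely beyond one unit of deviation.
print(bell(2.0))   # small but nonzero
print(spike(2.0))  # exactly 0.0
```

The spike avoids the `exp` call in the inner loop, which is where the
claimed doubling of throughput comes from.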


cool! I
think that's actually really useful :)

I'm thinkin' this'll be quite useful for digging up stuff that's "about a
week ago" or "about 3 months ago" or "oh, about 2 years ago" -- when it's
tough to remember an exact date.

Poifect! That's something I had been wanting for a while since I'll know
(for example) that I posted something last spring, but I want to make sure
NOT to search for anything after, say, June, or before March... very cool
indeed :)

Ya, I can't count the number of times I've wanted to go about "about so many
days or weeks" to look for something.  I'll remember something and not know
exactly what date it was posted, but I'll remember roughly how long ago it
was.  So it's effectively doing a fuzzy date search with variable focus (wide,
narrow, etc.).


- Search for articles containing a URL
Not sure how to handle this yet.
Yeah, that's just one of those "Oh, if only" things-- mostly for when I'm
looking for someone who posted a link to their site... occasionally I've
tried to do this by putting "http" on the query string, although it doesn't
rule out posts that give their URLs as "www.foo.com/~blah/cool_page.html".

I'm not sure why I included "http" in the stopword list.  I apologize for
that.  (Probably because it would have generated zillions upon zillions of
word hits.  "com" is the most frequently indexed word here.)

"http" could certainly stand to go back in now that the query engine is so
much faster.  Maybe even other words like "it" and "that" and "the."

Here's a list of stopwords, BTW...are there any you see here that stand out
in your mind as having given you problems in the past?

   a an the it its it's this that what
   i i'm im my we me us you
   do be am is are was can has
   of for from with to in out on off at as if and but or not no have so
   http www

One thing I *really* hate about stopword lists is that they're so language-
centric (i.e., a stopword in one language might be a darn-tootin' regular
good word in another language).

Also, all single-letter words are ignored (i.e., "a" and "i" for English and
"y" for Spanish, etc.).
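Applying the stopword list and the single-letter rule above might look like
this (a sketch only; the real indexer's handling may differ in detail):

```python
# The stopword list quoted in the post above.
STOPWORDS = set("""a an the it its it's this that what
i i'm im my we me us you
do be am is are was can has
of for from with to in out on off at as if and but or not no have so
http www""".split())

def remove_stopwords(words):
    """Drop stopwords and all single-letter words, per the rules above."""
    return [w for w in words
            if len(w) > 1 and w.lower() not in STOPWORDS]

print(remove_stopwords(['i', 'posted', 'it', 'on', 'www', 'lugnet', 'y']))
```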


- option to ONLY return the heads of threads
Ahh yes -- I'll put that on the "must do" list.  That won't be too hard once
the thread lists are generated internally for other purposes.
Cool!

I've planned ahead here.  For each group, there'll be a list of articles that
comprise the heads-of-threads for those groups.  That list can then be used to
generate more compact views into the group, or it can be fed into the
query engine as an "include only these" filter.  In memory, once loaded, the
article filter lists are 1-bit flags -- 1 bit per article position -- so even
a list of a quarter million articles consumes only 30 KB of memory for the
fraction of a second that it's needed.
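The 1-bit-per-article arithmetic above checks out: a quarter million bits is
a little over 30 KB.  A minimal bitset sketch (illustrative only, not the
actual filter structure):

```python
class ArticleFlags:
    """One bit per article position, packed into a bytearray."""

    def __init__(self, max_articles):
        self.bits = bytearray((max_articles + 7) // 8)

    def set(self, n):
        self.bits[n >> 3] |= 1 << (n & 7)

    def test(self, n):
        return bool(self.bits[n >> 3] & (1 << (n & 7)))

flags = ArticleFlags(250_000)
print(len(flags.bits))        # 31250 bytes, i.e. about 30 KB
flags.set(8613)
print(flags.test(8613), flags.test(8614))
```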

--Todd


Subject: 
Article bit-flags (was: Re: News search function reactivated)
Newsgroups: 
lugnet.admin.general
Date: 
Wed, 3 Jan 2001 03:57:06 GMT
Viewed: 
1105 times
  
In lugnet.admin.general, Todd Lehman writes:
[...]
I've planned ahead here.  For each group, there'll be a list of articles that
comprise the heads-of-threads for those groups.  That list can then be used to
generate more compact views into the group, or it can be fed into the
query engine as an "include only these" filter.  In memory, once loaded, the
article filter lists are 1-bit flags -- 1 bit per article position -- so even
a list of a quarter million articles consumes only 30 KB of memory for the
fraction of a second that it's needed.

Oh, one other thing...planning ahead:

Another potential application of article bit-flags is read/unread lists on a
person-by-person basis via the web interface.  I know this is something that
people have been asking for for a long time.  When updating a flag, it would
be as simple as a single disk read/write to change a single bit.  And when
mass-article filtering, it would simply load into memory as a memory-mapped
file and would be extremely efficient in terms of both space and speed.  It
would even be easy to make customized message folders and move messages
between them, delete messages, mark unread, etc.
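The single-bit disk update described above could be sketched like this.
The name `set_read_flag` and the flat flag-file layout are hypothetical; the
point is that marking one article read touches only one byte on disk.

```python
import os
import tempfile

def set_read_flag(path, article_no, value=True):
    """Seek to the byte holding this article's flag, flip one bit, and
    write the byte back -- a single read and a single write."""
    byte_index, bit = article_no >> 3, 1 << (article_no & 7)
    with open(path, 'r+b') as f:
        f.seek(byte_index)
        current = f.read(1)[0]
        updated = (current | bit) if value else (current & ~bit)
        f.seek(byte_index)
        f.write(bytes([updated]))

# Demo with a small pre-sized flag file (room for 8192 articles).
path = os.path.join(tempfile.mkdtemp(), 'read-flags')
with open(path, 'wb') as f:
    f.write(bytes(1024))
set_read_flag(path, 4451)
with open(path, 'rb') as f:
    data = f.read()
print(bool(data[4451 >> 3] & (1 << (4451 & 7))))
```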

Would you like to be able to say, "search for 'blah blah blah' within my
'cool models' folder"?

--Todd


Subject: 
Re: Article bit-flags (was: Re: News search function reactivated)
Newsgroups: 
lugnet.admin.general
Date: 
Wed, 3 Jan 2001 04:03:46 GMT
Viewed: 
1086 times
  
Todd Lehman wrote:
Oh, one other thing...planning ahead:

Another potential application of article bit-flags is read/unread lists on a
person-by-person basis via the web interface.  I know this is something that
people have been asking for for a long time.  When updating a flag, it would
be as simple as a single disk read/write to change a single bit.  And when
mass-article filtering, it would simply load into memory as a memory-mapped
file and would be extremely efficient in terms of both space and speed.  It
would even be easy to make customized message folders and move messages
between them, delete messages, mark unread, etc.

Would you like to be able to say, "search for 'blah blah blah' within my
'cool models' folder"?

Ooh ooh ooh... One thing I really really want is to be able to put
messages into folders (if anyone knows of a decent newsreader which
allows such - please let me know - it would be preferable that it do so
without requiring me to store the message).

Frank


Subject: 
Re: News search function reactivated (was: News search function temporarily disabled)
Newsgroups: 
lugnet.admin.general, lugnet.off-topic.geek
Date: 
Wed, 3 Jan 2001 04:17:03 GMT
Viewed: 
1720 times
  
In lugnet.admin.general, Todd Lehman writes:
In lugnet.admin.general, Dan Jezek writes:
In lugnet.admin.general, Todd Lehman writes:
The LUGNET News search function is now re-enabled.  I completely revamped
the index data structures and list-merge algorithm and rewrote the core
query engine in C.  It's a much more solid implementation.

Very good!  Although the new search doesn't return most recent articles
first like it used to.  Is that how it should work?
For now, if you don't mind URL mucking, you can manually append

  &qs=<number>

to the URL and it will use that number (in seconds) as a time delta.  For
example, to limit posts to the last 10 days, use

  &qs=864000

It works great! ... but the &qs doesn't carry over to the next page of
results.  So if I want to see more pages, I have to edit the querystring on
each page.  Since you already have the inner workings of this in place, it
would be really easy to just add a textbox named "qs" and add the &qs= to
the bottom "5 more, 10 more"... links.  With a little more effort, you could
include radio buttons to have the user select how many days, months or years
they want to go back and have your search engine convert it to seconds
depending on what the user selects.

It's actually in the nature of search engines to generate thousands of
results.

If given thousands of results, most search engines have some advanced
options like sorting.

What's more important is the first page returned -- i.e., the ranking.
Typically one doesn't dig down past the first few, so you rarely
actually go visit all the thousands.

I'd be interested in seeing some statistics on how far the average user goes
when given back let's say 10, 100 and 1,000 pages of results.  It would help
in the design of an effective search engine.


Subject: 
Re: News search function reactivated (was: News search function temporarily disabled)
Newsgroups: 
lugnet.admin.general, lugnet.off-topic.geek
Date: 
Wed, 3 Jan 2001 05:16:07 GMT
Viewed: 
9466 times
  
In lugnet.admin.general, Dan Jezek writes:
[...]
example, to limit posts to the last 10 days, use

   &qs=864000

It works great! ... but the &qs doesn't carry over to the next page of
results.  So if I want to see more pages, I have to edit the querystring on
each page.

oops, doy!  I didn't put in the propagation of that URL term.  I don't consider
it 100% "documented" yet (it's still subject to change without notice), but I
still shouldn't have missed that.  Thanks.  I'll fix that.

The reason it's subject to change is partially because the letter 's' in 'qs'
is named after the word (or greek letter, rather) 'sigma' -- sigma being 1
standard deviation in the bell curve function f(x) = exp(-x^2/2) -- and that
formula isn't being used anymore in the searches, and partially because 'qs'
might better someday be used for "query subject."  Anyway, it's still not 100%
in stone.  But it'll work until it breaks.


Since you already have the inner workings of this in place, it
would be really easy to just add a textbox named "qs" and add the &qs= to
the bottom "5 more, 10 more"... links.  With a little more effort, you could
include radio buttons to have the user select how many days, months or years
they want to go back and have your search engine convert it to seconds
depending on what the user selects.

Yup, that's the idea!!!  Say, where is that old article about sigma and
advanced options...ah! so easy to find now!  :-)

   http://news.lugnet.com/?q=url+query+qs+qt+sigma+%3C//1.5

(See topmost result and related thread for more background.)


It's actually in the nature of search engines to generate thousands of
results.

If given thousands of results, most search engines have some advanced
options like sorting.

Well, they -are- sorted.  They're always sorted -- always highest probability
of relevance first, lowest last.  Usually, the metric for relevance is a
combination of non-temporal factors such as word frequencies, word proximities,
and word orderings.  I don't know of any search engine that doesn't sort (on
some criteria) the matches it finds.  But anyway, I think you meant sorting
by time?

I wonder if a little link at the top to re-deploy the search taking recentness
into account (or conversely, turning it off if it's on) would be useful?


What's more important is the first page returned -- i.e., the ranking.
Typically one doesn't dig down past the first few, so you rarely
actually go visit all the thousands.

I'd be interested in seeing some statistics on how far the average user goes
when given back let's say 10, 100 and 1,000 pages of results.  It would help
in the design of an effective search engine.

Me too.  I'd expect a f(x)=1/x type of curve, but it would be fun to see actual
numbers.  :-)

--Todd


Subject: 
Re: News search function reactivated (was: News search function temporarily disabled)
Newsgroups: 
lugnet.admin.general, lugnet.off-topic.geek
Date: 
Wed, 3 Jan 2001 06:16:20 GMT
Viewed: 
4369 times
  
In lugnet.admin.general, Todd Lehman writes:
oops, doy!  I didn't put in the propagation of that URL term.  I don't consider
it 100% "documented" yet (it's still subject to change without notice), but I
still shouldn't have missed that.  Thanks.  I'll fix that.
The reason it's subject to change is partially because the letter 's' in 'qs'
is named after the word (or greek letter, rather) 'sigma' -- sigma being 1
standard deviation in the bell curve function f(x) = exp(-x^2/2) -- and that
formula isn't being used anymore in the searches, and partially because 'qs'
might better someday be used for "query subject."  Anyway, it's still not 100%
in stone.  But it'll work until it breaks.

Wow!  So you have terms for the ampersand options in a URL?  My standpoint
on this would be to put everything in a form and kill 2 birds with 1 stone -
not having to think of how to name URL terms (unless you enjoy doing that)
and having the search more user-friendly (not everyone will remember the
options or find it easy to edit the URL).

If given thousands of results, most search engines have some advanced
options like sorting.

Well, they -are- sorted.  They're always sorted -- always highest probability
of relevance first, lowest last.  Usually, the metric for relevance is a
combination of non-temporal factors such as word frequencies, word proximities,
and word orderings.  I don't know of any search engine that doesn't sort (on
some criteria) the matches it finds.  But anyway, I think you meant sorting
by time?

No, I meant having the option to pick between what I want the results to be
sorting on.  Dejanews has a great power search:

http://www.deja.com/home_ps.shtml

which includes the option to sort by relevance, subject, forum, author and
date.  That's how I would like to see the sort options here.  But knowing
that you most likely don't have the resources that dejanews has and how
flawlessly Lugnet runs on the current setup, I'm satisfied with editing the
URL for now :-)

I'd be interested in seeing some statistics on how far the average user goes
when given back let's say 10, 100 and 1,000 pages of results.  It would help
in the design of an effective search engine.

Me too.  I'd expect an f(x)=1/x type of curve, but it would be fun to see actual numbers.  :-)

It could be done.  Include another version of jump.cgi into the 5 more, 10
more... on the search results page and log the number of results returned,
the IP address and the query subject.  Then run an average, min, max query
grouped by all 3 fields.  Sounds complicated; it depends on how badly you want
to see the results.  I wouldn't want to go through the process of
implementing that but would really like to see the results :-)


Subject: 
Re: News search function reactivated (was: News search function temporarily disabled)
Newsgroups: 
lugnet.admin.general, lugnet.off-topic.geek
Date: 
Wed, 3 Jan 2001 14:11:53 GMT
Viewed: 
7075 times
  
In lugnet.admin.general, Dan Jezek writes:
Wow!  So you have terms for the ampersand options in a URL?  My standpoint
on this would be to put everything in a form and kill 2 birds with 1 stone -
not having to think of how to name URL terms (unless you enjoy doing that)
and having the search more user-friendly (not everyone will remember the
options or find it easy to edit the URL).

Ya, exactly -- first name the URL components carefully and then put a user-
friendly level on top of it.  Best of both worlds.


No, I meant having the option to pick between what I want the results to be
sorting on.  Dejanews has a great power search:

http://www.deja.com/home_ps.shtml

which includes the option to sort by relevance, subject, forum, author and
date.  That's how I would like to see the sort options here.

Ah, I see.  Yeah, that could be helpful in certain cases, if you're scouring
tons of results!  I've needed to look things up on Deja.com, so I know what
you mean.


But knowing
that you most likely don't have the resources that dejanews has and how
flawlessly Lugnet runs on the current setup, I'm satisfied with editing the
URL for now :-)

There's an alternate form that avoids the &qs= thingie, so you don't have to
edit the URLs:

http://news.lugnet.com/admin/general/?n=8613


It could be done.  Include another version of jump.cgi into the 5 more, 10
more... on the search results page and log the number of results returned,
the IP address and the query subject.

These don't actually run through jump.cgi.  But they're already logged by
httpd anyway.  (That's how the jump.cgi logging is implemented as well.)


Then run an average, min, max query
grouped by all 3 fields.  Sounds complicated, depends on how badly you want
to see the results.  I wouldn't want to go through the process of
implementing that but would really like to see the results :-)

Hmm, it's all there now, except for logging the number of results produced.
I guess it could be as simple as open for append, flock, print, and close on
a filehandle inside of the search page...lemme think about it.  Analyzing the
results and making a graph would be a snap with gnuplot.

I think it would be especially fun to compare the graph now to the way it was
(would have been) before the change...but alas, that data was never captured
for the old query engine and it's too late now.

--Todd


Subject: 
Re: News search function reactivated (was: News search function temporarily disabled)
Newsgroups: 
lugnet.admin.general
Date: 
Wed, 3 Jan 2001 14:44:55 GMT
Viewed: 
1115 times
  
In lugnet.admin.general, Todd Lehman writes:
I'm not sure why I included "http" in the stopword list.  I apologize for
that.  (Probably because it would have generated zillions upon zillions of
word hits.  "com" is the most frequently indexed word here.)

"http" could certainly stand to go back in now that the query engine is so
much faster.  Maybe even other words like "it" and "that" and "the."

As an algorithmic guess, I think I'd probably attempt something a bit
different... If someone enters:

the

I'd probably want to ignore it. But if they entered:

the best design

I might want to consider the 'the'. Dunno. I'd probably test an algorithm
that ignored all stopwords in the initial search, but then for each result
of the initial search, score it with respect to any stopwords found (&
proximity, etc).

Here's a list of stopwords, BTW...are there any you see here that stand out
in your mind as having given you problems in the past?

  a an the it its it's this that what
  i i'm im my we me us you
  do be am is are was can has
  of for from with to in out on off at as if and but or not no have so
  http www

The only things here that've given me problems (or that I'd expect might
give others problems [apart from language differences]) are http and www.
And now that I look at it, I've got to ask: if I specified "http" as a
search parameter, would it in fact score every post made with the webserver
as having an instance, thanks to the X-Nntp-Gateway?  Same with www, I
suppose, except that all new webserver posts are from 'news.lugnet.com'
instead -- but all older posts would show up?  Or is various header info ignored?

One thing I *really* hate about stopword lists is that they're so language-
centric (i.e., a stopword in one language might be a darn-tootin' regular
good word in another language).

It does present a rather unique problem (unless the search proves capable of
handling stopwords)... Fortunately the vast majority of posts seem to be in
one language for now, even though there has been a distinct increase in some
other languages in the last year and a half or so :)

I've planned ahead here.  For each group, there'll be a list of articles that
comprise the heads-of-threads for those groups.  That list can then be used to
generate more compact views into the group, or it can also be fed into the
query engine as an "include only these" filter.  In memory, once loaded, the
article filter lists are 1-bit flags -- 1 bit per article position -- so even
a list of a quarter million articles consumes only 30 KB of memory for the
fraction of a second that it's needed.

Very cool! You mention later the prospect of having 'folders' for things
like 'cool sites' or 'space models' or 'great building ideas' for members
using the web-interface... Would this be something user-configurable? Could
I create as many folders as I would want, or would there be a set number of
X folders that would be pre-determined?

Either way (assuming people wouldn't make hundreds of folders for
themselves) it sounds like there's not a space concern (my worst estimates
of millions of posts and thousands of users weren't all that bad
considering). And as long as the interface isn't doing something dynamic
with the folders (making little action icons for each folder on article
listing pages, etc), sounds like there wouldn't be a real time strain
either... A very cool idea!

DaveE


Subject: 
Re: News search function reactivated (was: News search function temporarily disabled)
Newsgroups: 
lugnet.off-topic.geek
Date: 
Wed, 3 Jan 2001 16:34:22 GMT
Viewed: 
776 times
  
In lugnet.admin.general, Todd Lehman writes:
The LUGNET News search function is now re-enabled.  I completely revamped
the index data structures and list-merge algorithm and rewrote the core
query engine in C.  It's a much more solid implementation.

All geeks capitulate sooner or later on perl vs C.  Of course
Larry (and many others I work with) would tell you to write that
stuff in Java but that would be a step backwards.

Congratulations!  Of course its stability depends greatly on your
diligence in policing your pointers.

KL


Subject: 
Re: News search function reactivated (was: News search function temporarily disabled)
Newsgroups: 
lugnet.off-topic.geek
Date: 
Wed, 3 Jan 2001 17:21:47 GMT
Viewed: 
815 times
  
In lugnet.off-topic.geek, Kevin Loch writes:
In lugnet.admin.general, Todd Lehman writes:
The LUGNET News search function is now re-enabled.  I completely revamped
the index data structures and list-merge algorithm and rewrote the core
query engine in C.  It's a much more solid implementation.
All geeks capitulate sooner or later on perl vs C.  Of course
Larry (and many others I work with) would tell you to write that
stuff in Java but that would be a step backwards.

I like Java.  But this really needed to be close to the metal and generate
code that would fit in the L1 cache for the non-memory-bus-bound portions
of the loops.  The GNU C compiler is incredible.


Congratulations!  Of course its stability depends greatly on your
diligence in policing your pointers.

It's the best C code I've written in 12 years.  I think being away from C for
3 1/2 years has helped.

--Todd


Subject: 
Re: News search function reactivated (was: News search function temporarily disabled)
Newsgroups: 
lugnet.off-topic.geek
Date: 
Wed, 3 Jan 2001 20:41:42 GMT
Viewed: 
822 times
  
In lugnet.off-topic.geek, Todd Lehman writes:
In lugnet.off-topic.geek, Kevin Loch writes:
In lugnet.admin.general, Todd Lehman writes:
The LUGNET News search function is now re-enabled.  I completely revamped
the index data structures and list-merge algorithm and rewrote the core
query engine in C.  It's a much more solid implementation.
All geeks capitulate sooner or later on perl vs C.  Of course
Larry (and many others I work with) would tell you to write that
stuff in Java but that would be a step backwards.

I like Java.  But this really needed to be close to the metal and generate
code that would fit in the L1 cache for the non-memory-bus-bound portions
of the loops.  The GNU C compiler is incredible.


Aah, you definitely couldn't have done that in Java.  Of course the ability
to declare a couple register variables helps too.

KL


Subject: 
Re: News search function reactivated (was: News search function temporarily disabled)
Newsgroups: 
lugnet.off-topic.geek
Date: 
Wed, 3 Jan 2001 22:39:52 GMT
Viewed: 
835 times
  
Kevin Loch wrote:

In lugnet.off-topic.geek, Todd Lehman writes:
In lugnet.off-topic.geek, Kevin Loch writes:
In lugnet.admin.general, Todd Lehman writes:
The LUGNET News search function is now re-enabled.  I completely revamped
the index data structures and list-merge algorithm and rewrote the core
query engine in C.  It's a much more solid implementation.
All geeks capitulate sooner or later on perl vs C.  Of course
Larry (and many others I work with) would tell you to write that
stuff in Java but that would be a step backwards.

I like Java.  But this really needed to be close to the metal and generate
code that would fit in the L1 cache for the non-memory-bus-bound portions
of the loops.  The GNU C compiler is incredible.


Aah, you definitely couldn't have done that in Java.  Of course the ability
to declare a couple register variables helps too.

Of course, I've heard of situations where an interpreter outdid hand-
crafted assembler.  This can occur if the portion of the interpreter
needed to run your code fits in the code cache and the byte codes fit
in the data cache, while the hand-crafted assembler code wouldn't fit in
the code cache.  If the code processes a large quantity of data without
revisiting it, the data cache may be irrelevant for the processed data
(even a few back-references are fine if they stay relatively local --
for example, some kind of vector computation where res[i] = f(data[i],
data[i+1])).

--
Frank Filz

-----------------------------
Work: mailto:ffilz@us.ibm.com (business only please)
Home: mailto:ffilz@mindspring.com


Subject: 
Re: News search function reactivated (was: News search function temporarily disabled)
Newsgroups: 
lugnet.off-topic.geek
Date: 
Wed, 3 Jan 2001 23:18:06 GMT
Viewed: 
832 times
  
In lugnet.off-topic.geek, Kevin Loch writes:
I like Java.  But this really needed to be close to the metal and generate
code that would fit in the L1 cache for the non-memory-bus-bound portions
of the loops.  The GNU C compiler is incredible.
Aah, you definately couldn't have done that in Java.

I probably couldn't, no, but a very experienced Java programmer and a good
JVM machine could conceivably do better than C.  (It's not unheard of for
Java to be faster than C for certain types of things.)  The big hits would
probably be the JVM environment startup and the array boundary checking.
Anyway, ya, I wouldn't wanna try this sort of thing in Java...I'm much more
comfortable with C when it comes to this sort of metal grinding.  :-)


Of course the ability to declare a couple register variables helps too.

Well, any good C compiler these days actually ignores the 'register' keyword.
(Right?)  The compiler does a better job of register allocation than a human
does (when it comes to assigning C variables to registers), especially on
modern superscalar pipelined CPU architectures.

--Todd


Subject: 
Re: News search function reactivated (was: News search function temporarily disabled)
Newsgroups: 
lugnet.admin.general
Date: 
Thu, 4 Jan 2001 05:07:43 GMT
Viewed: 
1215 times
  
In lugnet.admin.general, David Eaton writes:
As an algorithmic guess, I think I'd probably attempt something a bit
different... If someone enters:
   the
I'd probably want to ignore it. But if they entered:
   the best design
I might want to consider the 'the'. Dunno. I'd probably test an algorithm
that ignored all stopwords in the initial search, but then for each result
of the initial search, score it with respect to any stopwords found (&
proximity, etc).

Oh -- actually, what search engines typically do on queries (and I just
finally added this last week) is downvalue relatively common words and
upvalue relatively uncommon words -- what's called "term ranking" or "term
weighting."  For example, when you search for "lego duplo", there are
(currently) about 65,000 documents accounting for about 140,000 word-hits
of "lego" and only about 1500 documents accounting for only about 3300
word-hits of "duplo".  So the importance of the word "duplo" is very high
relative to the word "lego".  Similarly, there are tons of articles with
"david" but only about one fifth as many with "eaton" -- so a search for
"david eaton" has an easier time finding David Eaton among all the other
Davids.


The only things here that've given me problems (or that I'd expect might
give others problems [apart from language differences]) are http and www.
And now that I look at it, I've got to ask: if I specified "http" as a
search parameter, would it in fact score every post made with the webserver
as having an instance, thanks to the X-Nntp-Gateway?  Same with www, I
suppose, except that all new webserver posts are from 'news.lugnet.com'
instead -- but all older posts would show up?  Or is various header info
ignored?

Right.  It only includes these headers in the indexed text:

   X-Real-Life-Name:
   Original-From:
   From:
   Subject:
   Keywords:
   Summary:

Other headers are ignored.  Quoted content is also ignored.  And it also does
its best to ignore lines like "on such and such a date, so and so wrote..."
and sigs.


Very cool! You mention later the prospect of having 'folders' for things
like 'cool sites' or 'space models' or 'great building ideas' for members
using the web-interface... Would this be something user-configurable?

Yes.


Could I create as many folders as I would want, or would there be a set
number of X folders that would be pre-determined?

Unread, Read, Save, and Trash would probably be the only predefined folders.
All the default names would be renamable to whatever name you want.  (I don't
want them to have to be in English.)  You'll prolly be able to set any
bitfields you want on any folder.  Bitfield options would be things like
"this is the trash folder" and "this is an incoming folder."  The Trash folder
would simply be a folder named "Trash" with the "this is a trash folder"
property set.  The Unread folder would simply be a folder named "Unread" that
was configured to receive new articles from some mix of groups you defined.
As soon as you read an article, it would hop to some other folder -- e.g.,
the Read folder or wherever.  The Save folder wouldn't be anything special
-- just a regular folder.  If you deleted something, it would go to the Trash
folder.  If you delete something from the Trash or empty the Trash, it goes
away for good (away from your personal lists, that is...it would still be on
the newsserver, naturally).


Either way (assuming people wouldn't make hundreds of folders for
themselves) it sounds like there's not a space concern (my worst estimates
of millions of posts and thousands of users weren't all that bad
considering).

Ya, with the right data structures, space isn't an issue.  Sparse bitfields
and ID lists are the way to go.

--Todd


And as long as the interface isn't doing something dynamic
with the folders (making little action icons for each folder on article
listing pages, etc), sounds like there wouldn't be a real time strain
either... A very cool idea!

DaveE


Subject: 
Re: News search function reactivated (was: News search function temporarily disabled)
Newsgroups: 
lugnet.admin.general
Date: 
Mon, 5 Feb 2001 01:41:53 GMT
Viewed: 
1303 times
  
In lugnet.admin.general, Todd Lehman writes:
It's actually in the nature of search engines to generate thousands of
results.

If given thousands of results, most search engines have some advanced
options like sorting.

Well, they -are- sorted.  They're always sorted -- always highest probability
of relevance first, lowest last.  Usually, the metric for relevance is a
combination of non-temporal factors such as word frequencies, word
proximities, and word orderings.  I don't know of any search engine that
doesn't sort (on some criteria) the matches it finds.  But anyway, I think
you meant sorting by time?

I wonder if a little link at the top to re-deploy the search taking recentness
into account (or conversely, turning it off if it's on) would be useful?

Todd, this would be really useful. I'll often search for a recent post, only
remembering the poster's name and maybe one or two key-words, and that the
post was in the past few days. I don't need two year old messages nearly as
frequently. Could you change the display so that when results have the same
score more recent posts are displayed first? I think a score weighting for
recentness would be even more useful as part of the default setting --
obviously that would be up to you.

--DaveL

