Subject: reverse indexes (was Re: Text::Query)
Newsgroups: lugnet.off-topic.geek
Date: Wed, 25 Aug 1999 14:45:34 GMT
Viewed: 1625 times
In lugnet.off-topic.geek, Todd Lehman writes:
> [...]Each word gets its own file, for example
> jeremy => .../j/je/jeremy. [...]
> and inside the file, the entries look like this (this is a snapshot of the
> last 16 lines of the 'jeremy' file):
> admin.general:2701 935490152 19 590 595 600 605
> admin.general:2702 935512318 1
> [...]
> off-topic.geek:420 935513680 1
> cad.dev:2768 935516545 1
> off-topic.fun:dear-lego:440
> [...]
> [the 'jeremy' file] currently contains 1486 lines and is 38197 bytes. [...]
>
> The total amount of disk space currently taken by all articles in the system
> is 131 MB. The word index takes up 155 MB [...] So it's all still rather
> nice and small.
I'm having trouble figuring out how that's even possible. For example, the
first sample line in the 'jeremy' file above appears to say that 'jeremy'
occurs as the 19th, 590th, 595th, 600th and 605th words of
http://www.lugnet.com/admin/general/?n=2701 (which looks more or less right if
you look at the message). So you used one line in the index for 5 words in the
message.
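
To make that reading concrete, here is a minimal sketch of how one such index line could be split apart. It assumes the layout is "group:article-number, timestamp, then word positions", which is only my interpretation of the sample; the names `IndexEntry` and `parse_entry` are made up for illustration and are not from the actual indexer.

```python
from typing import List, NamedTuple


class IndexEntry(NamedTuple):
    """One line of a per-word index file (field meanings are assumed)."""
    group: str            # newsgroup, e.g. "admin.general"
    article: int          # article number within that group
    timestamp: int        # looks like seconds since the Unix epoch
    positions: List[int]  # word offsets where the word occurs


def parse_entry(line: str) -> IndexEntry:
    # First field is "group:article", second is the timestamp,
    # and everything after that is a word position.
    ref, timestamp, *positions = line.split()
    group, article = ref.rsplit(":", 1)
    return IndexEntry(group, int(article), int(timestamp),
                      [int(p) for p in positions])


entry = parse_entry("admin.general:2701 935490152 19 590 595 600 605")
print(entry.group, entry.article, len(entry.positions))
```

Under this reading, the one line above records five occurrences of the word in a single article.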
But most words occur only once in an article, right? So there ought to be a
whole lot of overhead for the article ID and datestamp due to having lots of
index lines with only one word each. I would think your index would be about 5
times the size of the articles themselves!
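
Here is the back-of-the-envelope arithmetic behind that guess, as a rough sketch. The assumptions are mine: each index line ends in a single newline byte, and a word in an article costs its own length plus one separator byte. The 1486 lines and 38197 bytes are the figures quoted above for the 'jeremy' file.

```python
# A single-occurrence index line from the sample, with its newline.
single_line = "admin.general:2702 935512318 1\n"
word = "jeremy"

# Bytes spent in the index for one occurrence of the word.
index_bytes = len(single_line)

# Bytes the same occurrence takes in the article itself
# (the word plus one separating space; an assumption).
article_bytes = len(word) + 1

overhead_ratio = index_bytes / article_bytes

# Average line length of the 'jeremy' file, from the quoted totals.
avg_line_bytes = 38197 / 1486

print(f"index bytes per single occurrence: {index_bytes}")
print(f"article bytes per occurrence:      {article_bytes}")
print(f"overhead ratio:                    {overhead_ratio:.1f}")
print(f"average 'jeremy' line length:      {avg_line_bytes:.1f}")
```

For a short word like this one, a single-occurrence line costs several times what the word costs in the article, which is where my "about 5 times" estimate comes from.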
- Robert Munafo http://www.mrob.com/
LEGO: TC+++(8480) SW++ #+ S-- LS++ Hsp M+ A@ LM++ YB64m IC13

Message has 1 Reply:
Re: reverse indexes (was Re: Text::Query)
(...) There is a lot of overhead, yes, but it's cancelled out by: 1) NNTP headers are ignored during indexing (except the Subject and From values). 2) In the NNTP article body, the following are ignored during indexing: - Canonically quoted content (...) (26-Aug-99, to lugnet.off-topic.geek)

Message is in Reply To:
Re: Text::Query
(...) Not very magical, no. It breaks text apart by anything non-alphanumeric, where the "alpha" part includes ISO-8859-1 international letters like ã, ñ, ß, and ø, etc. It converts everything to lowercase for indexing and collapses apostrophes. (...) (25-Aug-99, to lugnet.off-topic.geek)