Subject: Re: reverse indexes (was Re: Text::Query)
Newsgroups: lugnet.off-topic.geek
Date: Thu, 26 Aug 1999 02:04:06 GMT
In lugnet.off-topic.geek, "Robert Munafo" <munafo@NOgcctech.SPAMcom> writes:
> > The total amount of disk space currently taken by all articles in the
> > system is 131 MB. The word index takes up 155 MB [...] So it's all
> > still rather nice and small.
>
> I'm having trouble figuring out how that's even possible. For example, the
> first sample line in the 'jeremy' file above appears to say that 'jeremy'
> occurs as the 19th, 590th, 595th, 600th and 605th words of
> http://www.lugnet.com/admin/general/?n=2701 (which looks more or less
> right if you look at the message). So you used one line in the index for
> 5 words in the message.
>
> But most words occur only once in an article, right? So there ought to be a
> whole lot of overhead for the article ID and datestamp due to having lots of
> index lines with only one word each. I would think your index would be about 5
> times the size of the articles themselves!
There is a lot of overhead, yes, but it's cancelled out by:
1) NNTP headers are ignored during indexing (except the Subject and From
   values).
2) In the NNTP article body, the following are ignored during indexing:
   - Canonically quoted content (including lines such as "In lugnet.foo.foo,
     Foo foo <foo@foo.foo> wrote:" and similar forms)
   - Non-canonically quoted content, wherever possible (text set off by
     abominations such as "----Original Message----")
   - Canonically and semi-canonically formatted sigs
   - About 50 stopwords such as a, an, the, it, its, be, am, and is.
3) The allocation block size of the disk is 1K, so a 1.5K article wastes
   .5K of disk, and a 2.1K article wastes .9K of disk. The 7007 articles
   which at present appear in the lugnet.general newsgroup, for example,
   take up 14633K of disk space, even though they total only 11014K of
   actual data (25% of the used space is waste).
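For anyone who wants to see the filtering in item 2 spelled out, here is a minimal sketch in Python. It is not the actual indexer: the regular expressions, the function name index_body, and the choice to number positions among the kept words only are all assumptions made for illustration, and the stopword set holds only the eight words named above rather than the full list of about 50.

```python
import re
from collections import defaultdict

# Only the stopwords named in the post; the real list had about 50.
STOPWORDS = {"a", "an", "the", "it", "its", "be", "am", "is"}

# Assumed patterns, not the rules the real indexer uses.
ATTRIBUTION = re.compile(r"^In \S+, .* (writes|wrote):\s*$")
ORIGINAL_MESSAGE = re.compile(r"^-+\s*Original Message\s*-+\s*$", re.IGNORECASE)
WORD = re.compile(r"[a-z0-9']+")


def index_body(body):
    """Map each indexed word to its 1-based positions among the kept words."""
    positions = defaultdict(list)
    count = 0
    for line in body.splitlines():
        if line.rstrip() == "--":          # canonical sig delimiter: stop here
            break
        if ORIGINAL_MESSAGE.match(line):   # non-canonical quote: drop the rest
            break
        if line.lstrip().startswith(">") or ATTRIBUTION.match(line):
            continue                       # canonically quoted content
        for word in WORD.findall(line.lower()):
            if word in STOPWORDS:
                continue
            count += 1
            positions[word].append(count)
    return dict(positions)


sample = (
    "In lugnet.foo.foo, Foo foo <foo@foo.foo> wrote:\n"
    "> quoted robotics text\n"
    "Robotics is fun and robotics is cheap\n"
    "-- \n"
    "sig robotics\n"
)
print(index_body(sample)["robotics"])  # prints [1, 4]
```

Only the one unquoted line contributes to the index here, which is the point: once quotes, sigs, and stopwords are dropped, much less text is indexed than the raw article size suggests.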
There's a lot of wastage in the index as well, but mostly in the words
that occur infrequently. The 2083 indexed words which at present begin
with "co", for example, take up 4942K of disk space, even though they
total only 3202K of actual data (35% of the used space is waste). As the
index grows, fewer and fewer words are new to it, and it slowly becomes
less wasteful of disk space. A relatively common word like "robotics"
currently takes up 95K of disk space for 96735 bytes of data (only .6%
waste), while a relatively uncommon word like "robotix" takes up 1K for
131 bytes (87% waste).
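The waste figures can be checked with a little arithmetic. The sketch below assumes every file is rounded up to a whole number of 1K (1024-byte) allocation units, which is what the numbers above imply; the helper names allocated and waste_fraction are made up for the example.

```python
BLOCK = 1024  # bytes per allocation unit, the "1K" from the post


def allocated(size_bytes, block=BLOCK):
    """Bytes actually occupied on disk: size rounded up to whole blocks."""
    return -(-size_bytes // block) * block  # ceiling division


def waste_fraction(size_bytes, block=BLOCK):
    """Fraction of the occupied space that holds no data."""
    used = allocated(size_bytes, block)
    return (used - size_bytes) / used if used else 0.0


# Per-word index files quoted in the post.
print(allocated(96735) // BLOCK, round(100 * waste_fraction(96735), 1))  # 95 0.6
print(allocated(131) // BLOCK, round(100 * waste_fraction(131), 1))      # 1 87.2

# Aggregate figures (in K) quoted in the post.
print(round(100 * (14633 - 11014) / 14633))  # 25  (lugnet.general articles)
print(round(100 * (4942 - 3202) / 4942))     # 35  (index words beginning "co")
```

The same rounding explains why rare words are expensive: a 131-byte file still occupies a full 1K unit, while a 96735-byte file wastes only the tail of its last unit.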
Anyway, so it all kind of balances out. Disk space is cheap and only
getting cheaper. :)
--Todd