Geeking : 427



Subject:	Re: reverse indexes (was Re: Text::Query)
Newsgroups:	lugnet.off-topic.geek
Date:	Thu, 26 Aug 1999 02:04:06 GMT

In lugnet.off-topic.geek, "Robert Munafo" <munafo@NOgcctech.SPAMcom> writes:
> > The total amount of disk space currently taken by all articles in the
> > system is 131 MB. The word index takes up 155 MB [...] So it's all
> > still rather nice and small.
>
> I'm having trouble figuring out how that's even possible. For example, the
> first sample line in the 'jeremy' file above appears to say that 'jeremy'
> occurs as the 19th, 590th, 595th, 600th and 605th words of
> http://www.lugnet.com/admin/general/?n=2701 (which looks more or less
> right if you look at the message). So you used one line in the index for
> 5 words in the message.
>
> But most words occur only once in an article, right? So there ought to be a
> whole lot of overhead for the article ID and datestamp due to having lots of
> index lines with only one word each. I would think your index would be about 5
> times the size of the articles themselves!

There is a lot of overhead, yes, but it's cancelled out by the following:

1) NNTP headers are ignored during indexing (except the Subject and From
   values).

2) In the NNTP article body, the following are ignored during indexing:

   - Canonically quoted content (including lines such as
     "In lugnet.foo.foo, Foo foo <foo@foo.foo> wrote:" and similar forms)
   - Non-canonically quoted content, wherever possible (text set off by
     abominations such as "----Original Message----")
   - Canonically and semi-canonically formatted sigs
   - About 50 stopwords such as a, an, the, it, its, be, am, is, etc.

3) The block size of the disk is 1K, so a 1.5K article wastes .5K of disk,
   and a 2.1K article wastes .9K of disk. The 7007 articles which at present
   appear in the lugnet.general newsgroup, for example, take up 14633K of
   disk space, even though they total only 11014K of actual data (25% of the
   used space is waste).

There's a lot of wastage in the index as well, but mostly in the words that
occur infrequently.
The 2083 indexed words which at present begin with "co", for example, take
up 4942K of disk space, even though they only total 3202K of actual data
(35% of the used space is waste).

As the index grows, fewer and fewer words are new to the index, and it
becomes slowly less wasteful of disk space. A relatively common word like
"robotics" currently takes up 95K of disk space from 96735 bytes (only .6%
waste), while a relatively uncommon word like "robotix" takes up 1K from
131 bytes (87% waste).

Anyway, so it all kind of balances out. Disk space is cheap and only getting
cheaper. :)

--Todd
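
The waste percentages quoted above follow from rounding each file up to a whole number of 1K allocation units. A minimal sketch of that arithmetic in Python, assuming a 1024-byte unit and one file per indexed word; the function names are hypothetical and are not taken from the LUGNET indexer:

```python
UNIT = 1024  # assumed allocation unit in bytes (the "1K" in the post)


def on_disk_bytes(size_bytes, unit=UNIT):
    """Bytes occupied on disk once the file is rounded up to whole units."""
    return (size_bytes + unit - 1) // unit * unit


def waste_fraction(size_bytes, unit=UNIT):
    """Share of the occupied space that holds no data, for a single file."""
    occupied = on_disk_bytes(size_bytes, unit)
    if occupied == 0:
        return 0.0  # an empty file occupies no data units in this model
    return (occupied - size_bytes) / occupied


def aggregate_waste(used_k, data_k):
    """Same ratio for a whole directory, given the totals quoted in K."""
    return (used_k - data_k) / used_k


# Per-word index files from the post
print(round(waste_fraction(96735) * 100, 1))  # "robotics" -> 0.6
print(round(waste_fraction(131) * 100, 1))    # "robotix"  -> 87.2

# Directory totals from the post
print(round(aggregate_waste(14633, 11014) * 100))  # lugnet.general -> 25
print(round(aggregate_waste(4942, 3202) * 100))    # "co" words     -> 35
```

The per-file waste is at most one unit, so it shrinks as a proportion once a word's file grows well beyond 1K, which matches the observation that the index becomes less wasteful over time.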

