Subject: Re: reverse indexes (was Re: Text::Query)
Newsgroups: lugnet.off-topic.geek
Date: Thu, 26 Aug 1999 02:04:06 GMT
In lugnet.off-topic.geek, "Robert Munafo" <munafo@NOgcctech.SPAMcom> writes:

> > The total amount of disk space currently taken by all articles in the
> > system is 131 MB.  The word index takes up 155 MB [...] So it's all
> > still rather nice and small.
>
> I'm having trouble figuring out how that's even possible. For example, the
> first sample line in the 'jeremy' file above appears to say that 'jeremy'
> occurs as the 19th, 590th, 595th, 600th and 605th words of
> http://www.lugnet.com/admin/general/?n=2701 (which looks more or less
> right if you look at the message). So you used one line in the index for
> 5 words in the message.
>
> But most words occur only once in an article, right? So there ought to be a
> whole lot of overhead for the article ID and datestamp due to having lots of
> index lines with only one word each. I would think your index would be about 5
> times the size of the articles themselves!

There is a lot of overhead, yes, but it's cancelled out by:

1) NNTP headers are ignored during indexing (except the Subject and From
   values).

2) In the NNTP article body, the following are ignored during indexing:
   - Canonically quoted content (including lines such as "In lugnet.foo.foo,
     Foo foo <foo@foo.foo> wrote:" and similar forms)
   - Non-canonically quoted content, wherever possible (text set off by
     abominations such as "----Original Message----")
   - Canonically and semi-canonically formatted sigs
   - About 50 stopwords such as a, an, the, it, its, be, am, is, etc.

3) The block size of the disk is 1K, so a 1.5K article wastes .5K of disk,
   and a 2.1K article wastes .9K of disk.  The 7007 articles which at
   present appear in the lugnet.general newsgroup, for example, take up
   14633K of disk space, even though they total only 11014K of actual data
   (25% of the used space is waste).

   There's a lot of wastage in the index as well, but mostly in the words
   that occur infrequently.  The 2083 indexed words which at present begin
   with "co", for example, take up 4942K of disk space, even though they
   only total 3202K of actual data (35% of the used space is waste).  As the
   index grows, fewer and fewer words are new to the index, and it gradually
   becomes less wasteful of disk space.  A relatively common word like
   "robotics" currently takes up 95K of disk space from 96735 bytes (only
   .6% waste), while a relatively uncommon word like "robotix" takes up 1K
   from 131 bytes (87% waste).
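
The waste figures in point 3 can be reproduced with a little arithmetic. The sketch below assumes a 1K (1024-byte) allocation unit and that each article or index word lives in its own file; the function names are made up for illustration and are not part of the actual indexer.

```python
BLOCK = 1024  # assumed allocation unit in bytes (the "1K" from the post)

def allocated_bytes(size_bytes, block=BLOCK):
    """Disk space a file of size_bytes occupies, rounded up to whole blocks."""
    return ((size_bytes + block - 1) // block) * block

def waste_fraction(size_bytes, block=BLOCK):
    """Fraction of the allocated space that holds no data."""
    used = allocated_bytes(size_bytes, block)
    return (used - size_bytes) / used if used else 0.0

# Per-word index files quoted in the post.
for word, size in [("robotics", 96735), ("robotix", 131)]:
    kb = allocated_bytes(size) // BLOCK
    print(f"{word}: {kb}K on disk, {waste_fraction(size):.1%} waste")

# Aggregate figures quoted in the post (K on disk, K of actual data).
for label, on_disk, data in [("lugnet.general", 14633, 11014),
                             ('"co" words', 4942, 3202)]:
    print(f"{label}: {(on_disk - data) / on_disk:.1%} waste")
```

Small files lose most of their last block, which is why a rare word like "robotix" is almost all waste while a common one like "robotics" is almost none.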

Anyway, so it all kind of balances out.  Disk space is cheap and only
getting cheaper.  :)
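
For the curious, the body filtering in point 2 could be sketched roughly like this. It is only an illustration, not the real indexer: the regular expressions, the short stopword set (the post names just a few of the roughly 50 words), and the function name are all assumptions.

```python
import re

# Hypothetical stopword set; only the words named in the post are listed.
STOPWORDS = {"a", "an", "the", "it", "its", "be", "am", "is"}

QUOTE_LINE = re.compile(r"^\s*>")  # canonically quoted line
ATTRIBUTION = re.compile(r"^In \S+, .* (writes|wrote):\s*$")
ORIGINAL_MSG = re.compile(r"^-+\s*Original Message\s*-+", re.IGNORECASE)

def indexable_words(body):
    """Return the words of an article body that would be worth indexing."""
    words = []
    for line in body.splitlines():
        if line.rstrip() == "--":
            break  # canonical sig delimiter: everything below is the sig
        if ORIGINAL_MSG.match(line):
            break  # assume the rest is non-canonically quoted text
        if QUOTE_LINE.match(line) or ATTRIBUTION.match(line):
            continue
        for word in re.findall(r"[a-z0-9']+", line.lower()):
            if word not in STOPWORDS:
                words.append(word)
    return words

sample = (
    "In lugnet.foo.foo, Foo foo <foo@foo.foo> wrote:\n"
    "> quoted text here\n"
    "The index is small\n"
    "-- \n"
    "Todd\n"
)
print(indexable_words(sample))
```

Dropping quoted text and sigs before counting words is what keeps the number of index entries per article well below the article's raw word count.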

--Todd


