Subject: Re: Text::Query
Newsgroups: lugnet.off-topic.geek
Date: Wed, 25 Aug 1999 02:54:56 GMT
In lugnet.off-topic.geek, Sproaticus <jsproat@io.com> writes:
> Todd Lehman wrote:
> > In lugnet.general, Todd Lehman writes:
> > > In lugnet.general, Jeremy H. Sproat writes:
> > > > Whoa!  Todd dude!  You're using Text::Query, aren't you?  Come on,
> > > > fess up.
> > > No, it's using a homebrew.  I probably should look into Text::Query though,
> > > if it's loaded with callbacks and/or easy member-function overloading.
> > OK, I just took a look at Text::Query.  From what I can tell from reading
> > the docs and the source, it looks as though it's a brute force text scanner
> > rather than an inverted-index generator.
>
> Yah.  I actually was getting excited over the similarity in syntax; I hadn't
> even thought of the need for indexing such a huge database as LUGNET.  BTW,
> what kind of scanner are you using for the index builder?  Does it just
> break apart words separated by whitespace / non-alphanumeric / etc. and keep
> a running word/URI count, or is it something more magical?

Not very magical, no.  It breaks text apart by anything non-alphanumeric,
where the "alpha" part includes ISO-8859-1 international letters like ã, ñ,
ß, and ø, etc.  It converts everything to lowercase for indexing and
collapses apostrophes.  Hyphenated words and words with dots or underscores
get split into multiple words (not ideal).  No stemming.
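
A minimal sketch of the word-splitting rules just described, in Python rather than whatever the real indexer is written in. The function name tokenize, the exact character ranges, and the reading of "collapses apostrophes" as "deletes them so the word stays in one piece" are all assumptions made for illustration.

```python
import re

# Assumption: "alpha" means ASCII letters plus the ISO-8859-1 letters
# U+00C0..U+00FF, leaving out the multiplication sign (U+00D7) and the
# division sign (U+00F7), which are not letters.
WORD_RE = re.compile(r"[0-9A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF]+")


def tokenize(text):
    """Return the index words of text, in order of appearance."""
    # Assumption: collapsing apostrophes means deleting them, so that
    # "don't" is indexed as the single word "dont".
    text = text.replace("'", "")
    # Everything outside WORD_RE (hyphens, dots, underscores, spaces, ...)
    # acts as a separator.  Lowercase each word for indexing.  No stemming.
    return [word.lower() for word in WORD_RE.findall(text)]
```

With these rules an address such as jsproat@io.com comes out as the three words jsproat, io, and com, and a hyphenated word such as e-mail comes out as two words.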

Each word gets its own file, for example

   jeremy  => .../j/je/jeremy.
   jsproat => .../j/js/jsproat.
   sproat  => .../s/sp/sproat.
   io      => .../i/io.
   com     => .../c/co/com.
   etc.
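
The mapping above can be sketched as follows. This is inferred only from the five examples shown; the function name index_path is hypothetical, the elided root directory is left out, and the handling of one-letter words is a guess because no example covers it.

```python
def index_path(word):
    """Return the relative path of the index file for word.

    Inferred from the examples: one directory level for the first letter,
    one for the first two letters, then the word followed by a dot.
    A prefix is used as a directory only if it is shorter than the word,
    which is why 'io' lands in i/io. and not i/io/io.
    Assumption: a one-letter word gets no directory level at all.
    """
    dirs = [word[:n] for n in (1, 2) if n < len(word)]
    return "/".join(dirs + [word + "."])
```

For example, index_path("jeremy") gives "j/je/jeremy." and index_path("io") gives "i/io.", both relative to the root directory that is elided above.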

and inside the file, the entries look like this (this is a snapshot of the
last 16 lines of the 'jeremy' file):

   admin.general:2701 935490152 19 590 595 600 605
   admin.general:2702 935512318 1
   admin.general:2703 935512598 1
   admin.general:2704 935512754 1
   off-topic.geek:420 935513680 1
   cad.dev:2768 935516545 1
   off-topic.fun:dear-lego:440
   dear-lego:440 935516749 1
   cad.dev:2770 935522841 1
   cad:2444 935524881 1
   admin.general:2706 935530329 16
   admin.general:2707 935532070 28 36 61 101
   admin.general:2708 935533894 1 54
   off-topic.debate:1783 935534734 1
   publish:740 935535651 1
   faq:692 935536903 1

It currently contains 1486 lines and is 38197 bytes.  It can be processed in
a tiny fraction of a second, especially if it's already in the cache.
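
To make the layout of an ordinary entry concrete, here is a small parsing sketch. The field names are assumptions: the second field is consistent with a Unix timestamp in seconds (the sample values fall in late August 1999), and the trailing integers are kept as opaque numbers because their meaning is not spelled out here. Lines with a single field are not ordinary entries and are skipped.

```python
from typing import NamedTuple, Optional, Tuple


class Posting(NamedTuple):
    group: str                 # newsgroup name, e.g. "admin.general"
    article: int               # article number within that group
    timestamp: int             # assumed: Unix time in seconds
    numbers: Tuple[int, ...]   # trailing integers, meaning not assumed


def parse_posting(line: str) -> Optional[Posting]:
    """Parse one 'group:article timestamp n n ...' line, or return None."""
    fields = line.split()
    if len(fields) < 2:
        # Blank line, or a single-field line that is not an ordinary entry.
        return None
    group, article = fields[0].rsplit(":", 1)
    return Posting(group, int(article), int(fields[1]),
                   tuple(int(f) for f in fields[2:]))
```

Reading a whole word file is then one call per line, which fits the claim that a file of about 1500 lines can be processed very quickly.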

The 7th line shown above -- the one starting with "off-topic.fun" -- is a
crosspost directive.  It says, "if you're looking for the word 'jeremy' in
the .off-topic or .off-topic.fun categories, look also in .dear-lego article
number 440 even though you're not specifically looking for articles outside
of .off-topic or .off-topic.fun."
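
A sketch of how a category-restricted lookup could honor such a directive. The function names in_scope and matching_articles are hypothetical, and the rule that a category matches itself and everything nested below it is read off the ".off-topic or .off-topic.fun" wording above; this is not the real search code.

```python
def in_scope(group, scope):
    """True if group is the scope category itself or nested below it."""
    return group == scope or group.startswith(scope + ".")


def matching_articles(lines, scope):
    """Return (group, article) pairs from one word file that match scope.

    An ordinary entry matches when its own group is in scope.  A crosspost
    directive 'source:target:number' (a line with a single field) adds the
    target article when the source group is in scope.
    """
    hits, seen = [], set()
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        if len(fields) == 1:
            source, group, number = fields[0].split(":")
            wanted = in_scope(source, scope)
        else:
            group, number = fields[0].rsplit(":", 1)
            wanted = in_scope(group, scope)
        key = (group, int(number))
        if wanted and key not in seen:
            seen.add(key)
            hits.append(key)
    return hits
```

Run over the 16 sample lines with the scope "off-topic", this picks up off-topic.geek:420 and off-topic.debate:1783 directly and dear-lego:440 through the directive.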

The total amount of disk space currently taken by all articles in the system
is 131 MB.  The word index takes up 155 MB and the tree/category indexes
(which enable things like showing the top N new articles in any category
instantly) take up another 9 MB.  So it's all still rather nice and small.


> How long does it take to build the index for LUGNET?

Since the disk has to be hit dozens of times for each article (appending to
multiple files), it can take as long as 1 whole second to index a single
article (not CPU time -- that's minimal -- but actual elapsed time waiting
for the disk heads to come around).  So it took 22 hours to rebuild the
index recently from scratch.  That's horribly inefficient, however, since it
wasn't doing any write-spooling even though it was doing a batch-mode index
operation.  If I put a little bit of smarts in to spool up the appends and
write them out only every 25 megabytes or so, then it would speed up the
overall indexing considerably (by approximately 50x according to my
calculations).  I'll probably do this the next time the index needs to be
rebuilt.  (This was the only rebuild since the index was first started last
fall...it's meant to just run and run without problems...the only reason I
reindexed it was to add crossposting directives and timestamps.)
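
The write-spooling idea could look roughly like this. It is a sketch of the planned improvement, not existing code: the class name AppendSpool, its methods, and the choice of Latin-1 for the files are assumptions, and the 25 MB default simply mirrors the figure mentioned above.

```python
import os
from collections import defaultdict


class AppendSpool:
    """Collect appends in memory and write them out in large batches."""

    def __init__(self, root, limit_bytes=25 * 1024 * 1024):
        self.root = root
        self.limit_bytes = limit_bytes
        self.pending = defaultdict(list)   # relative path -> queued lines
        self.pending_bytes = 0

    def append(self, relpath, line):
        """Queue one line for the file at relpath; flush when the spool is full."""
        data = line + "\n"
        self.pending[relpath].append(data)
        self.pending_bytes += len(data.encode("latin-1"))
        if self.pending_bytes >= self.limit_bytes:
            self.flush()

    def flush(self):
        """Open each touched file once and append all of its queued lines."""
        for relpath in sorted(self.pending):
            path = os.path.join(self.root, relpath)
            os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
            with open(path, "a", encoding="latin-1") as f:
                f.writelines(self.pending[relpath])
        self.pending.clear()
        self.pending_bytes = 0
```

The saving comes from each word file being opened once per flush instead of once per article that contains the word. A batch run has to call flush() one last time at the end so the final partial spool is not lost.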

The present indexing system is scalable for probably another 100x or so of
traffic before new technology or algorithms are needed.


> Does this length of time have an impact on when new articles can be
> indexed?  [...]

No, it's a purely incremental index.  The time required to index an article
is dependent solely on the complexity of the article (the number of unique
words), not on how many other articles have been indexed so far.  The
indexer just runs continuously in the background checking for new articles
to index and only consumes a very minimal amount of processor time.
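
Why the cost depends only on the article itself can be seen in a small sketch: each unique word produces exactly one appended line, whatever the size of the existing index. The function name index_lines is hypothetical, and treating the trailing numbers of an entry as 1-based word positions is an assumption made for illustration only.

```python
from collections import defaultdict


def index_lines(article_id, timestamp, words):
    """Return {word: line to append to that word's file} for one article.

    Assumption: the numbers after the timestamp are 1-based positions of
    the word within the article.
    """
    positions = defaultdict(list)
    for pos, word in enumerate(words, start=1):
        positions[word].append(pos)
    return {
        word: " ".join([article_id, str(timestamp)] + [str(p) for p in plist])
        for word, plist in positions.items()
    }
```

The number of appends equals the number of entries in the returned dictionary, which is the number of unique words in the article. Nothing in the function looks at previously indexed articles.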

--Todd
