Geeking : 422



Subject:	Re: Text::Query
Newsgroups:	lugnet.off-topic.geek
Date:	Wed, 25 Aug 1999 02:54:56 GMT

In lugnet.off-topic.geek, Sproaticus <jsproat@io.com> writes:
> Todd Lehman wrote:
> > In lugnet.general, Todd Lehman writes:
> > > In lugnet.general, Jeremy H. Sproat writes:
> > > > Whoa! Todd dude! You're using Text::Query, aren't you? Come on,
> > > > fess up.
> > > No, it's using a homebrew. I probably should look into Text::Query though,
> > > if it's loaded with callbacks and/or easy member-function overloading.
> > OK, I just took a look at Text::Query. From what I can tell from reading
> > the docs and the source, it looks as though it's a brute force text scanner
> > rather than an inverted-index generator.
>
> Yah. I actually was getting excited over the similarity in syntax; I hadn't
> even thought of the need for indexing such a huge database as LUGNET. BTW,
> what kind of scanner are you using for the index builder? Does it just
> break apart words separated by whitespace / non-alphanumeric / etc. and keep
> a running word/URI count, or is it something more magical?

Not very magical, no. It breaks text apart by anything non-alphanumeric,
where the "alpha" part includes ISO-8859-1 international letters like ã, ñ,
ß, and ø, etc. It converts everything to lowercase for indexing and
collapses apostrophes. Hyphenated words and words with dots or underscores
get split into multiple words (not ideal). No stemming.

Each word gets its own file, for example

   jeremy  => .../j/je/jeremy
   jsproat => .../j/js/jsproat
   sproat  => .../s/sp/sproat
   io      => .../i/io
   com     => .../c/co/com
   etc.

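The scanning and file-naming rules described above can be sketched roughly as follows. This is an illustrative Python sketch, not the actual LUGNET indexer (which is described only as a homebrew). The names `tokenize` and `word_path` are made up for the example. "Collapses apostrophes" is assumed to mean the apostrophe is simply deleted, and the handling of one-letter words is a guess, since the post only shows words of two or more letters. The elided `...` prefix of the paths is left out, so only the relative part is produced.

```python
import re

# ASCII letters and digits plus the ISO-8859-1 accented-letter ranges
# (U+00C0-U+00FF, skipping the multiplication and division signs).
# Hyphens, dots and underscores are not in the class, so they split words.
WORD_RE = re.compile(r"[0-9A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF]+")


def tokenize(text):
    """Split text into lowercase index words. No stemming."""
    # Assumption: "collapsing" an apostrophe means deleting it,
    # so "Jeremy's" is indexed as "jeremys".
    text = text.replace("'", "")
    return [word.lower() for word in WORD_RE.findall(text)]


def word_path(word):
    """Relative path of a word's index file, e.g. 'jeremy' -> 'j/je/jeremy'."""
    parts = []
    for level in (word[:1], word[:2], word):
        # Skip a level that repeats the previous one, so that 'io' gives
        # 'i/io' as in the post. One-letter words (not shown in the post)
        # come out as just the letter; that part is an assumption.
        if not parts or parts[-1] != level:
            parts.append(level)
    return "/".join(parts)
```

For example, `tokenize("Jeremy's e-mail: jsproat@io.com")` gives `['jeremys', 'e', 'mail', 'jsproat', 'io', 'com']`, and `word_path("io")` gives `'i/io'`.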
Inside the file, the entries look like this (this is a snapshot of the last
16 lines of the 'jeremy' file):

   admin.general:2701 935490152 19 590 595 600 605
   admin.general:2702 935512318 1
   admin.general:2703 935512598 1
   admin.general:2704 935512754 1
   off-topic.geek:420 935513680 1
   cad.dev:2768 935516545 1
   off-topic.fun:dear-lego:440
   dear-lego:440 935516749 1
   cad.dev:2770 935522841 1
   cad:2444 935524881 1
   admin.general:2706 935530329 16
   admin.general:2707 935532070 28 36 61 101
   admin.general:2708 935533894 1 54
   off-topic.debate:1783 935534734 1
   publish:740 935535651 1
   faq:692 935536903 1

It currently contains 1486 lines and is 38197 bytes. It can be processed in
a tiny fraction of a second, especially if it's already in the cache.

The 7th line shown above -- the one starting with "off-topic.fun" -- is a
crosspost directive. It says, "if you're looking for the word 'jeremy' in
the .off-topic or .off-topic.fun categories, look also in .dear-lego article
number 440 even though you're not specifically looking for articles outside
of .off-topic or .off-topic.fun."

The total amount of disk space currently taken by all articles in the system
is 131 MB. The word index takes up 155 MB and the tree/category indexes
(which enable things like showing the top N new articles in any category
instantly) take up another 9 MB. So it's all still rather nice and small.

> How long does it take to build the index for LUGNET?

Since the disk has to be hit dozens of times for each article (appending to
multiple files), it can take as long as 1 whole second to index a single
article (not CPU time -- that's minimal -- but actual elapsed time waiting
for the disk heads to come around). So it took 22 hours to rebuild the
index recently from scratch.

That's horribly inefficient, however, since it wasn't doing any
write-spooling even though it was doing a batch-mode index operation.

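As a rough illustration of the entry format and the crosspost directive described earlier in this message, here is a small Python sketch that reads such lines and answers "which articles should a search restricted to this category consider?". It is one reading of the description, not the real lookup code. The field layout is assumed to be `group:article timestamp numbers...` for ordinary entries and `category:group:article` for directives. The trailing numbers are kept as opaque integers because the post does not say what they mean, and all names (`Entry`, `Crosspost`, `parse_line`, `articles_in`) are invented for the example.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    group: str
    article: int
    timestamp: int
    numbers: tuple  # meaning not stated in the post; kept as-is


class Crosspost(NamedTuple):
    category: str  # where the searcher is looking
    group: str     # where the article actually lives
    article: int


def parse_line(line):
    """Parse one line of a word file into an Entry or a Crosspost."""
    fields = line.split()
    head = fields[0].split(":")
    if len(fields) == 1 and len(head) == 3:
        return Crosspost(head[0], head[1], int(head[2]))
    group, article = head
    numbers = tuple(int(n) for n in fields[2:])
    return Entry(group, int(article), int(fields[1]), numbers)


def in_category(group, category):
    """True if group is the category itself or nested below it."""
    return group == category or group.startswith(category + ".")


def articles_in(lines, category):
    """(group, article) pairs a search limited to category should consider."""
    hits = []
    for item in map(parse_line, lines):
        # An ordinary entry is matched on its own group; a directive is
        # matched on the category it was written for.
        scope = item.category if isinstance(item, Crosspost) else item.group
        ref = (item.group, item.article)
        if in_category(scope, category) and ref not in hits:
            hits.append(ref)
    return hits
```

With the four lines around the directive in the snapshot, a search limited to `off-topic` would return `off-topic.geek:420` and, through the directive, `dear-lego:440`, while a search limited to `cad` would return only `cad.dev:2768`.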
If I put a little bit of smarts in to spool up the appends and write them
out only every 25 megabytes or so, then it would speed up the overall
indexing considerably (by approximately 50x according to my calculations).
I'll probably do this the next time the index needs to be rebuilt. (This
was the only rebuild since the index was first started last fall...it's
meant to just run and run without problems...the only reason I reindexed it
was to add crossposting directives and timestamps.)

The present indexing system is scalable for probably another 100x or so of
traffic before new technology or algorithms are needed.

> Does this length of time have an impact on when new articles can be
> indexed? [...]

No, it's a purely incremental index. The time required to index an article
is dependent solely on the complexity of the article (the number of unique
words), not on how many other articles have been indexed so far. The
indexer just runs continuously in the background checking for new articles
to index and only consumes a very minimal amount of processor time.

--Todd
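The write-spooling idea mentioned in the message (hold the appends in memory and write them out only every 25 megabytes or so) might look something like the Python sketch below. It is only a sketch of the general technique under simplifying assumptions. The class name `AppendSpool` and its parameters are hypothetical, each word's file is placed flat under `root` instead of in the nested layout the post describes, and nothing here reflects how the real indexer is or would be written.

```python
import os
import tempfile
from collections import defaultdict


class AppendSpool:
    """Buffer per-word appends and write them out in batches."""

    def __init__(self, root, limit_bytes=25 * 1024 * 1024):
        self.root = root
        self.limit_bytes = limit_bytes
        self.pending = defaultdict(list)  # word -> buffered lines
        self.pending_bytes = 0
        self.flushes = 0

    def append(self, word, line):
        data = line + "\n"
        self.pending[word].append(data)
        self.pending_bytes += len(data.encode("utf-8"))
        if self.pending_bytes >= self.limit_bytes:
            self.flush()

    def flush(self):
        # One open/append/close per word per batch, instead of one per
        # word per article.
        for word in sorted(self.pending):
            path = os.path.join(self.root, word)
            with open(path, "a", encoding="utf-8") as f:
                f.writelines(self.pending[word])
        self.pending.clear()
        self.pending_bytes = 0
        self.flushes += 1


def demo():
    """Tiny run with a 40-byte limit so the batching is visible."""
    with tempfile.TemporaryDirectory() as root:
        spool = AppendSpool(root, limit_bytes=40)
        spool.append("jeremy", "admin.general:2702 935512318 1")
        spool.append("jeremy", "admin.general:2703 935512598 1")  # reaches the limit, flushes
        spool.append("sproat", "cad:2444 935524881 1")
        spool.flush()  # write out whatever is left at the end of the batch
        with open(os.path.join(root, "jeremy"), encoding="utf-8") as f:
            jeremy = f.read()
        return spool.flushes, sorted(os.listdir(root)), jeremy
```

The trade-off in a design like this is that anything still in the buffer is lost if the process dies before a flush, which is probably acceptable for a batch rebuild that can be rerun but less so for the continuous background indexer.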

Message has 1 Reply:

		reverse indexes (was Re: Text::Query)
(...) I'm having trouble figuring out how that's even possible. For example, the first sample line in the 'jeremy' file above appears to say that 'jeremy' occurs as the 19th, 590th, 595th 600th and 605th words of (URL) (which looks more or less (...) (26 years ago, 25-Aug-99, to lugnet.off-topic.geek)

Message is in Reply To:

		Re: Text::Query
(...) Yah. I actually was getting excited over the similarity in syntax; I hadn't even thought of the need for indexing such a huge database as LUGNET. BTW, what kind of scanner are you using for the index builder? Does it just break apart words (...) (26 years ago, 24-Aug-99, to lugnet.off-topic.geek)
