The LUGNET News search function is now re-enabled. I completely revamped
the index data structures and list-merge algorithm and rewrote the core
query engine in C. It's a much more solid implementation.
Everyone's patience during the outage is much appreciated!
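(For the curious, here is a minimal illustrative sketch -- not the actual
engine code -- of the kind of sorted list-merge such a query engine can use
to intersect two word-hit lists. The article IDs, list contents, and the
"intersect" helper are all made up for the example.)

    /* Illustrative only: intersect two sorted word-hit lists with a
     * linear merge -- the basic step of a list-merge query engine.
     * Article IDs are assumed to be sorted ascending in each list. */
    #include <stdio.h>
    #include <stddef.h>

    /* Write the IDs present in both a[] and b[] into out[]; return count. */
    static size_t intersect(const int *a, size_t na,
                            const int *b, size_t nb, int *out)
    {
        size_t i = 0, j = 0, n = 0;
        while (i < na && j < nb) {
            if (a[i] < b[j])       i++;
            else if (a[i] > b[j])  j++;
            else { out[n++] = a[i]; i++; j++; }  /* hit in both lists */
        }
        return n;
    }

    int main(void)
    {
        int lego[] = { 3, 8, 15, 42, 99 };   /* articles containing "lego" */
        int new_[] = { 8, 15, 77, 99 };      /* articles containing "new"  */
        int both[4];
        size_t n = intersect(lego, 5, new_, 4, both);
        for (size_t i = 0; i < n; i++)
            printf("article %d matches both words\n", both[i]);
        return 0;
    }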
Functional improvements:
* Word proximity sensitivity -- two or more words closer together match better
than the same words far apart.
* Word-order sensitivity -- words in a specific order match better than the
same words out of order. For example, "new lego" returns different matches
than "lego new" (try it!). A rough scoring sketch follows this list.
* And you can still prefix words with + or - to require inclusion or
exclusion, respectively.
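(Again, just a sketch of the idea rather than the real ranking code: one
simple way to reward proximity and query order is to score each pair of word
hits by inverse distance and boost the score when the words appear in the
same order as the query. The "pair_score" helper and the 2x bonus are
arbitrary examples.)

    /* Illustrative only: score a pair of word hits within one article.
     * Closer hits score higher, and hits in the same order as the
     * query get an (arbitrary) 2x bonus. */
    #include <stdio.h>
    #include <stdlib.h>

    /* pos1, pos2: word offsets of the first and second query term. */
    static double pair_score(int pos1, int pos2)
    {
        int dist = abs(pos2 - pos1);
        if (dist < 1)
            dist = 1;                        /* guard against division by zero */
        double score = 1.0 / (double)dist;   /* proximity: closer is better    */
        if (pos2 > pos1)
            score *= 2.0;                    /* same order as the query        */
        return score;
    }

    int main(void)
    {
        /* "new lego": adjacent and in order vs. reversed vs. far apart */
        printf("adjacent, in order : %.3f\n", pair_score(10, 11));
        printf("adjacent, reversed : %.3f\n", pair_score(11, 10));
        printf("far apart, in order: %.3f\n", pair_score(10, 50));
        return 0;
    }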
Cosmetic improvements:
* Graphical horizontal bars showing match rankings. (Stronger matches appear
with wider bars than weaker matches.)
* More streamlined, easier-to-read results header.
Internal improvements:
* Approximately 100 times faster (once the query engine receives the
request). Typical CPU utilization is less than 0.1 seconds even for
queries that generate tens of thousands of word hits. (Actual times
may vary depending on disk activity as word-hit lists are accessed.)
* The query engine can take any arbitrary list of news articles as a search
filter (include or exclude). This is how subgroup searches are handled
now and will pave the way for cooler things later. (A sketch of this
filtering idea follows this list.)
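(A hypothetical sketch of what such an article-list filter might look like at
the lowest level: walk a sorted hit list against a sorted include-or-exclude
article list. The "apply_filter" helper and the IDs are invented for the
example; the real engine's interface surely differs.)

    /* Hypothetical sketch: filter a sorted word-hit list against a
     * sorted article list, either keeping only members (include=1)
     * or dropping members (include=0). */
    #include <stdio.h>
    #include <stddef.h>

    static size_t apply_filter(const int *hits, size_t nh,
                               const int *filt, size_t nf,
                               int include, int *out)
    {
        size_t i, j = 0, n = 0;
        for (i = 0; i < nh; i++) {
            while (j < nf && filt[j] < hits[i])
                j++;                               /* advance the filter list */
            int member = (j < nf && filt[j] == hits[i]);
            if (member == include)
                out[n++] = hits[i];
        }
        return n;
    }

    int main(void)
    {
        int hits[] = { 3, 8, 15, 42, 99 };  /* word hits                   */
        int filt[] = { 8, 42 };             /* e.g. articles in a subgroup */
        int out[5];
        size_t n = apply_filter(hits, 5, filt, 2, 1, out);  /* include */
        for (size_t i = 0; i < n; i++)
            printf("kept article %d\n", out[i]);
        return 0;
    }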
To do soon:
* Implement date range restrictions on searches. Currently a search covers the
entire corpus of documents and assigns equal date-weight to all documents
regardless of age. This is actually working now at the inner levels and
at the URL level, but there is not yet a forms-based "advanced search"
user interface for specifying a target date and proximity. (A rough sketch
of the date-weighting idea follows this item.)
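(Purely speculative sketch of the date-weighting idea: give full weight to
articles at the target date and let the weight fall off with distance. The
"date_weight" helper and the 30-day half-life are arbitrary examples, not
what the engine actually does.)

    /* Speculative sketch: weight an article by how close its date is
     * to a target date.  Weight halves every 30 days (arbitrary). */
    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>

    static double date_weight(long days_off)
    {
        const double half_life = 30.0;
        return pow(0.5, (double)labs(days_off) / half_life);
    }

    int main(void)
    {
        printf("same day      : %.3f\n", date_weight(0));
        printf("one month off : %.3f\n", date_weight(30));
        printf("one year off  : %.3f\n", date_weight(365));
        return 0;
    }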
To do someday:
* Facilitate searching within specific threads. This is a low-level data
list issue.
* Facilitate searching within arbitrary collections of groups (as opposed to
a single group or group hierarchy). This is mostly a user-interface issue.
* Facilitate searching within search results (i.e., "search only within these
results below").
* Rework the text indexer so that it doesn't throw out "funny characters" in
words, and then reindex the entire document corpus from scratch. (Currently,
words like "won't" are converted to "wont" and words like "S@H" are ignored
entirely.)
* Filter out canceled articles from returned results.
--Todd
p.s. The article index database currently contains more than 10,000,000
word-hits.