Subject: Re: .loc.au stats for October
Newsgroups: lugnet.loc.au
Date: Tue, 5 Nov 2002 14:02:09 GMT
  
In lugnet.loc.au, Kerry Raymond writes:
Of course, to do this we'd need to work out how to measure the quality per
word or post quality values...

The quality of a post can be coarsely approximated by applying quality metrics
to each individual word, giving an overall posting content quality. So, say, a
message involving quality words like "sensate", "post-expressionalism" and
"45.7%" would generate a higher quality score than a message involving words
like "lots", "stuff", and "wow". Indeed, one can even apply severe negative
volume metrics to such words as "wasssssssssuuupppp" and "insectoids", thus
reducing any postings involving such words to hideously low levels,
unredeemable even in such contexts as "45.7% of all insectoid sets incorporate
post-expressionalism detectable to the more sensate builder". Any message
involving lots of numbers is almost certainly good, as these are either set
numbers, part numbers, or Richie's monthly statistics, so numbers will
generate high quality word ratings.
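
A minimal sketch of the per-word scoring described above. The words come from the post, but every numeric value, the `word_quality` and `content_quality` names, and the choice to sum word scores (rather than average them) are assumptions made purely for illustration.

```python
import re

# Hypothetical per-word scores: the post names the words but gives no values.
WORD_SCORES = {
    "sensate": 3.0,
    "post-expressionalism": 5.0,
    "lots": -1.0,
    "stuff": -1.0,
    "wow": -1.0,
    "insectoid": -50.0,
    "insectoids": -50.0,
}
NUMBER_SCORE = 2.0      # assumed bonus for any token containing a digit
DEFAULT_SCORE = 0.0     # assumed score for words with no entry
WASSUP_SCORE = -100.0   # assumed "severe negative" penalty
WASSUP = re.compile(r"^wa+s+u+p+$")  # matches "wassup" however it is stretched


def word_quality(token: str) -> float:
    """Score one whitespace-separated token."""
    word = token.lower().strip('.,;:!?"()')
    if WASSUP.match(word):
        return WASSUP_SCORE
    if any(ch.isdigit() for ch in word):
        return NUMBER_SCORE
    return WORD_SCORES.get(word, DEFAULT_SCORE)


def content_quality(text: str) -> float:
    """Posting content quality as the sum of its word scores (an assumption)."""
    return sum(word_quality(token) for token in text.split())
```

With these made-up values, the example sentence about insectoid sets still ends up well below zero, which is the behaviour the paragraph asks for.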

However, the posting content quality is not the final score. You must then
multiply it by the bytes of new content divided by the bytes of included quoted
content. This ensures that "Me too" messages are forced down the quality metric
no matter how frequently post-expressionalism is mentioned.
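
The new-to-quoted scaling can be sketched as below. The function name `ratio_adjusted_quality` is hypothetical, and leaving the score unscaled when nothing is quoted is an assumption, since the post does not say what happens in that case.

```python
def ratio_adjusted_quality(content_quality: float,
                           new_text: str,
                           quoted_text: str) -> float:
    """Scale a content quality score by bytes of new text over bytes quoted."""
    new_bytes = len(new_text.encode("utf-8"))
    quoted_bytes = len(quoted_text.encode("utf-8"))
    if quoted_bytes == 0:
        # Assumption: a post that quotes nothing keeps its score as-is,
        # which also avoids dividing by zero.
        return content_quality
    return content_quality * new_bytes / quoted_bytes
```

For instance, a six-byte "Me too" sitting on top of sixty bytes of quoted text keeps only a tenth of its content quality.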

Then you take any quoted content and recursively apply the quality analysis to
it, and then compare the quality of the new content with the quoted content.
The quality of the quoted content is then a ceiling on the possible quality of
the new content. This is intended to discourage the proliferation of threads of
low quality by downgrading all subsequent contributions no matter how sensate.

This produces the primary quality score for the posting.
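
One way to read the recursion and ceiling, as a rough sketch: the `Post` structure and `primary_quality` name are hypothetical, and treating the quoted message's own primary score as the ceiling is an interpretation of the description rather than something the post spells out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Post:
    content_quality: float           # word-level score of the new text only
    new_bytes: int                   # bytes of new content
    quoted_bytes: int = 0            # bytes of included quoted content
    quoted: Optional["Post"] = None  # the message being quoted, if any


def primary_quality(post: Post) -> float:
    """Ratio-scaled score, capped by the primary quality of the quoted post."""
    if post.quoted_bytes > 0:
        score = post.content_quality * post.new_bytes / post.quoted_bytes
    else:
        score = post.content_quality  # assumption: no quoting, no scaling
    if post.quoted is None:
        return score
    # The quoted content's quality is a ceiling on the new content's quality.
    return min(score, primary_quality(post.quoted))
```

Because the ceiling is applied at every level, one weak post early in a thread limits everything quoted from it afterwards.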

Finally, some people are fundamentally low-quality authors of postings, and this
must be incorporated to produce the modified quality score. This cannot be done
initially but can be introduced once a certain volume of primary quality score
data has been collected. By determining average primary quality for a given
author, one can determine an author quality as a moving average. You then take
the ratio of specific author quality divided by average author quality to get
the author-multiplier, which is then (as the name suggests) multiplied by the
primary quality score to produce the modified quality score.
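
A small sketch of the author-multiplier, computed from primary scores only. The `author_multipliers` name and the input shape are hypothetical, a plain mean stands in for the moving average, and "average author quality" is assumed to mean the mean of the per-author averages.

```python
from collections import defaultdict
from statistics import mean


def author_multipliers(primary_scores):
    """Map each author to (their average primary score / average over authors).

    primary_scores: iterable of (author, primary_quality_score) pairs.
    """
    by_author = defaultdict(list)
    for author, score in primary_scores:
        by_author[author].append(score)

    # Specific author quality: a plain mean stands in for the moving average.
    author_quality = {a: mean(scores) for a, scores in by_author.items()}

    # Assumption: average author quality is the mean of the per-author values.
    average_author_quality = mean(author_quality.values())
    if average_author_quality == 0:
        raise ValueError("average author quality is zero; ratio is undefined")

    return {a: q / average_author_quality for a, q in author_quality.items()}
```

The modified quality score of a posting is then its primary score times the multiplier of its author.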

Note. It is very important to determine the specific author quality and average
author quality scores from the *primary* quality scores, and not from the
modified quality scores. As can be appreciated, the use of modified quality
scores to determine specific author quality and average author quality will
introduce unbounded escalating feedback.

Having determined the modified quality score for each posting, it is then
entirely mechanical to determine the total quality of posts by that author in
any given time period (e.g. a month) as well as to derive the average quality
of posts by that author over the same time period.
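
The "entirely mechanical" step might look like the sketch below. The `monthly_author_quality` name and the `(author, month, score)` input shape are hypothetical.

```python
from collections import defaultdict


def monthly_author_quality(posts):
    """Total and average modified quality per (author, month).

    posts: iterable of (author, month, modified_score), month as "YYYY-MM".
    """
    buckets = defaultdict(list)
    for author, month, score in posts:
        buckets[(author, month)].append(score)
    return {
        key: {"total": sum(scores), "average": sum(scores) / len(scores)}
        for key, scores in buckets.items()
    }
```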

However, over a sufficiently long period, the average quality of posts over a
given period is likely to approximate the author-multiplier (as a relative but
not absolute scale). This makes it very difficult for anyone to
significantly lift their game quality-wise. So, the use of a moving average for
computing the author-multiplier will need (over time, but probably not
immediately) to incorporate an aging of older post quality data, probably based
on some kind of inverse Poisson differential decay. I'm not sure what chi-value
to use to stretch the probability curve, but I would think we'd be aiming for a
half-life of around 3 months, so probably something in the range of 1.5 to 2
should be OK (using base "e" not 10, of course).
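
As a hedged illustration of the aging idea, the sketch below swaps in ordinary exponential decay, which is not the "inverse Poisson differential decay" named above, and keeps only the 3-month half-life from the text. The function names and the `(age_months, primary_score)` input shape are hypothetical.

```python
import math

HALF_LIFE_MONTHS = 3.0  # the half-life suggested in the post


def decay_weight(age_months: float,
                 half_life: float = HALF_LIFE_MONTHS) -> float:
    """Weight of a post of the given age; it halves every `half_life` months."""
    return math.exp(-math.log(2) * age_months / half_life)


def aged_author_quality(scored_posts) -> float:
    """Decay-weighted average of primary scores for one author.

    scored_posts: iterable of (age_months, primary_score) pairs.
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for age_months, score in scored_posts:
        weight = decay_weight(age_months)
        weighted_sum += weight * score
        total_weight += weight
    if total_weight == 0.0:
        raise ValueError("no posts to average")
    return weighted_sum / total_weight
```

With a 3-month half-life in base e, the decay rate comes out at ln 2 / 3, roughly 0.23 per month, so a three-month-old post counts half as much as one posted today.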

So, while some people will argue that quantitative evaluation of quality is
fundamentally flawed no matter what interpretation of Chomsky conceptual
analysis you use, I say (quoting from The Matrix here) "never send a person to
do a machine's job". My method can be entirely automated (essential for ongoing
maintenance) and we could have aggregated ratings on each newsgroup, allowing
the LUGNET traffic page to be based on quality rather than quantity metrics.
Then when loc.au wants to take on the Italians (say), we have to do it
based on superior quality of postings and not just hitting send a lot.


My concern here with taking on the Italians is that they post in Italian.
Therefore any quality metric has to be multilingual. That goes without
saying, so that's presumably why Kerry didn't say it, but I needed to boost
my new/quoted material ratio, you know.

More important, though, is that the metric needs to take into account the
variance between languages in such areas as average word length, average
sentence length, average vocabulary and so forth, so that posts in German
don't unfairly score higher than they deserve because they use words like
"Hauptbahnhoffwurstyunge" where English would use "central station sausage
seller" in everyday usage.

Clearly English would be penalised here for using shorter words. Although it
should be noted, there would be some advantage from increased sentence
length, and a full statistical analysis would be needed to determine whether
this was a sufficient offset.


Indeed, increasing the quality ratings on a newsgroup can be achieved either
by sending many high-quality posts, or by discouraging the sending of
low-quality postings.

Yes, but how do you do this? Who bells the cat?

Thus some newsgroups could achieve a higher quality aggregate through
having no postings than most of the market newsgroups (which are likely to
generate large negative aggregates).

Waddya reckon?


Good start but this should have been posted to off-topic.geek...

++Lar (who in college once wrote a program to count average word length,
sentence length and word frequency, and who then fed the corpus of Genesis
lyrics in as well as the corpus of Foreigner lyrics, and, entirely
straight-faced, asserted to his linguistics class that, based on the
analysis (wider word usage, less repetition, longer sentence length),
Genesis was a more literate band than Foreigner, and who therefore has some
sympathy for this undertaking of Kerry's)


