Subject: Re: ???Question???
Newsgroups: lugnet.publish, lugnet.admin.general
Date: Sat, 20 May 2000 03:36:49 GMT
Reply-To: mattdm@mattdm.org[antispam]
  
Todd Lehman <lehman@javanet.com> wrote:
  145 --> &#145;
  146 --> &#146;
  147 --> &#147;
  148 --> &#148;
Unfortunately, those positions don't seem to be defined in HTML 3.2, so
they'll only show up "correctly" (meaning, as intended by the author of
the message) on non-MS systems when someone uses MS fonts or fonts with
equivalent character mappings.

My understanding is that the ISO Latin 1 8-bit character set reserves those
characters (among others) for control codes. I can't actually check, because
the standard isn't available online (paper version costs about 56 CHF....).
But this is certainly the case for Unicode. Those characters are:

145 -> Private use one
146 -> Private use two
147 -> Set Transmit State
148 -> Cancel Character

(Ref: <http://charts.unicode.org/PDF/U0080.pdf>)
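
A minimal sketch of the difference, assuming Python 3 and its built-in "latin-1" and "cp1252" codecs (the function name `describe` is made up for illustration). It decodes each of the four byte values both ways and reports what comes out:

```python
import unicodedata

def describe(byte_value: int) -> dict:
    """Decode one byte as ISO 8859-1 and as Windows code page 1252."""
    raw = bytes([byte_value])
    latin1 = raw.decode("latin-1")  # ISO 8859-1: byte N becomes code point N
    cp1252 = raw.decode("cp1252")   # Windows-1252: 145-148 become curly quotes
    return {
        "latin1_codepoint": ord(latin1),
        "latin1_category": unicodedata.category(latin1),  # "Cc" = control character
        "cp1252_codepoint": ord(cp1252),
        "cp1252_name": unicodedata.name(cp1252),
    }

for b in (145, 146, 147, 148):
    print(b, describe(b))
```

Under the Latin-1 reading all four bytes land in the C1 control range; under the Windows reading they become the quotation marks at U+2018, U+2019, U+201C and U+201D.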

I'm happy to see that HTML 4.0 defines[1] these...
  &lsquo;  <==>   &#8216;   (equivalent to 145)
  &rsquo;  <==>   &#8217;   (equivalent to 146)
  &ldquo;  <==>   &#8220;   (equivalent to 147)
  &rdquo;  <==>   &#8221;   (equivalent to 148)

(These are the same as the Unicode code points: <http://charts.unicode.org/PDF/U2000.pdf>.)
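
As a rough illustration of that mapping, here is a sketch that turns Windows-1252 bytes into the HTML 4.0 named entities. It assumes Python 3 and its standard `html.entities.codepoint2name` table; the function name `cp1252_to_entities` is hypothetical, and this is not the converter discussed in this thread.

```python
from html.entities import codepoint2name

def cp1252_to_entities(raw: bytes) -> str:
    """Decode Windows-1252 bytes and replace non-ASCII characters with entities.

    ASCII is passed through untouched, so this does not escape & < > and is
    not a general HTML escaper. Bytes that Windows-1252 leaves undefined
    (such as 0x81) raise UnicodeDecodeError.
    """
    out = []
    for ch in raw.decode("cp1252"):
        cp = ord(ch)
        if cp < 128:
            out.append(ch)                        # plain ASCII, keep as is
        elif cp in codepoint2name:
            out.append(f"&{codepoint2name[cp]};")  # named entity, e.g. &ldquo;
        else:
            out.append(f"&#{cp};")                # numeric fallback
    return "".join(out)

print(cp1252_to_entities(b"\x93Hi\x94"))  # &ldquo;Hi&rdquo;
```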

I looked quickly at the source (admittedly, not a thorough scouring); it
looks like the mapping it applies is non-invertible, especially in the case
of 147 and 148.  :-(
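
To make "non-invertible" concrete, here is a small sketch. The `LOSSY` table is an assumption made up for illustration (the actual table in the source being discussed may differ); it folds both double-quote positions onto the ASCII double quote, which is the kind of collision described above.

```python
# Hypothetical lossy table: NOT copied from any real converter.
LOSSY = {145: "`", 146: "'", 147: '"', 148: '"'}

# Entity-based table, using the HTML 4.0 names quoted earlier in this message.
ENTITY = {145: "&lsquo;", 146: "&rsquo;", 147: "&ldquo;", 148: "&rdquo;"}

def is_invertible(mapping: dict) -> bool:
    """A mapping can be undone only if no two inputs share an output."""
    values = list(mapping.values())
    return len(set(values)) == len(values)

def collisions(mapping: dict) -> dict:
    """Return each output that is produced by more than one input."""
    seen = {}
    for key, value in mapping.items():
        seen.setdefault(value, []).append(key)
    return {value: keys for value, keys in seen.items() if len(keys) > 1}

print(is_invertible(LOSSY), collisions(LOSSY))
print(is_invertible(ENTITY), collisions(ENTITY))
```

Even where the table itself has no collisions, mapping onto plain ASCII still collides with quote characters that were already in the text, so the original bytes cannot be recovered from the output alone.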

Mapping them to the Unicode entities may be preferable. They work for me in
Navigator 4.7 on Win98 (I'll have to wait until I get home to test on Linux).
But even if they don't work on some platforms yet, at least the breakage is
because the client isn't yet up to standards.

On the news side -- NNTP is technically 7-bit ASCII, but almost always is
8-bit clean, and people certainly treat it that way. RFC 2130 (is there a
more recent document on this topic?) suggests that news messages specify the
character set they are using in the header -- unfortunately, MS products
actually *lie*.

It may be better simply to reject MS-moronised messages altogether than to
attempt to convert them at the receiving end, because at least that way the
original meaning isn't destroyed.  (Actually, I'm not in favor of either of
those options anywhere near as much as leaving the conversion up to each
individual client on-the-fly at display-time.)

How is the client supposed to know that it is to do conversion?

One partial fix would be to correct wrong headers to say "MS-Latin-1"....
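
A sketch of what such a header correction could look like, assuming Python 3. The function name `corrected_charset` is hypothetical, and the check is only a heuristic: bytes 0x80-0x9F are control codes in ISO 8859-1 and essentially never appear in real news text, so their presence in a message labelled iso-8859-1 suggests the body is really Windows code page 1252 (registered with IANA as "windows-1252").

```python
import re

# Bytes that are C1 control codes in ISO 8859-1 but mostly printable in Windows-1252.
C1_RANGE = re.compile(rb"[\x80-\x9f]")

def corrected_charset(declared: str, body: bytes) -> str:
    """Return a more plausible charset label for a message body.

    Heuristic only: relabels iso-8859-1 as windows-1252 when the body
    contains bytes in the 0x80-0x9F range; otherwise trusts the header.
    """
    if declared.lower() == "iso-8859-1" and C1_RANGE.search(body):
        return "windows-1252"
    return declared

print(corrected_charset("ISO-8859-1", b"\x93smart quotes\x94"))  # windows-1252
print(corrected_charset("ISO-8859-1", b"caf\xe9"))                # ISO-8859-1
```

With a corrected label in place, a client has something to act on at display time; the body bytes themselves are left untouched.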



--
Matthew Miller                      --->                  mattdm@mattdm.org
Quotes 'R' Us                     --->               http://quotes-r-us.org/
Boston University Linux             --->                http://linux.bu.edu/


