In lugnet.publish, Matthew Miller writes:
> Urg. It may not be obvious to those of you viewing this message with MS
> Windows, but the above message isn't ascii text (or ISO 8859-1 Latin-1,
> either -- even though the header claims it is!). It's Microsoft's
> non-standard [1] character set. This makes the message look pretty weird
> when viewed on a non-MS system -- all of the apostrophes show up as question
> marks (or don't show up at all).
>
> Since asking everyone to not use Microsoft products to read LUGnet is
> probably a bit harsh [2], Todd, how about automatically scanning for this
> and correcting it when messages are posted?
Hmmmm. I agree that it's pretty horrendous for plaintext, but I think that
so-called "smart quotes" are a pretty great thing for HTML when done
properly (as long as the correct standard character entities are output,
of course! :).
What currently happens in the web interface when someone views a message with
these is that they get mapped into HTML entities like this:
145 --> &#145;
146 --> &#146;
147 --> &#147;
148 --> &#148;
Unfortunately, those positions don't seem to be defined in HTML 3.2, so
they'll only show up "correctly" (meaning, as intended by the author of
the message) on non-MS systems when someone uses MS fonts or fonts with
equivalent character mappings.
I'm happy to see that HTML 4.0 defines[1] these...
&lsquo; <==> &#8216; (equivalent to 145)
&rsquo; <==> &#8217; (equivalent to 146)
&ldquo; <==> &#8220; (equivalent to 147)
&rdquo; <==> &#8221; (equivalent to 148)
...but I haven't tested these in popular browsers to see if they're worth
using yet. I switched from &#153; to &trade; for the TM symbol a while back
and that has worked well.
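For illustration, here is a minimal sketch of what a mapping onto the HTML 4.0 named entities could look like. It is not the code the web interface actually runs; the function name and table are made up for this example, and it assumes the message body arrives as raw bytes in Microsoft's character set.

```python
import html

# Hypothetical table: the four "smart quote" byte values discussed above,
# mapped to the HTML 4.0 named entities.
SMART_QUOTE_ENTITIES = {
    145: "&lsquo;",
    146: "&rsquo;",
    147: "&ldquo;",
    148: "&rdquo;",
}

def smart_quotes_to_entities(raw: bytes) -> str:
    """Replace bytes 145-148 with named entities; escape the rest for HTML."""
    out = []
    for b in raw:
        if b in SMART_QUOTE_ENTITIES:
            out.append(SMART_QUOTE_ENTITIES[b])
        else:
            # Every other byte is treated as its Latin-1 character and
            # HTML-escaped (&, <, >). Other bytes in the 128-159 range
            # are passed through untouched in this sketch.
            out.append(html.escape(chr(b), quote=False))
    return "".join(out)

print(smart_quotes_to_entities(b"\x93Hi\x94 & it\x92s"))
```

Because each of the four bytes gets its own entity, this direction loses no information: the original byte can always be recovered from the entity.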
> There's an already existing tool:
> <http://www.fourmilab.ch/webtools/demoroniser/>
> (That page also has more good info on the problem.)
I looked quickly at the source (admittedly, not a thorough scouring); it
looks like the mapping it applies is non-invertible, especially in the case
of 147 and 148. :-(
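To make "non-invertible" concrete, here is a small sketch. The table below is an assumed stand-in for that kind of conversion, not the demoroniser's actual source: it simply sends both 147 and 148 to the plain ASCII double quote.

```python
# Assumed lossy mapping (hypothetical, for illustration only):
# opening and closing double quotes both become the ASCII quote.
LOSSY = {147: '"', 148: '"'}

def flatten(raw: bytes) -> str:
    """Apply the lossy mapping; all other bytes are read as Latin-1."""
    return "".join(LOSSY.get(b, chr(b)) for b in raw)

opening = flatten(b"\x93")
closing = flatten(b"\x94")

# Two different inputs give the same output, so no inverse function
# can tell which byte a given '"' came from.
print(opening == closing)
```

Once a message has been through a mapping like this, the distinction between opening and closing quotes can only be guessed at from context.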
It may be better simply to reject MS-moronised messages altogether than to
attempt to convert them at the receiving end, because at least that way the
original meaning isn't destroyed. (Actually, I'm not in favor of either of
those options anywhere near as much as leaving the conversion up to each
individual client on-the-fly at display-time.)
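If rejecting were the route taken, the check itself could be quite small. The sketch below is only an assumption about how such a test might be written: it flags any byte in the 128-159 range, on the premise that a message honestly labelled ISO 8859-1 has no printable characters there. The function name is invented for this example.

```python
def has_c1_range_bytes(raw: bytes) -> bool:
    """True if any byte falls in 128-159 (0x80-0x9F), where MS smart quotes live."""
    return any(0x80 <= b <= 0x9F for b in raw)

print(has_c1_range_bytes(b"it\x92s"))                 # smart apostrophe present
print(has_c1_range_bytes("café".encode("latin-1")))   # ordinary Latin-1 text
```

The same test could just as well be used to decide when display-time conversion is needed, which leaves the stored message untouched.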
--Todd
[1] http://www.w3.org/TR/1998/REC-html40-19980424/sgml/entities.html