Re: MES instead of ISO 8859-nn

From: Markus G. Kuhn (kuhn@cs.purdue.edu)
Date: Fri Jul 04 1997 - 16:14:50 EDT


Jonathan Rosenne wrote on 1997-07-04 19:08 UTC:
> >Some software makers have been ignorant in the past, but they
> >have caught up. If you think this one hasn't, please tell me
> >their name in private, and I will contact them.
>
> This would be allowed if the HTML charset will be coded correctly as
> CP1250. I guess authoring tools will gradually get over producing a
> misleading 8859-1 specification, which many do now.
>
> A note to authoring tools producers: If you do not know for sure that it is
> 8859-1 don't produce this charset specification. Either get the correct
> data from the operating system or ask the user, and if this is not possible
> it is better you do nothing!

No no no no no!!!!

Please don't spread such wrong advice!!!

CP1252 is not an IANA registered MIME charset (and I hope it never will be).
If you do not announce anything in HTTP, the default is ISO 8859-1.
Numeric Character References have to be interpreted as ISO 10646-1 codes
only. See <ftp://ds.internic.net/rfc/rfc1866.txt>, read the last sentence
of section 1.2.1: "[...] numeric character references agree with
[ISO-10646] regardless of how the document is encoded."

If you see numeric character references in the range 128-159, as on
<http://www.msnbc.com/news/83531.asp>:

  &#0147;Robotic exploration,&#0148; he says, &#0147;is just the first step.

then this is simply illegal HTML.

The only way to handle this correctly is to use the appropriate
ISO 10646 numeric character references here (or to map the quotes to the
normal ASCII quotes).

The Unicode characters corresponding to the CP1252 characters in the
range 128-159 are:

0x80 0x0080 #NOT USED
0x81 0x0081 #NOT USED
0x82 0x201a #SINGLE LOW-9 QUOTATION MARK
0x83 0x0192 #LATIN SMALL LETTER F WITH HOOK
0x84 0x201e #DOUBLE LOW-9 QUOTATION MARK
0x85 0x2026 #HORIZONTAL ELLIPSIS
0x86 0x2020 #DAGGER
0x87 0x2021 #DOUBLE DAGGER
0x88 0x02c6 #MODIFIER LETTER CIRCUMFLEX ACCENT
0x89 0x2030 #PER MILLE SIGN
0x8a 0x0160 #LATIN CAPITAL LETTER S WITH CARON
0x8b 0x2039 #SINGLE LEFT-POINTING ANGLE QUOTATION MARK
0x8c 0x0152 #LATIN CAPITAL LIGATURE OE
0x8d 0x008d #NOT USED
0x8e 0x008e #NOT USED
0x8f 0x008f #NOT USED
0x90 0x0090 #NOT USED
0x91 0x2018 #LEFT SINGLE QUOTATION MARK
0x92 0x2019 #RIGHT SINGLE QUOTATION MARK
0x93 0x201c #LEFT DOUBLE QUOTATION MARK
0x94 0x201d #RIGHT DOUBLE QUOTATION MARK
0x95 0x2022 #BULLET
0x96 0x2013 #EN DASH
0x97 0x2014 #EM DASH
0x98 0x02dc #SMALL TILDE
0x99 0x2122 #TRADE MARK SIGN
0x9a 0x0161 #LATIN SMALL LETTER S WITH CARON
0x9b 0x203a #SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
0x9c 0x0153 #LATIN SMALL LIGATURE OE
0x9d 0x009d #NOT USED
0x9e 0x009e #NOT USED
0x9f 0x0178 #LATIN CAPITAL LETTER Y WITH DIAERESIS
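
As a rough illustration of how the table above could be applied to
already published pages, here is a minimal Python sketch that rewrites
decimal numeric character references in the range 128-159 to the
ISO 10646 values listed. The names CP1252_TO_UCS and repair_ncrs are made
up for this example, only the positions the table marks as used are
touched, and hexadecimal references are not handled.

```python
import re

# CP1252 byte value -> ISO 10646 code, copied from the table above.
# Positions marked "NOT USED" are deliberately left out.
CP1252_TO_UCS = {
    0x82: 0x201A, 0x83: 0x0192, 0x84: 0x201E, 0x85: 0x2026,
    0x86: 0x2020, 0x87: 0x2021, 0x88: 0x02C6, 0x89: 0x2030,
    0x8A: 0x0160, 0x8B: 0x2039, 0x8C: 0x0152,
    0x91: 0x2018, 0x92: 0x2019, 0x93: 0x201C, 0x94: 0x201D,
    0x95: 0x2022, 0x96: 0x2013, 0x97: 0x2014, 0x98: 0x02DC,
    0x99: 0x2122, 0x9A: 0x0161, 0x9B: 0x203A, 0x9C: 0x0153,
    0x9F: 0x0178,
}

# Matches decimal references such as &#147; or &#0147;
_DECIMAL_NCR = re.compile(r"&#([0-9]+);")


def repair_ncrs(html_text: str) -> str:
    """Rewrite references like &#0147; to their ISO 10646 form (&#8220;)."""

    def fix(match: re.Match) -> str:
        code = int(match.group(1))
        target = CP1252_TO_UCS.get(code)
        # Anything not in the table (including valid references) is kept as-is.
        return match.group(0) if target is None else f"&#{target};"

    return _DECIMAL_NCR.sub(fix, html_text)


if __name__ == "__main__":
    print(repair_ncrs("&#0147;Robotic exploration,&#0148; he says"))
```

Applied to the MSNBC line quoted earlier, this turns &#0147; and &#0148;
into &#8220; and &#8221; and leaves every other reference alone.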

All that authors of tools that produce HTML files under Windows have to do
is to replace a byte 0x93 found in the input Winword file with the correct
Unicode value 0x201c, written as the decimal numeric character reference
&#8220; in the output file, and not with &#147; as in the example above,
which many do now. Netscape Communicator will then still show the correct
character on the screen, and not only on Microsoft platforms.
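
For tool authors, the conversion step might look roughly like the Python
sketch below. It is only an illustration under some assumptions: the
function name cp1252_to_ascii_html is invented here, and it relies on
Python's built-in "cp1252" codec, whose mapping is newer than the table in
this message (it assigns 0x80, 0x8e and 0x9e, and raises an error on the
positions it leaves undefined).

```python
import html


def cp1252_to_ascii_html(raw: bytes) -> str:
    """Turn CP1252 input bytes into pure-ASCII HTML text.

    Every non-ASCII character is written as a decimal numeric character
    reference carrying its ISO 10646 code, e.g. byte 0x93 -> &#8220;.
    """
    # Decode with Python's cp1252 codec (an assumption, see the note above).
    text = raw.decode("cp1252")
    # Escape the markup characters &, < and > first.
    escaped = html.escape(text, quote=False)
    # "xmlcharrefreplace" emits &#NNNN; for anything outside ASCII.
    return escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")


if __name__ == "__main__":
    print(cp1252_to_ascii_html(b"\x93Robotic exploration,\x94 he says"))
```

Because the output is plain ASCII, it reads the same whether the receiving
browser assumes ISO 8859-1 or anything else ASCII-compatible, which is the
point of using the ISO 10646 references.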

Markus

-- 
Markus G. Kuhn, Computer Science grad student, Purdue
University, Indiana, USA -- email: kuhn@cs.purdue.edu


