Re: latin1 decoder implementation from Doug Ewell on 2012-11-16 (Unicode Mail List Archive)

From: Doug Ewell <doug_at_ewellic.org>
Date: Fri, 16 Nov 2012 17:45:08 -0700

If he is targeting HTML5, then none of this matters, because HTML5 says
that ISO 8859-1 is really Windows-1252.

For example, there is no C1 control called NL in Windows-1252. There is
only 0x85, which maps to U+2026 HORIZONTAL ELLIPSIS.
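
A minimal sketch of that difference, assuming Python's built-in
"latin-1" and "cp1252" codecs as stand-ins for the two mappings (the
variable names are illustrative only):

```python
import unicodedata

raw = b"\x85"

# Python's "latin-1" codec maps every byte to the code point with the
# same value, so 0x85 becomes the C1 control U+0085 (NL / NEL).
as_latin1 = raw.decode("latin-1")

# "cp1252" is Python's name for Windows-1252; there 0x85 is a graphic
# character rather than a control.
as_cp1252 = raw.decode("cp1252")

print(f"latin-1: U+{ord(as_latin1):04X}")
print(f"cp1252:  U+{ord(as_cp1252):04X} {unicodedata.name(as_cp1252)}")
# latin-1: U+0085
# cp1252:  U+2026 HORIZONTAL ELLIPSIS
```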

--
Doug Ewell | Thornton, Colorado, USA
http://www.ewellic.org | @DougEwell

From: Philippe Verdy
Sent: Friday, November 16, 2012 17:35
To: Whistler, Ken
Cc: Buck Golemon ; unicode_at_unicode.org
Subject: Re: latin1 decoder implementation

In fact not really, because Unicode DOES assign more precise semantics
to a few of these controls, notably those given whitespace and newline
properties (TAB, LF and CR among the C0 controls, and NL among the C1
controls, with a few additional constraints for the CR+LF sequence), as
they are part of almost all plain-text protocols. NUL also has a
specific behavior which is so common that it cannot be mapped to
anything other than a terminator or separator of plain-text sequences.
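
As a rough illustration of those properties, here is a small sketch
that assumes Python's str methods as one implementation of the Unicode
whitespace and line-break behavior (other libraries may differ):

```python
# The controls singled out above, by their usual short names.
controls = {"TAB": "\t", "LF": "\n", "CR": "\r", "NL": "\x85"}

# str.isspace() reports whether a character counts as whitespace.
whitespace = {name: ch.isspace() for name, ch in controls.items()}
print(whitespace)

# str.splitlines() treats LF, CR, the CR+LF pair and NL (U+0085) as
# line boundaries; CR+LF counts as a single break, not two.
lines = "one\r\ntwo\x85three\nfour".splitlines()
print(lines)

# NUL, by contrast, is neither whitespace nor a line break here.
print("\x00".isspace(), "a\x00b".splitlines())
```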

So even if the ISO/IEC 8859 standard does not specify a character
mapping for the C0 and C1 controls, the registered MIME types do so
(although, for MIME purposes, nothing is well defined for the C0 and C1
controls except NUL, TAB, CR, LF and NL).

And then yes, the ISO/IEC 8859 standard is different (more restrictive)
from the MIME charsets defined by the IETF in some RFCs (and registered
in the IANA registry), simply because the ISO/IEC standard (an encoded
character set) was developed to be compatible with various encoding
schemes, some of them defined by ISO, others defined by European or
East Asian standards bodies (including 7-bit schemes using escape
sequences or shift in/out controls).

By itself, ISO/IEC 8859 is not a complete encoding scheme; it just
defines several encoded character sets, independently of the encoding
scheme used to store or transport them (it is not even sufficient to
represent any plain-text content).

In contrast, the MIME "charsets" named "ISO_8859-*", registered by the
IETF in the IANA registry, are "concrete" encoding schemes based on the
ISO/IEC 8859 standard and suitable for representing plain-text content,
because the MIME charsets also add a text presentation protocol.
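
The distinction can be sketched as two decoders. This is only an
illustration under stated assumptions: the function names are
hypothetical, the "graphic only" ranges (0x20-0x7E and 0xA0-0xFF) stand
for the part that ISO/IEC 8859-1 itself assigns, and Python's "latin-1"
codec stands in for the MIME charset, which passes C0 and C1 bytes
through as controls.

```python
# Byte values to which ISO/IEC 8859-1 itself assigns graphic characters
# (assumption for this sketch: 0x20-0x7E and 0xA0-0xFF).
GRAPHIC = set(range(0x20, 0x7F)) | set(range(0xA0, 0x100))


def decode_graphic_only(data: bytes) -> str:
    """Hypothetical strict decoder: reject bytes outside the graphic ranges."""
    for offset, byte in enumerate(data):
        if byte not in GRAPHIC:
            raise ValueError(
                f"byte 0x{byte:02X} at offset {offset} is not a graphic character"
            )
    return data.decode("latin-1")


def decode_mime_latin1(data: bytes) -> str:
    """Hypothetical MIME-style decoder: every byte maps to the same code point."""
    return data.decode("latin-1")


text = b"caf\xe9\r\nline two\x85"
print(repr(decode_mime_latin1(text)))  # controls pass through unchanged

try:
    decode_graphic_only(text)
except ValueError as exc:
    print("strict decoder:", exc)
```

The strict variant fails on the first CR, which shows why the character
set alone cannot carry a multi-line plain-text document.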

In practice, almost nobody today uses the ISO/IEC 8859 standard alone:
there is always an additional concrete protocol added on top of it (which
generally makes use of the C0 and C1 controls, but not necessarily, and
not always in the same way). So plain-text documents never use the
ISO/IEC 8859 standard itself, but the MIME charsets (plus a few specific
or proprietary charsets that have not been registered in the IANA
registry because they are bound to a non-open protocol).

Received on Fri Nov 16 2012 - 18:46:28 CST
