Re: latin1 decoder implementation from Philippe Verdy on 2012-11-16 (Unicode Mail List Archive)

From: Philippe Verdy <verdy_p_at_wanadoo.fr>
Date: Sat, 17 Nov 2012 01:35:09 +0100

In fact not really, because Unicode DOES assign more precise semantics to a
few of these controls, notably those given whitespace and newline
properties (TAB, LF, CR in the C0 controls and NL in the C1 controls, with
a few additional constraints for the CR+LF sequence), as they are part of
almost all plain-text protocols. NUL also has a specific behavior which is
so common that it cannot be mapped to anything other than a terminator or
separator of plain-text sequences.
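
As a rough illustration of these properties, here is a minimal Python
sketch. It uses str.isspace() and str.splitlines() as approximations of the
whitespace and newline behavior (these are Python's own definitions, not
the normative Unicode property data), and the names CONTROLS and describe
are only illustrative.

```python
import unicodedata

# The controls singled out above; NEL (U+0085) is the C1 "NL" control.
CONTROLS = {"NUL": "\x00", "TAB": "\t", "LF": "\n", "CR": "\r", "NEL": "\x85"}


def describe(ch: str) -> dict:
    """Summarize how Python's Unicode support treats one control character."""
    return {
        # General category; every C0/C1 control is "Cc".
        "category": unicodedata.category(ch),
        # Python's notion of whitespace (an approximation of White_Space).
        "whitespace": ch.isspace(),
        # True when the character alone separates two lines.
        "line_break": len(("a" + ch + "b").splitlines()) == 2,
    }


report = {name: describe(ch) for name, ch in CONTROLS.items()}

# CR+LF is treated as one line break, not two.
crlf_lines = "a\r\nb".splitlines()

for name, info in report.items():
    print(name, info)
```

Under these assumptions, TAB is reported as whitespace but not as a line
break, LF, CR and NEL as both, and NUL as neither.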

So even if the ISO/IEC 8859 standard does not specify a character mapping
for the C0 and C1 controls, the registered MIME types do so (but for MIME
purposes nothing is well defined for the C0 and C1 controls except NUL,
TAB, CR, LF and NL).

And then yes, the ISO/IEC 8859 standard is different (more restrictive)
from the MIME charsets defined by the IETF in some RFCs (and registered in
the IANA registry), simply because the ISO/IEC standard (encoded character
set) was developed to be compatible with various encoding schemes, some of
them defined by ISO, some others defined by other European or East-Asian
standards bodies (including 7-bit schemes using escape sequences or
shift in/out controls).

By itself, ISO/IEC 8859 is not a complete encoding scheme: it just
defines several encoded character sets, independently of the encoding
scheme used to store or transport them (it is not even sufficient to
represent any plain-text content).
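
To make the restriction concrete, here is a hypothetical strict decoder
that accepts only the graphic characters of ISO/IEC 8859-1 (bytes 0x20-0x7E
and 0xA0-0xFF) and rejects everything in the C0 and C1 ranges. The function
name decode_graphic_only is an illustration only, not an existing API.

```python
def decode_graphic_only(data: bytes) -> str:
    """Decode using only the graphic characters of ISO/IEC 8859-1.

    Bytes in the C0 range (0x00-0x1F), DEL (0x7F) and the C1 range
    (0x80-0x9F) are rejected, since the standard alone assigns them nothing.
    """
    out = []
    for offset, byte in enumerate(data):
        if 0x20 <= byte <= 0x7E or 0xA0 <= byte <= 0xFF:
            # Graphic characters share their value with the Unicode code point.
            out.append(chr(byte))
        else:
            raise ValueError(
                f"byte 0x{byte:02X} at offset {offset} is not a graphic character"
            )
    return "".join(out)


word = decode_graphic_only(b"caf\xe9")  # graphic characters only: accepted

try:
    decode_graphic_only(b"line1\nline2")  # LF is a C0 control: rejected
    newline_accepted = True
except ValueError:
    newline_accepted = False
```

With such a decoder even a two-line text fails, because the line separator
itself is outside what the standard defines.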

By contrast, the MIME "charsets" named "ISO_8859-*" registered by the
IETF in the IANA registry are "concrete" encoding schemes, based on the
ISO/IEC 8859 standard and suitable for representing plain-text content,
because the MIME charsets also add a text presentation protocol.
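
As one example of such a concrete scheme, Python's built-in codec for the
charset name "iso-8859-1" decodes every byte, including the C0 and C1
ranges, to the Unicode code point of equal value. The short check below
only shows the behavior of that particular codec; other implementations
may make different choices.

```python
# Decode all 256 possible byte values with Python's built-in codec.
all_bytes = bytes(range(256))
text = all_bytes.decode("iso-8859-1")

# Each byte maps to the code point with the same numeric value.
code_points = [ord(ch) for ch in text]

# Encoding the result gives back the original bytes.
round_trip = text.encode("iso-8859-1")

# A C1 byte such as 0x85 becomes U+0085 rather than an error.
nel = b"\x85".decode("iso-8859-1")

print(code_points == list(range(256)), round_trip == all_bytes, repr(nel))
```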

In practice, almost nobody today uses the ISO/IEC 8859 standard alone:
there's always an additional concrete protocol added on top of it (which
generally makes use of the C0 and C1 controls, but not necessarily, and not
always in the same way). So plain-text documents never use the ISO/IEC 8859
standard itself, but the MIME charsets (plus a few specific or proprietary
charsets that have not been registered in the IANA registry because they
are bound to a non-open protocol).

2012/11/16 Whistler, Ken <ken.whistler_at_sap.com>

> No Unicode doesn’t. But yes, it **does** follow that decoding C0/C1
> control codes produces a Unicode code point of equal value. RTFM. TUS 6.2,
> p. 544:
>
> “There are 65 code points set aside in the Unicode Standard for
> compatibility with the C0 and C1 control codes defined in the ISO/IEC 2022
> framework. … The Unicode Standard provides for the intact interchange of
> these code points, neither adding to nor subtracting from their semantics.
> The semantics of the control codes are generally determined by the
> application with which they are used. However, in the absence of specific
> application uses, they may be interpreted according to the control function
> semantics specified in ISO/IEC 6429:1992.”
>
> --Ken
>
> latin1 explicitly doesn't define characters (or control codes) in those
> ranges, but unicode does.
>
> It doesn't directly follow that decoding a byte in those undefined ranges
> produces a unicode-point of equal value.
>
Received on Fri Nov 16 2012 - 18:36:26 CST
