Re: latin1 decoder implementation

From: Martin J. Dürst <duerst_at_it.aoyama.ac.jp>
Date: Sat, 17 Nov 2012 21:45:38 +0900

On 2012/11/17 9:45, Doug Ewell wrote:
> If he is targeting HTML5, then none of this matters, because HTML5 says
> that ISO 8859-1 is really Windows-1252.

Yes. But unless Python wants to limit its use to HTML5, this should be
handled on a separate level (mapping an "iso-8859-1" label to the
Windows-1252 decoder logic), not by trying to change ISO-8859-1 itself.
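
A minimal sketch of that separate level, in Python. The names
HTML5_LABEL_OVERRIDES and decode_with_label are hypothetical, invented
here for illustration; only the stock "latin-1" and "cp1252" codecs are
real. The ISO-8859-1 codec itself is left untouched, and only the
label lookup changes when the caller asks for HTML5 behaviour:

```python
# Hypothetical label table: in HTML5 mode these labels are redirected
# to the Windows-1252 decoder. The list is illustrative, not complete.
HTML5_LABEL_OVERRIDES = {
    "iso-8859-1": "cp1252",
    "iso8859-1": "cp1252",
    "latin-1": "cp1252",
    "latin1": "cp1252",
    "l1": "cp1252",
}


def decode_with_label(data: bytes, label: str, html5: bool = False) -> str:
    """Decode data according to a charset label.

    With html5=False the label is passed to Python's codec machinery
    unchanged. With html5=True the labels in the table above are
    redirected to cp1252 first.
    """
    codec = label.strip().lower()
    if html5:
        codec = HTML5_LABEL_OVERRIDES.get(codec, codec)
    # Caveat: Python's cp1252 leaves 0x81, 0x8D, 0x8F, 0x90 and 0x9D
    # undefined and raises UnicodeDecodeError on them, whereas the HTML5
    # decoder maps them to C1 controls. A real implementation would
    # need to handle those bytes; this sketch does not.
    return data.decode(codec)


# Byte 0x85 is NL (U+0085) in ISO-8859-1 but U+2026 in Windows-1252.
plain = decode_with_label(b"\x85", "ISO-8859-1")
web = decode_with_label(b"\x85", "ISO-8859-1", html5=True)
```

The point of the design is that the choice is made by the caller (the
protocol layer), not baked into the decoder registered under the
ISO-8859-1 name.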

Regards, Martin.

> For example, there is no C1 control called NL in Windows-1252. There is
> only 0x85, which maps to U+2026 HORIZONTAL ELLIPSIS.
>
> --
> Doug Ewell | Thornton, Colorado, USA
> http://www.ewellic.org | @DougEwell
>
>
> From: Philippe Verdy
> Sent: Friday, November 16, 2012 17:35
> To: Whistler, Ken
> Cc: Buck Golemon ; unicode_at_unicode.org
> Subject: Re: latin1 decoder implementation
>
>
> In fact not really, because Unicode DOES assign more precise semantics
> to a few of these controls, namely those given whitespace and newline
> properties (TAB, LF and CR in the C0 controls and NL in the C1
> controls, with a few additional constraints for the CR+LF sequence),
> as they are part of almost all plain-text protocols; NUL also has a
> specific behavior which is so common that it cannot be mapped to
> anything other than a terminator or separator of plain-text sequences.
>
> So even if the ISO/IEC 8859 standard does not specify a character
> mapping for the C0 and C1 controls, the registered MIME types do so
> (but for MIME purposes nothing is well defined for the C0 and C1
> controls except NUL, TAB, CR, LF and NL).
>
> And then yes, the ISO/IEC 8859 standard is different (more restrictive)
> from the MIME charsets defined by the IETF in some RFCs (and registered
> in the IANA registry), simply because the ISO/IEC standard (encoded
> charset) was developed to be compatible with various encoding schemes,
> some of them defined by ISO, others defined by other European or
> East Asian standards bodies (including 7-bit schemes using escape
> sequences or shift in/out controls).
>
> By itself, ISO/IEC 8859 is not a complete encoding scheme; it just
> defines several encoded character sets, independently of the
> encoding scheme used to store or transport them (it is not even
> sufficient to represent plain-text content).
>
> By contrast, the MIME "charsets" named "ISO_8859-*" registered by
> the IETF in the IANA registry are "concrete" encoding schemes, based on
> the ISO/IEC 8859 standard and suitable for representing plain-text
> content, because the MIME charsets also add a text presentation
> protocol.
>
> In practice, almost nobody today uses the ISO/IEC 8859 standard alone:
> there's always an additional concrete protocol added on top of it (which
> generally makes use of the C0 and C1 controls, but not necessarily, and
> not always the same way). So plain-text documents never use the ISO/IEC
> 8859 standard, but the MIME charsets (plus a few specific or proprietary
> charsets that have not been registered in the IANA registry as they are
> bound to a non-open protocol).
Received on Sat Nov 17 2012 - 06:47:48 CST