Re: latin1 decoder implementation

From: Martin J. Dürst <duerst_at_it.aoyama.ac.jp>
Date: Mon, 19 Nov 2012 18:57:57 +0900

On 2012/11/17 9:56, Philippe Verdy wrote:
> True. HTML5 makes its own reinterpretation of the IETF's MIME standard,
> defining its own protocol (which means that it is no longer fully
> compatible with MIME and its IANA database, because the value of a
> charset="" pseudo-attribute is not mapped directly to the IETF MIME
> standard, but to a newer range of W3C standards).
>
> There was a clear desire from the W3C to deprecate the use of the MIME
> standard and its IANA database in HTML, to simplify the implementations

There is no need to deprecate the use of MIME in order to simplify
implementations. No MIME-compatible implementation is required to accept
and understand all "charset"s defined in the IANA registry. There are
numerous MIME types that restrict the number of possible character
encodings to a small set, or only require implementation of very few of
them (XML would be a typical example).
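
As a rough illustration of that point, here is a minimal Python sketch of a
consumer that accepts only a small, fixed set of encodings and rejects every
other label, registered or not. The names SUPPORTED and decode_restricted are
made up for this example, and the set itself is only an assumption (UTF-8 and
UTF-16, the two encodings every XML processor has to support).

```python
import codecs

# Hypothetical whitelist: the only encodings this consumer agrees to decode.
SUPPORTED = {"utf-8", "utf-16"}


def decode_restricted(data: bytes, charset: str) -> str:
    """Decode data only if charset names one of the supported encodings."""
    try:
        # codecs.lookup() normalizes aliases, e.g. "UTF8" -> "utf-8".
        name = codecs.lookup(charset).name
    except LookupError:
        raise ValueError(f"unknown charset: {charset!r}")
    if name not in SUPPORTED:
        raise ValueError(f"charset not supported here: {charset!r}")
    return data.decode(name)


print(decode_restricted(b"caf\xc3\xa9", "UTF8"))
```

A label such as "iso-8859-1" is a perfectly valid registered charset, but this
sketch simply refuses it, which is all that "restricting the set" means.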

> (also to avoid the many incompatibilities that have occurred in the past
> with MIME charsets between the implementations).

That's the main motivation. One browser started to accept data in a form
that it shouldn't have accepted. Sloppy content producers started to
rely on this. Because the browser in question was the dominant browser,
other browsers had to try to reverse-engineer and follow that browser, or
just be ignored. The Encoding Spec is an attempt, hopefully successful,
to limit these incompatibilities to those that exist today, and not let
them increase further.
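
To make the kind of incompatibility concrete, here is a small Python sketch
of the byte 0x85 discussed in the quoted message below. A strict ISO 8859-1
reading maps it to the C1 control U+0085, while a Windows-1252 reading maps
it to U+2026. The helper decode_like_html5 and the label set LATIN1_LABELS
are made up for this example and cover only a few labels. Python's "cp1252"
codec is used as a stand-in; it rejects a handful of unassigned bytes such as
0x81, so this is an approximation and not the Encoding Spec's exact table.

```python
# Illustrative subset of labels that a browser-style decoder treats as
# Windows-1252 (an assumption for this sketch, not the full list).
LATIN1_LABELS = {"iso-8859-1", "iso_8859-1", "latin1"}


def decode_like_html5(data: bytes, label: str) -> str:
    """Hypothetical helper: decode Latin-1 labels as Windows-1252."""
    if label.strip().lower() in LATIN1_LABELS:
        return data.decode("cp1252")
    return data.decode(label)


raw = b"\x85"

strict = raw.decode("latin-1")                 # strict ISO 8859-1 reading
lenient = decode_like_html5(raw, "ISO-8859-1")  # browser-style reading

print(f"strict:  U+{ord(strict):04X}")
print(f"lenient: U+{ord(lenient):04X}")
```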

> Note also that the W3C
> does not automatically endorse the Unicode and ISO/IEC 10646 standards
> either (there's a delay before accepting newer releases of TUS and ISO/IEC
> 10646, and the W3C now frequently adds several restrictions).

Can you give examples? As far as I'm aware, the W3C has always tried to
make sure that e.g. new characters encoded in Unicode can be used as
soon as possible. There are some cases where this was missed in the
past (e.g. XML naming rules), but corrective action has since been taken.

Regards, Martin.

> 2012/11/17 Doug Ewell <doug_at_ewellic.org>
>
>> If he is targeting HTML5, then none of this matters, because HTML5 says
>> that ISO 8859-1 is really Windows-1252.
>>
>> For example, there is no C1 control called NL in Windows-1252. There is
>> only 0x85, which maps to U+2026 HORIZONTAL ELLIPSIS.
>>
>>
>> --
>> Doug Ewell | Thornton, Colorado, USA
>> http://www.ewellic.org | @DougEwell
>>
>>
>> From: Philippe Verdy
>> Sent: Friday, November 16, 2012 17:35
>> To: Whistler, Ken
>> Cc: Buck Golemon ; unicode_at_unicode.org
>>
>> Subject: Re: latin1 decoder implementation
>>
>>
>> In fact not really, because Unicode DOES assign more precise semantics to
>> a few of these controls, notably those given whitespace and newline
>> properties (TAB, LF, CR in the C0 controls and NL in the C1 controls, with
>> a few additional constraints for the CR+LF sequence), as they are part of
>> almost all plain-text protocols; NUL also has a specific behavior which is
>> so common that it cannot be mapped to anything other than a terminator or
>> separator of plain-text sequences.
>>
>> So even if the ISO/IEC 8859 standard does not specify a character mapping
>> for the C0 and C1 controls, the registered MIME types do (but nothing
>> is well defined for the C0 and C1 controls except NUL, TAB, CR, LF and NL,
>> for MIME usage purposes).
>>
>> And then yes, the ISO/IEC 8859 standard is different (more restrictive)
>> from the MIME charsets defined by the IETF in some RFCs (and registered in
>> the IANA registry), simply because the ISO/IEC standard (encoded charset)
>> was developed to be compatible with various encoding schemes, some of them
>> defined by ISO, some others defined by other European or East Asian
>> standards bodies (including 7-bit schemes, using escape sequences, or
>> shift in/out controls).
>>
>> By itself, ISO/IEC 8859 is not a complete encoding scheme; it just
>> defines several encoded character sets, independently of the encoding
>> scheme used to store or transport them (it is not even sufficient to
>> represent any plain-text content).
>>
>> By contrast, the MIME "charsets" named "ISO_8859-*" registered by the
>> IETF in the IANA registry are "concrete" encoding schemes, based on the
>> ISO/IEC 8859 standard, and suitable for representing plain-text content,
>> because the MIME charsets also add a text presentation protocol.
>>
>> In practice, almost nobody today uses the ISO/IEC 8859 standard alone:
>> there's always an additional concrete protocol added on top of it (which
>> generally makes use of the C0 and C1 controls, but not necessarily, and not
>> always the same way). So plain-text documents never use the ISO/IEC 8859
>> standard, but the MIME charsets (plus a few specific or proprietary
>> charsets that have not been registered in the IANA registry as they are
>> bound to a non-open protocol).
>>
>>
>
Received on Mon Nov 19 2012 - 04:00:34 CST
