RE: Undefined code positions in 8-bit character sets

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sat May 10 2008 - 05:24:37 CDT

Next message: Jeroen Ruigrok van der Werven: "Siddham"

Previous message: Philippe Verdy: "RE: Google posting about U5.1"
In reply to: Andreas Prilop: "Undefined code positions in 8-bit character sets"

> -----Original Message-----
> From: unicode-bounce@unicode.org
> [mailto:unicode-bounce@unicode.org] On Behalf Of Andreas Prilop
> Sent: Monday, May 5, 2008 17:31
> To: unicode@unicode.org
> Subject: Undefined code positions in 8-bit character sets
>
> I refer to
> http://www.unicode.org/Public/MAPPINGS/ISO8859/8859-1.TXT
>
> http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/
> CP1252.TXT
>
> In ISO-8859-1, code position 0x90 is mapped to U+0090.
> In Windows-1252, code position 0x90 is listed as "undefined".
>
> Why are they treated differently?

Windows codepages have never defined any C1 control in positions 80-9F.
These were always reserved in all versions of these codepages for extensions
to map "graphical" characters; so initially most of them were undefined
until they were later assigned to characters. If they had been assigned to
C1 controls, they would no longer be available for these extensions.
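
To see this concretely, here is a minimal sketch that probes the 80-9F range.
It assumes that Python's built-in "cp1252" and "latin-1" codecs follow the
two mapping files cited in the question; the helper name is made up for the
illustration.

```python
# Sketch: which byte values in 0x80-0x9F does a codec refuse to decode?
def undefined_positions(encoding, start=0x80, stop=0x9F):
    """Return the byte values in [start, stop] that have no mapping."""
    missing = []
    for value in range(start, stop + 1):
        try:
            bytes([value]).decode(encoding)
        except UnicodeDecodeError:
            missing.append(value)
    return missing

cp1252_gaps = undefined_positions("cp1252")
latin1_gaps = undefined_positions("latin-1")

# Positions that ARE assigned in cp1252 map to graphic characters,
# never to the C1 controls U+0080..U+009F.
cp1252_assigned = {
    value: bytes([value]).decode("cp1252")
    for value in range(0x80, 0xA0)
    if value not in cp1252_gaps
}

print("cp1252 undefined:", [hex(v) for v in cp1252_gaps])
print("latin-1 undefined:", [hex(v) for v in latin1_gaps])
```

With these codecs, "latin-1" should report no gaps at all, while "cp1252"
reports a handful of positions (0x90 among them) and assigns the rest to
graphic characters.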

> International Standard ISO/IEC 8859-1 does *not* define code
> position 0x90. So it might also be listed as "undefined".

Yes, but ISO 8859 does not define any such mapping in any of its parts: this
was done to stay compatible with other transport or presentation protocols.
It does not formally define a physical encoding, so ISO 8859 text may be
transported over 7-bit protocols (for example using SS2/SS3 control
sequences or other ISO 2022 compatible encodings).

The IANA registration of the "iso-8859-1" encoding in fact defines an
encoding transformation from two standards: the ISO/IEC 8859-* graphic
characters and the C0 and C1 control sets are merged into a single 8-bit
encoding.

By contrast, the IANA registrations of the Windows codepages are exactly the
Windows codepages themselves: there is no merging, and nothing is provided
to offer compatibility with other transport or presentation protocols, so
they support only the 8-bit serialization.

There's a difference between the coded character set and the IANA encoding,
which in fact merges several layers: the code mapping and the serialization
into a stream of bytes.

> Or, for purely practical reasons, 0x90 in Windows-1252 might
> also be mapped to U+0090.

This can only be a fallback mapping, and it is non-standard. The mapping
may change at any time. Formally the position is still undefined (and there
is no sign now that it will be assigned later), and applications may use
other application-specific fallbacks (including U+FFFD, or no mapping at
all, raising an exception in the decoder).
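
The choices listed above can be sketched as follows. This is only an
illustration using Python's codec error handlers; the handler name
"same-value" and both function names are invented for the example, and the
same-value fallback is exactly the kind of non-standard mapping discussed
here.

```python
import codecs

def same_value_fallback(error):
    """Hypothetical fallback: turn each undecodable byte into the code point
    of the same value (so 0x90 becomes U+0090). Non-standard by design."""
    if not isinstance(error, UnicodeDecodeError):
        raise error
    bad_bytes = error.object[error.start:error.end]
    return "".join(chr(b) for b in bad_bytes), error.end

# "same-value" is a made-up handler name registered for this sketch only.
codecs.register_error("same-value", same_value_fallback)

def decode_cp1252(raw, policy="strict"):
    """Decode as Windows-1252, letting the caller pick the fallback policy."""
    return raw.decode("cp1252", errors=policy)

sample = b"caf\xe9 \x90"
for policy in ("replace", "ignore", "same-value"):
    print(policy, repr(decode_cp1252(sample, policy)))

try:
    decode_cp1252(sample)  # default policy: report the undefined position
except UnicodeDecodeError as exc:
    print("strict:", exc.reason)
```

The point of passing the policy in as a parameter is that the decoder itself
stays neutral: the application decides whether 0x90 is an error, a
replacement character, or something else.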

For example, the Java "Charset" decoder raises an exception on this byte. It
is normally preferable to raise an exception or error on this position,
letting the application choose what to do about the undefined position (for
example, such a code suggests that the stream is in fact not encoded with
Windows-1252, and another encoding should be tried).
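
A rough sketch of that "try another encoding" strategy, written in Python
rather than Java; the function name and the candidate order are assumptions
chosen for the example, not a recommendation.

```python
def decode_first_match(raw, candidates=("cp1252", "latin-1")):
    """Try each candidate encoding strictly, in order.

    Returns (encoding_name, text) for the first candidate that accepts
    every byte; an undefined position rules a candidate out.
    """
    for encoding in candidates:
        try:
            return encoding, raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # undefined position: this is probably not the encoding
    raise ValueError("no candidate encoding accepts this byte stream")

print(decode_first_match(b"10 \x80"))    # accepted by the first candidate
print(decode_first_match(b"abc \x90"))   # 0x90 rules out Windows-1252
```

This only works because the strict decoder reports the undefined position
instead of silently substituting something for it.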

No conforming document can contain a 0x90 byte if it claims to be encoded
with Windows-1252. By contrast, on the web, the "iso-8859-1" IANA
registration is the standard one for the HTTP protocol, and for tagging
documents such as HTML or XML internally in some attribute or in HTTP
presentation headers, so it effectively maps the C1 controls (which is
standard there).

If you look into what Windows effectively does, there are two distinct
implementations. One is found in the Win32 API that performs "ANSI to OEM"
conversions or the reverse; at the same time the Win32 API provides a way
to customize the behavior for undefined code positions: the fallback is
configurable, and it may be a default character such as the question mark
"?", a "do-nothing" option (leaving the code "unchanged") to avoid
exceptions, or an error status during the conversion, raising an exception.
Another conversion API can be used to perform "Multibyte to Unicode"
conversions (or the reverse). A better conversion API is provided by the
.Net libraries (which support many more charsets, mappings and conversions),
in a way that is compatible across versions of Windows (the Win32
conversions are much more limited and are not extensible).

So it is correct to have a mapping for ISO-8859-1 that maps 0x90 to the C1
control U+0090, and no mapping at all for the Windows codepage, where no
provision was made to allow C1 controls, notably when these mappings are
used for charset identification with IANA-registered names, for tagging web
content in HTTP headers or in HTML, SGML or XML attributes (or in the
pseudo-attribute of a document declaration tag).

It is also interesting to look at how the ISO-8859-x and Windows-12xx
codepages are remapped to EBCDIC-compatible codepages for roundtrip
reversibility: there exists a full remapping of codes for ISO-8859-x
(including C0 and C1 controls) to EBCDIC (with full reversibility in both
directions), but only a partial remapping for Windows-12xx codepages (or
several EBCDIC variants, treating the Windows-12xx 80-9F range
differently). Most of these EBCDIC codepages don't have any IANA
registration with a standard identifier, so they are not intended for data
interchange in a heterogeneous networking environment. You may have to look
through a very long list of IBM-defined codepages defined only for local
compatibility (some of them are installable on Windows using the Regional
Settings control panel).
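
As a small illustration of the full versus partial remapping, the sketch
below uses Python's "cp037" codec as a stand-in for "an EBCDIC codepage".
That choice is an assumption made for the example, as are the function
names; other EBCDIC variants will behave differently on the 80-9F range.

```python
# Assumption: Python's "cp037" codec stands in for an EBCDIC codepage.
ALL_BYTES = bytes(range(256))

def ebcdic_roundtrip(raw, source_encoding, ebcdic="cp037"):
    """Recode raw bytes to EBCDIC and back; return the bytes recovered."""
    text = raw.decode(source_encoding)
    return text.encode(ebcdic).decode(ebcdic).encode(source_encoding)

def recodable(raw, source_encoding, ebcdic="cp037"):
    """True if every byte has a mapping on both sides of the conversion."""
    try:
        raw.decode(source_encoding).encode(ebcdic)
        return True
    except UnicodeError:  # covers both decode and encode failures
        return False

# ISO-8859-1 with C0 and C1 controls: every one of the 256 codes survives.
latin1_ok = ebcdic_roundtrip(ALL_BYTES, "latin-1") == ALL_BYTES

print("latin-1 full roundtrip:", latin1_ok)
print("cp1252 0xE9 recodable:", recodable(b"\xe9", "cp1252"))
print("cp1252 0x80 recodable:", recodable(b"\x80", "cp1252"))
print("cp1252 0x90 recodable:", recodable(b"\x90", "cp1252"))
```

Here the Windows-1252 side is only partially convertible: a byte like 0x80
decodes to a character this particular EBCDIC codepage does not contain,
and 0x90 has no mapping to start from.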

Note also that the .Net conversion libraries allow applications to specify
the fallback mechanism to use for undefined code positions; but the
ISO-8859-* decoders are guaranteed never to throw a decoder exception, and
will never return U+FFFD, a fallback "?" character, or any C0 control like
SUB unless a SUB is actually encoded in the byte stream.

Note finally that on Windows, Internet Explorer does not decode ISO-8859-1
using the standard assignment registered with IANA. One (good?) reason is
that most C0 and C1 controls are illegal in standard HTML/XML documents,
even when using a charset like "iso-8859-1" that maps them; but Internet
Explorer will not invalidate the document if instructed not to "guess"
another encoding. Instead it handles the "ISO-8859-1" tagging as if it were
"Windows-1252" (meaning effectively that 0x80 will still be rendered as a
euro symbol, even if the document declares itself as encoded with the
"ISO-8859-1" IANA-registered charset).

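The difference between the two readings of the same label can be sketched
like this. The alias table and both function names are hypothetical; they
only mimic the behavior described above and are not taken from any browser.

```python
# Hypothetical alias table mimicking the quirk described above: the declared
# label "iso-8859-1" is silently decoded as Windows-1252.
QUIRK_ALIASES = {"iso-8859-1": "cp1252"}

def quirk_decode(raw, declared_label):
    """Decode the way the quirk does: swap the codec behind the label."""
    label = declared_label.strip().lower()
    codec = QUIRK_ALIASES.get(label, label)
    # What the quirk does with still-undefined bytes is not specified here;
    # U+FFFD is an arbitrary choice for this sketch.
    return raw.decode(codec, errors="replace")

def registered_decode(raw, declared_label):
    """Decode strictly according to the declared charset label."""
    return raw.decode(declared_label.strip().lower())

page_bytes = b"Price: 5 \x80"
print(repr(quirk_decode(page_bytes, "ISO-8859-1")))       # euro symbol
print(repr(registered_decode(page_bytes, "ISO-8859-1")))  # C1 control U+0080
```
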
Some Microsoft tools generate bogus documents, such as web design tools
(like FrontPage): they allow inserting euro symbols encoded as 0x80, or
bullets, in ISO-8859-* charsets without any warning given to the user when
saving the HTML page. At the least, these tools should propose switching
the encoding to Windows-1252 or to Unicode UTF-8, or they should use named
or numeric character entities. When you edit a standard document declared
with ISO-8859-* and using the expected and correct named or numeric
character entities, and then save the edited HTML file, the tool silently
replaces the entities with single-byte codes, without changing the declared
encoding and without prompting the user to do so. If the user keeps the
ISO-8859-1 charset, the euro symbols, bullets, ellipses, curly quotation
marks or apostrophes should be saved as character entities in the HTML
document.
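
What "saved as character entities" would look like can be sketched with
Python's "xmlcharrefreplace" error handler. The two function names are
invented for the example; the second one reproduces the faulty save
described above for comparison.

```python
def save_as_latin1_html(text):
    """Encode for a page declared as ISO-8859-1: anything outside the
    charset becomes a numeric character reference instead of a stray byte."""
    return text.encode("latin-1", errors="xmlcharrefreplace")

def buggy_save(text):
    """The behavior criticized above: Windows-1252 bytes written into a page
    that still declares ISO-8859-1."""
    return text.encode("cp1252")

content = "10 \u20ac \u2022 caf\u00e9\u2026"   # euro, bullet, e-acute, ellipsis
print(save_as_latin1_html(content))
print(buggy_save(content))
```

The first form stays valid under the declared charset; the second only
displays as intended in a browser that treats the label as Windows-1252.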

I think this has always been a severe bug of FrontPage. It has existed and
persisted for many years now, it also exists in IE itself when reading the
page's HTML content from the DOM, and it has never been corrected although
it was reported long ago. It causes severe compatibility problems, except
with Internet Explorer, which silently but incorrectly interprets a
specified "ISO-8859-1" charset as if it were "Windows-1252". For this
reason, it is best to describe the situation by saying that Internet
Explorer does not correctly support the ISO-8859-* registered charsets.
This non-standard behavior has however been added to other browsers to
support the many web pages relying on this IE "quirk" mode. This old IE
behavior should never be mimicked in standard mode, as it does not respect
the HTML and XML standards, which clearly indicate that the IANA charset
registration must be respected. Apparently the bug is in the DOM HTML
implementation of IE and affects FrontPage directly, as it uses IE's DOM
engine to perform the actual edits or to save the HTML code of the edited
pages.

This archive was generated by hypermail 2.1.5 : Sat May 10 2008 - 09:06:34 CDT