From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Thu Jan 06 2005 - 11:42:21 CST
From: "Kenneth Whistler" <kenw@sybase.com>
> Philippe said some interesting things about the status of
> EU recommendations, Directives, etc., but...
>
>> For example, an application exchanging data encoded with the GB18030
>> charset will be conforming, provided that it restricts itself to using
>> only the intersection of the GB18030 repertoire and the ISO/IEC 10646
>> repertoire.
>
> This is false. An application exchanging data encoded with GB18030 may
> be conforming to the GB18030 standard, but it is not thereby conformant
> to ISO/IEC 10646. If it exchanges a LATIN LETTER A WITH ACUTE with
> the byte sequence <A8 A2>, then it is indeed conforming to GB18030, but
> that is not a conformant representation of LATIN LETTER A WITH ACUTE
> in any encoding form for ISO/IEC 10646 (or the Unicode Standard).
>
> You are confusing the possibility of interoperability between GB18030
> data (and applications) and Unicode data (and applications) with
> the issues of conformance to particular standards.
Isn't the GB18030 encoding <A8 A2> *mapped* to U+00E1 (LATIN SMALL LETTER A
WITH ACUTE)?
Or are you saying that this GB18030 sequence does not make the distinction
between small and capital Latin letters?
When I look into the standard GB18030 mapping file (or even if I use only
the MS Windows 936 Chinese PRC charset mapping, which is an extension of
GB2312 that includes a part of the GB18030 standard), there's absolutely no
ambiguity as to which abstract ISO/IEC 10646 character it corresponds to,
i.e. its codepoint.
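That mapping can be checked with a minimal Python sketch. It assumes that the
"gb18030" codec bundled with CPython follows the standard mapping table; the
variable names are illustrative only.

```python
import unicodedata

# The two-byte GB18030 sequence under discussion.
gb_bytes = bytes([0xA8, 0xA2])

# Decode with CPython's built-in gb18030 codec (assumed here to follow
# the standard mapping table).
decoded = gb_bytes.decode("gb18030")

code_point = f"U+{ord(decoded):04X}"
char_name = unicodedata.name(decoded)

print(code_point, char_name)
```

Under that assumption the sequence decodes to a single code point, U+00E1,
and encoding it again gives back the same two bytes.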
GB18030 is an encoding scheme, but not necessarily the "encoding form" used
in applications for their internal representation of strings. At least on
Windows, you can use the so-called "Unicode" string APIs to load a GB18030
encoded stream into the internal form.
*YOU* are implying that I am confusing a charset (which is the
combination of an encoding scheme and an encoded repertoire of characters, to
which an interchangeable and registered charset name is assigned) used for
serialization and interchange of texts through streams of bytes, with an
encoding form (which is the internal representation used by string objects
in applications).
As GB18030 has an unambiguous and bijective mapping to the associated
Unicode codepoints, it is a valid encoding form for ISO/IEC 10646: this is
possible because the GB18030 character repertoire encodes *abstract*
characters too.
The important keyword in the last sentence is "abstract". It means that
there's an abstraction of what constitutes a "character". This abstraction
mostly depends on the character model used for the scripts that are modeled
in the character repertoire. If this abstraction (i.e. character model) is
the same as, or fully compatible with, the abstraction in ISO/IEC 10646, then
the charset itself becomes a compliant application of the ISO/IEC 10646
standard, because its (encoded or encodable) repertoire will be fully
included in ISO/IEC 10646 (for GB18030 the two encoded repertoires are
equal, and the two sets of valid codes are related by a mapping function
which is fully bijective; this mapping function and its inverse
then define an equivalence relation).
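The round trip described here can be spot-checked on a handful of code points.
This is only a sketch over a small, arbitrarily chosen sample, assuming
CPython's "gb18030" codec; it does not prove bijectivity over the whole
repertoire.

```python
# A few sample characters: ASCII, Latin-1, a CJK ideograph, and a
# supplementary-plane code point (illustrative choices, not a full test).
samples = ["A", "\u00e1", "\u4e2d", "\U00010000"]


def round_trips(ch: str) -> bool:
    """Return True if ch survives encode -> decode through GB18030."""
    return ch.encode("gb18030").decode("gb18030") == ch


results = {f"U+{ord(ch):04X}": round_trips(ch) for ch in samples}

for cp, ok in results.items():
    print(cp, "round-trips" if ok else "does NOT round-trip")
```

A complete check would have to walk every assigned code point and every valid
byte sequence, which this sample deliberately does not attempt.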
What would have been non-conforming is if GB18030 had included special
constructions to represent the virtual character repertoire. For example, a
theoretical charset could be built using the codes generated by
UCA-compatible collation keys.
In such a theoretical charset, an existing Unicode/ISO/IEC 10646 abstract
character would be represented (modeled) by a leading code specifying the
script and letter type, followed by a distinct code for making distinctions
between lowercase and uppercase, followed by other codes to add diacritics.
In such a theoretical charset, there would exist additional "abstract
characters" to represent the collation level keys, as part of the abstract
repertoire. This would indeed have interesting properties for handling texts
(notably for full-text searches, or indexing, or for helping renderers with
fallbacks). But as these new codes assigned for the additional collation
differences would not match the abstraction (character model) used in ISO/IEC
10646, a valid string encoded in the theoretical charset would not
necessarily be valid, or even simply encodable, in ISO/IEC 10646.
Another example: a charset that would encode glyphic differences for the
same abstract characters in ISO/IEC 10646 would not be compatible with it
(ISO/IEC 10646 would require that the glyphic differences be encoded out of
band, with an upper layer for rich texts).
The same goes for charsets that assign some codes to specific "unbreakable"
words which ISO/IEC 10646 considers to be composed of
several abstract characters.
GB18030 is not such a theoretical incompatible charset, and neither are the
other ISO charsets. The objective of ISO/IEC 10646 was to allow mapping all
standard charsets to a common "universal" one, using the same level of
abstraction for characters (i.e. using the same character model).
However, this caused some difficulties, because some characters that users
legitimately consider equivalent were then given distinct abstract code
points, or the encoding models were distinct. To solve that problem, Unicode
added the idea of "canonical equivalence" (which is not part of the
ISO/IEC 10646 conformance requirements, because ISO/IEC 10646 does not assign
the equivalence mappings and combining classes that Unicode defines). This
magically solved most (not all!) of the problems between otherwise
incompatible character models used in Latin, Greek, and Cyrillic, when
unification of abstract characters was not possible without breaking the
mappability of legacy charsets referenced in the ISO/IEC 10646 standard.
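As an illustration of canonical equivalence, here is a short Python sketch
using the standard "unicodedata" module. The choice of U+00E1 versus
"a" + U+0301 is just an example pair; the variable names are illustrative.

```python
import unicodedata

precomposed = "\u00e1"   # LATIN SMALL LETTER A WITH ACUTE
decomposed = "a\u0301"   # LATIN SMALL LETTER A + COMBINING ACUTE ACCENT

# The two strings are distinct sequences of code points...
same_code_points = precomposed == decomposed

# ...but canonically equivalent: each normalizes to the other.
nfc_equal = unicodedata.normalize("NFC", decomposed) == precomposed
nfd_equal = unicodedata.normalize("NFD", precomposed) == decomposed

# The canonical combining class is Unicode character data, used for the
# canonical ordering of combining marks.
acute_ccc = unicodedata.combining("\u0301")

print(same_code_points, nfc_equal, nfd_equal, acute_ccc)
```

The equivalence and the combining class both come from the Unicode Character
Database, which is the part that ISO/IEC 10646 itself does not define.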
If you really want to exhibit a standard which is NOT compatible with
ISO/IEC 10646, you will need to consider ISO 2022: it defines abstract
sequences (which can be viewed as abstract characters) which are not part of
the repertoire encoded in ISO/IEC 10646: think about the escape sequences
that allow selecting charsets. For ISO/IEC 10646, these sequences have no
code points, and are void. But for ISO 2022, they have their own abstract
identity, which is necessary to allow correct interpretation of the other
characters encoded in the ISO 2022 text.
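The role of those escape sequences can be seen in a small Python sketch. It
assumes CPython's "iso2022_jp" codec as a convenient stand-in for an
ISO 2022 based encoding; the sample character is an arbitrary choice.

```python
ESC = b"\x1b"
text = "\u3042"  # HIRAGANA LETTER A, an arbitrary sample character

# Encoding switches character sets with escape sequences in the byte stream.
encoded = text.encode("iso2022_jp")

# Decoding consumes those escape sequences; they produce no code points.
decoded = encoded.decode("iso2022_jp")

escape_count = encoded.count(ESC)

print(encoded)
print([f"U+{ord(c):04X}" for c in decoded])
```

The byte stream carries charset-switching escapes, while the decoded string
holds only the one code point: the escapes exist in the ISO 2022 stream but
have no counterpart in the ISO/IEC 10646 repertoire.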
Same thing for the abstract sequences defined in Videotex/Teletex or in many
terminal emulation encoding standards to select attributes for the
surrounding characters, or to add some additional semantics (similar to
markup in SGML, HTML or XML). These valid encoded sequences are not abstract
characters for ISO/IEC 10646 or for Unicode, but they are unbreakable
processing units for the related other standards.
So, show me one example in GB18030 where a valid unbreakable
processing unit is encoded which does not represent a valid ISO/IEC 10646
code-point/abstract-character, and I will accept your point. I hope this is
not the case, or compliance with the GB18030 standard requires more than
what we have read so far, because I have always assumed that GB18030
could be safely decoded/reencoded as a valid encoding scheme representing
Unicode/ISO/IEC 10646 codepoints, without ambiguities or fallbacks.