Re: ISO 10646 compliance and EU law

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Thu Jan 06 2005 - 11:42:21 CST

    From: "Kenneth Whistler" <kenw@sybase.com>
    > Philippe said some interesting things about the status of
    > EU recommendations, Directives, etc., but...
    >
    >> For example, an application exchanging data encoded with the GB18030
    >> charset
    >> will be conforming, provided that it restricts itself to using only the
    >> intersection of the GB18030 repertoire and the ISO/IEC 10646 repertoire.
    >
    > This is false. An application exchanging data encoded with GB18030 may
    > be conforming to the GB18030 standard, but it is not thereby conformant
    > to ISO/IEC 10646. If it exchanges a LATIN LETTER A WITH ACUTE with
    > the byte sequence <A8 A2>, then it is indeed conforming to GB18030, but
    > that is not a conformant representation of LATIN LETTER A WITH ACUTE
    > in any encoding form for ISO/IEC 10646 (or the Unicode Standard).
    >
    > You are confusing the possibility of interoperability between GB18030
    > data (and applications) and Unicode data (and applications) with
    > the issues of conformance to particular standards.

    Isn't the GB18030 encoding <A8 A2> *mapped* to U+00E1 (LATIN SMALL LETTER A
    WITH ACUTE)?
    Or are you saying that this GB18030 sequence does not make the distinction
    between small and capital Latin letters?

    When I look into the standard GB18030 mapping file (or even if I use only
    the MS Windows 936 Chinese PRC charset mapping, which is an extension of
    GB2312 that includes a part of the GB18030 standard), there's absolutely no
    ambiguity as to which abstract ISO/IEC 10646 character it corresponds to,
    i.e. its codepoint.
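
    A minimal sketch of that lookup, assuming Python's standard "gb18030"
    codec follows the published mapping table for this two-byte sequence (the
    variable names are only illustrative):

```python
import unicodedata

# The byte sequence <A8 A2> discussed above.
gb_bytes = bytes([0xA8, 0xA2])

# Decode it with the standard library's GB18030 codec.
decoded = gb_bytes.decode("gb18030")

# Report the code point and the character name it maps to.
code_point = f"U+{ord(decoded):04X}"
char_name = unicodedata.name(decoded)
print(code_point, char_name)

# Re-encoding gives back the same two bytes.
print(decoded.encode("gb18030") == gb_bytes)
```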

    GB18030 is an encoding scheme, but not necessarily the "encoding form" used
    in applications for their internal representation of strings. At least on
    Windows, you can use the so-called "Unicode" string APIs to load a GB18030
    encoded stream into the internal form.

    *YOU* are implying that I confuse a charset (the combination of an
    encoding scheme and an encoded repertoire of characters, to which an
    interchangeable, registered charset name is assigned), used for
    serialization and interchange of texts through streams of bytes, with an
    encoding form (the internal representation used by string objects in
    applications).

    As GB18030 has an unambiguous and bijective mapping to the associated
    Unicode codepoints, it is a valid encoding form for ISO/IEC 10646. This is
    possible because the GB18030 character repertoire encodes *abstract*
    characters too.

    The important keyword in the last sentence is "abstract". It means that
    there's an abstraction of what constitutes a "character". This abstraction
    mostly depends on the character model used for the scripts that are modeled
    in the character repertoire. If this abstraction (i.e. character model) is
    the same as, or fully compatible with, the abstraction in ISO/IEC 10646,
    then the charset itself becomes a compliant application of the ISO/IEC
    10646 standard, because its encoded (or encodable) repertoire will be fully
    included in ISO/IEC 10646. For GB18030 the two encoded repertoires are
    equal, and the two sets of valid codes are related by a mapping function
    which is fully bijective; this mapping function and its inverse then define
    an equivalence relation.
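
    The round trip can be spot-checked on a few code points. This is only a
    sketch, assuming Python's standard "gb18030" codec; the sample list and
    the helper round_trips() are hypothetical, and a handful of samples proves
    nothing about the whole table:

```python
# A few sample characters: ASCII, Latin-1, CJK, and a supplementary-plane
# code point (the selection is arbitrary).
samples = ["A", "\u00e1", "\u4e2d", "\U00010000"]


def round_trips(text, codec="gb18030"):
    """Return True if encoding then decoding gives back the same string."""
    return text.encode(codec).decode(codec) == text


# Map each sample code point to its round-trip result.
results = {f"U+{ord(ch):04X}": round_trips(ch) for ch in samples}

# Number of bytes GB18030 uses for each sample.
encoded_lengths = {
    f"U+{ord(ch):04X}": len(ch.encode("gb18030")) for ch in samples
}

for label, ok in results.items():
    print(label, ok, encoded_lengths[label])
```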

    GB18030 would have been non-conforming if it had included special
    constructions to represent the virtual character repertoire. For example, a
    theoretical charset could be built from the codes of collation keys
    generated by, and compatible with, the UCA.

    In such a theoretical charset, an existing Unicode/ISO/IEC 10646 abstract
    character would be represented (modeled) by a leading code specifying the
    script and letter type, followed by a distinct code for distinguishing
    lowercase from uppercase, followed by other codes to add diacritics.
    In such a theoretical charset, there would exist additional "abstract
    characters" to represent the collation level keys, as part of the abstract
    repertoire. This would indeed have interesting properties for handling texts
    (notably for full-text searches, or indexing, or for helping renderers with
    fallbacks). But as these new codes assigned for the additional collation
    differences would not match the abstraction (character model) used in
    ISO/IEC 10646, a valid string encoded in the theoretical charset would not
    necessarily be valid, or even simply encodable, in ISO/IEC 10646.

    Another example: a charset that would encode glyphic differences for the
    same abstract characters in ISO/IEC 10646 would not be compatible with it
    (ISO/IEC 10646 would require that the glyphic differences be encoded out of
    band, with an upper layer for rich texts).

    Same thing for charsets that assign some codes to specific "unbreakable"
    words which ISO/IEC 10646 considers to be composed of several abstract
    characters.

    GB18030 is not such a theoretical incompatible charset, nor are the ISO
    charsets. The objective of ISO/IEC 10646 was precisely to allow mapping all
    standard charsets to a common "universal" one, using the same level of
    abstraction for characters (i.e. using the same character model).

    However, this caused some difficulties, because some characters that users
    legitimately consider equivalent were then given distinct abstract code
    points, or the encoding models were distinct. To solve that problem, Unicode
    added the idea of "canonical equivalence" (which is not part of the
    ISO/IEC 10646 conformance requirements, because ISO/IEC 10646 does not
    assign the equivalence mappings and combining classes that Unicode
    defines). This magically solved most (not all!) of the problems between
    otherwise incompatible character models used in Latin, Greek, and Cyrillic,
    where unification of abstract characters was not possible without breaking
    the mappability of legacy charsets referenced in the ISO/IEC 10646 standard.
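
    Canonical equivalence can be illustrated with the precomposed and the
    decomposed spellings of the same letter. A small sketch, assuming Python's
    standard unicodedata module (the variable names are only illustrative):

```python
import unicodedata

# Two spellings of "a with acute": one precomposed code point, and a base
# letter followed by COMBINING ACUTE ACCENT.
precomposed = "\u00e1"
decomposed = "a\u0301"

# As raw code point sequences, the two strings differ.
same_code_points = precomposed == decomposed

# After normalization they compare equal, in either direction.
nfc_equal = unicodedata.normalize("NFC", decomposed) == precomposed
nfd_equal = unicodedata.normalize("NFD", precomposed) == decomposed

print(same_code_points, nfc_equal, nfd_equal)
```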

    If you really want to exhibit a standard which is NOT compatible with
    ISO/IEC 10646, you will need to consider ISO 2022: it defines abstract
    sequences (which can be viewed as abstract characters) which are not part of
    the repertoire encoded in ISO/IEC 10646: think about the escape sequences
    that allow selecting charsets. For ISO/IEC 10646, these sequences have no
    code points, and are void. But for ISO 2022, they have their own abstract
    identity, which is necessary to interpret correctly the other characters
    encoded in the ISO 2022 text.
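
    The effect is easy to observe with an ISO 2022 based charset. A sketch,
    assuming Python's standard "iso2022_jp" codec as the example (the choice
    of charset and sample text is mine): the escape sequences appear in the
    byte stream, but nothing corresponding to them survives decoding.

```python
# One ASCII letter followed by HIRAGANA LETTER A, which forces a switch to
# another character set inside the ISO-2022-JP byte stream.
text = "A\u3042"

encoded = text.encode("iso2022_jp")
decoded = encoded.decode("iso2022_jp")

# Count the ESC bytes (0x1B) that introduce charset-selection sequences.
escape_count = encoded.count(b"\x1b")

print(encoded)
print(escape_count, len(text), len(decoded))

# The decoded string contains only the two original characters; the escape
# sequences have no code point of their own.
print("\x1b" in decoded)
```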

    Same thing for the abstract sequences defined in Videotex/Teletex or in many
    terminal emulation encoding standards to select attributes for the
    surrounding characters, or to add some additional semantics (similar to
    markup in SGML, HTML or XML). These valid encoded sequences are not abstract
    characters for ISO/IEC 10646 or for Unicode, but they are unbreakable
    processing units for the related other standards.

    So, show me one example in GB18030 where a valid unbreakable processing
    unit is encoded which does not represent a valid ISO/IEC 10646
    code-point/abstract-character, and I will accept your point. I hope there
    is no such case; otherwise compliance with the GB18030 standard requires
    more than what we have read until now, because I have always assumed that
    GB18030 could be safely decoded/reencoded as a valid encoding scheme
    representing Unicode/ISO/IEC 10646 codepoints, without ambiguities or
    fallbacks.


