Re: ISO 10646 compliance and EU law

From: Kenneth Whistler (kenw@sybase.com)
Date: Thu Jan 06 2005 - 15:34:45 CST


    > >> For example, an application exchanging data encoded with the GB18030
    > >> charset
    > >> will be conforming, provided that it restricts itself to using only the
    > >> intersection of the GB18030 repertoire and the ISO/IEC 10646 repertoire.
    > >
    > > This is false. An application exchanging data encoded with GB18030 may
    > > be conforming to the GB18030 standard, but it is not thereby conformant
    > > to ISO/IEC 10646. If it exchanges a LATIN LETTER A WITH ACUTE with
    > > the byte sequence <A8 A2>, then it is indeed conforming to GB18030, but
    > > that is not a conformant representation of LATIN LETTER A WITH ACUTE
    > > in any encoding form for ISO/IEC 10646 (or the Unicode Standard).
    > >
    > > You are confusing the possibility of interoperability between GB18030
    > > data (and applications) and Unicode data (and applications) with
    > > the issues of conformance to particular standards.
    >
    > Isn't the GB18030 encoding <A8 A2> *mapped* to U+00E1 (LATIN SMALL LETTER A
    > WITH ACUTE)?

    Of course.

    > Or are you saying that this GB18030 sequence does not make the distinction
    > between small and capital Latin letters?

    No, of course not.

    If an application is representing LATIN SMALL LETTER A WITH ACUTE as
    <A8 A2>, then it is conforming with GB 18030-2000. (And also,
    not coincidentally, GB 2312-1980 and Microsoft Code Page 936.)

    If an application is representing LATIN SMALL LETTER A WITH ACUTE as
    U+00E1 (<0xC3 0xA1>, 0x00E1, 0x000000E1, depending on encoding form),
    then it is conforming with the Unicode Standard (and ISO/IEC 10646:2003).

    If an application is mapping between the two, then it is interoperating.
    But the fact that a mapping table exists does not demonstrate that
    a GB 18030 application itself is conforming to the Unicode Standard.
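
    The three cases can be checked mechanically. What follows is only a
    minimal sketch, assuming Python's built-in "gb18030" codec agrees with
    the mapping discussed here for <A8 A2>; the names is_valid_utf8,
    gb_bytes and friends are illustrative, not part of any standard.

```python
ch = "\u00e1"  # LATIN SMALL LETTER A WITH ACUTE

# Representation under GB 18030 (assumes the codec's table maps it as described).
gb_bytes = ch.encode("gb18030")

# Representations in the three Unicode encoding forms (big-endian, no BOM).
utf8_bytes = ch.encode("utf-8")
utf16_bytes = ch.encode("utf-16-be")
utf32_bytes = ch.encode("utf-32-be")


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data is a well-formed UTF-8 byte sequence."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Mapping from one representation to the other is interoperation:
# decode with one standard's table, re-encode in a Unicode encoding form.
converted = gb_bytes.decode("gb18030").encode("utf-8")

print("GB 18030:", gb_bytes.hex())
print("UTF-8:   ", utf8_bytes.hex())
print("UTF-16:  ", utf16_bytes.hex())
print("UTF-32:  ", utf32_bytes.hex())
print("GB bytes are well-formed UTF-8?", is_valid_utf8(gb_bytes))
```

    The GB 18030 bytes are not themselves a well-formed sequence in any
    Unicode encoding form; only the output of the conversion step is.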

    > When I look into the standard GB18030 mapping file (or even if I use only
    > the MS Windows 936 Chinese PRC charset mapping, which is an extension of
    > GB2312 that includes a part of the GB18030 standard), there's absolutely no
    > ambiguity to which abstract ISO/IEC 10646 character it corresponds: i.e. its
    > codepoint.

    In *that* version of the table, now. It has changed in the past,
    and it will change in the future.

    [omitting the lecture on abstract characters...]

    > So, show me one example in the GB18030, where there's a valid unbreakable
    > processing unit encoded, which does not represent a valid ISO/IEC10646
    > code-point/abstract-character, and I will accept your fact.

    I gave one in my last email. Here's another: GB 18030-2000 FEA0. That
    is a topside component of CJK characters. It can be seen, for example,
    as the top half of U+535B. GB 18030-2000 maps it to the PUA character
    U+E864 in Unicode (and 10646).

    The CJK component is not the same abstract character as an assigned
    PUA code point in 10646. It is *mapped* to that code point, but that
    is a separate issue. In fact, Unicode applications are perfectly free
    to interpret U+E864 as something entirely different (and often do).

    And the point should be driven home by the fact that the CJK component
    in question *has* been identified as an abstract character to be
    encoded in Unicode (and 10646), where it will be encoded as
    U+9FBB CJK UNIFIED IDEOGRAPH-9FBB. Once that happens, FEA0 in
    GB 18030-2000 will "represent a valid ISO/IEC 10646 code-point/
    abstract-character" as you put it. But U+E864 will, of course,
    still be a valid PUA code point in Unicode. It will simply no longer
    be the code point that FEA0 is mapped to.
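
    The effect of such a table change can be sketched as follows. The two
    dictionaries are hypothetical one-entry excerpts built only from the
    description above (they are not real mapping files), and decode_with
    and is_private_use are illustrative helper names.

```python
import unicodedata

# Hypothetical one-entry excerpts of two versions of the mapping table.
TABLE_GB18030_2000 = {b"\xfe\xa0": "\ue864"}  # CJK component mapped to the PUA
TABLE_AFTER_ENCODING = {b"\xfe\xa0": "\u9fbb"}  # assumed remapping once U+9FBB exists


def decode_with(table: dict, data: bytes) -> str:
    """Look up a GB 18030 byte sequence in the given mapping table."""
    return table[data]


def is_private_use(ch: str) -> bool:
    """True if the code point has General_Category Co (private use)."""
    return unicodedata.category(ch) == "Co"


old = decode_with(TABLE_GB18030_2000, b"\xfe\xa0")
new = decode_with(TABLE_AFTER_ENCODING, b"\xfe\xa0")

# The same GB 18030 bytes yield different code points under the two tables,
# and only the older result is a private-use code point.
print(f"old: U+{ord(old):04X} private use: {is_private_use(old)}")
print(f"new: U+{ord(new):04X} private use: {is_private_use(new)}")
```

    Data converted with the older table keeps U+E864, which another
    Unicode application remains free to interpret differently.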

    Any questions?

    > I hope this is
    > not the case, or compliance with the GB18030 standard requires more than
    > what we have read until now, because I have always assumed that GB18030
    > could be safely decoded/reencoded as a valid encoding scheme representing
    > Unicode/ISO/IEC10646 codepoints, without ambiguities or fallbacks.

    Please think again.

    --Ken


