Re: ISO 10646 compliance and EU law

From: Kenneth Whistler (kenw@sybase.com)
Date: Thu Jan 06 2005 - 15:34:45 CST

Next message: Mark Davis: "Re: Re: ISO 10646 compliance and EU law"

Previous message: Kenneth Whistler: "Re: Re: ISO 10646 compliance and EU law"
Maybe in reply to: Philipp Reichmuth: "Re: ISO 10646 compliance and EU law"
Next in thread: E. Keown: "Re: ISO 10646 compliance and EU law"
Maybe reply: Philippe VERDY: "Re: Re: ISO 10646 compliance and EU law"
Maybe reply: Philippe VERDY: "Re: Re: ISO 10646 compliance and EU law"
Maybe reply: Kenneth Whistler: "Re: Re: ISO 10646 compliance and EU law"

> >> For example, an application exchanging data encoded with the GB18030
> >> charset
> >> will be conforming, provided that it restricts itself to using only the
> >> intersection of the GB18030 repertoire and the ISO/IEC 10646 repertoire.
> >
> > This is false. An application exchanging data encoded with GB18030 may
> > be conforming to the GB18030 standard, but it is not thereby conformant
> > to ISO/IEC 10646. If it exchanges a LATIN LETTER A WITH ACUTE with
> > the byte sequence <A8 A2>, then it is indeed conforming to GB18030, but
> > that is not a conformant representation of LATIN LETTER A WITH ACUTE
> > in any encoding form for ISO/IEC 10646 (or the Unicode Standard).
> >
> > You are confusing the possibility of interoperability between GB18030
> > data (and applications) and Unicode data (and applications) with
> > the issues of conformance to particular standards.
>
> Isn't the GB18030 encoding <A8 A2> *mapped* to U+00E1 (LATIN SMALL LETTER A
> WITH ACUTE)?

Of course.

> Or are you saying that this GB18030 sequence does not make the distinction
> between small and capital Latin letters?

No, of course not.

If an application is representing LATIN SMALL LETTER A WITH ACUTE as
<A8 A2>, then it is conforming with GB 18030-2000. (And also,
not coincidentally, GB 2312-1980 and Microsoft Code Page 936.)

If an application is representing LATIN SMALL LETTER A WITH ACUTE as
U+00E1 (<0xC3 0xA1>, 0x00E1, 0x000000E1, depending on encoding form),
then it is conforming with the Unicode Standard (and ISO/IEC 10646:2003).
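
As a rough illustration of the two paragraphs above, here is a minimal
Python sketch. It assumes Python's built-in "gb18030", "utf-8",
"utf-16-be" and "utf-32-be" codecs; the variable names are only for
illustration and come from neither standard.

```python
# LATIN SMALL LETTER A WITH ACUTE as a single code point.
ch = "\u00e1"

# The three Unicode encoding forms (big-endian byte order for the 16- and 32-bit forms).
utf8 = ch.encode("utf-8")
utf16 = ch.encode("utf-16-be")
utf32 = ch.encode("utf-32-be")

# The GB 18030 representation of the same character.
gb = ch.encode("gb18030")

for label, data in (("UTF-8", utf8), ("UTF-16", utf16),
                    ("UTF-32", utf32), ("GB 18030", gb)):
    print(f"{label:9}", data.hex(" ").upper())
```

None of the three Unicode byte sequences is <A8 A2>, which is the point:
the character is the same, the conformant representations are not.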

If an application is mapping between the two, then it is interoperating.
But the fact that a mapping table exists does not demonstrate that
a GB 18030 application itself is conforming to the Unicode Standard.
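
To make "interoperating" concrete, a small sketch, again assuming
Python's built-in codecs. The helper name gb18030_to_utf8 is
hypothetical, chosen only for this example.

```python
def gb18030_to_utf8(data: bytes) -> bytes:
    """Transcode GB 18030 bytes to UTF-8 via the codec's mapping table."""
    return data.decode("gb18030").encode("utf-8")

gb_bytes = b"\xa8\xa2"                  # what the GB 18030 application exchanges
utf8_bytes = gb18030_to_utf8(gb_bytes)  # what a UTF-8 application expects

print(gb_bytes.hex(" ").upper(), "->", utf8_bytes.hex(" ").upper())
```

The conversion step is where the interoperability lives. The bytes on
the left are still not a Unicode encoding form, whatever the table says.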

> When I look into the standard GB18030 mapping file (or even if I use only
> the MS Windows 936 Chinese PRC charset mapping, which is an extension of
> GB2312 that includes a part of the GB18030 standard), there's absolutely no
> ambiguity to which abstract ISO/IEC 10646 character it corresponds: i.e. its
> codepoint.

In *that* version of the table, now. It has changed in the past,
and it will change in the future.

[omitting the lecture on abstract characters...]

> So, show me one example in the GB18030, where there's a valid unbreakable
> processing unit encoded, which does not represent a valid ISO/IEC10646
> code-point/abstract-character, and I will accept your fact.

I gave one in my last email. Here's another: GB 18030-2000 FEA0. That
is a topside component of CJK characters. It can be seen, for example,
as the top half of U+535B. GB 18030-2000 maps it to the PUA character
U+E864 in Unicode (and 10646).

The CJK component is not the same abstract character as an assigned
PUA code point in 10646. It is *mapped* to that code point, but that
is a separate issue. In fact, Unicode applications are perfectly free
to interpret U+E864 as something entirely different (and often do).

And the point should be driven home by the fact that the CJK component
in question *has* been identified as an abstract character to be
encoded in Unicode (and 10646), where it will be encoded as
U+9FBB CJK UNIFIED IDEOGRAPH-9FBB. Once that happens, FEA0 in
GB 18030-2000 will "represent a valid ISO/IEC 10646 code-point/
abstract-character" as you put it. But U+E864 will, of course,
still be a valid PUA code point in Unicode. It will no longer be
mapped to FEA0.
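
A sketch of that situation in Python. The two dictionaries are
hypothetical one-entry excerpts written out by hand, not real mapping
files: the first holds the GB 18030-2000 mapping described above, the
second the mapping anticipated once U+9FBB is encoded. Only the
standard unicodedata module is assumed.

```python
import unicodedata

# Hypothetical one-entry excerpts of two versions of the mapping table.
GB18030_2000 = {b"\xfe\xa0": "\ue864"}     # current mapping: a PUA code point
GB18030_REVISED = {b"\xfe\xa0": "\u9fbb"}  # anticipated mapping after U+9FBB is encoded

def describe(ch: str) -> str:
    """Return 'private use' for PUA code points, else the character name."""
    if unicodedata.category(ch) == "Co":
        return "private use"
    return unicodedata.name(ch)

for label, table in (("2000", GB18030_2000), ("revised", GB18030_REVISED)):
    ch = table[b"\xfe\xa0"]
    print(f"{label}: FEA0 -> U+{ord(ch):04X} ({describe(ch)})")
```

The same GB 18030 byte sequence lands on two different code points
depending on the table version, and only one of them names an encoded
abstract character.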

Any questions?

> I hope this is
> not the case, or compliance with the GB18030 standard requires more than
> what we have read until now, because I have always assumed that GB18030
> could be safely decoded/reencoded as a valid encoding scheme representing
> Unicode/ISO/IEC10646 codepoints, without ambiguities or fallbacks.

Please think again.

--Ken

This archive was generated by hypermail 2.1.5 : Thu Jan 06 2005 - 15:38:08 CST