GB18030 and Unicode/ISO/IEC 10646 mappings; the case of PUAs

From: Philippe VERDY (verdy_p@wanadoo.fr)
Date: Fri Jan 07 2005 - 11:31:57 CST

Next message: Mike Ayers: "RE: ISO 10646 & GB18030 repetoire"

Previous message: saqqara: "Egyptian Hieroglyphs PUA implementation"

>From: "Peter Kirk"
> It is surely possible for Unicode, within its stability policy, to add new
> precomposed characters with canonical decompositions if these are also
> defined as composition exceptions, alongside existing Hebrew, Arabic etc
> presentation forms which are composition exceptions. The UTC might be
> reluctant to do so, but if it comes under strong pressure to do this from
> the Chinese standards body through WG2, I see no compelling reason why the
> UTC should refuse this. There is nothing in the stability policy to force
> it to refuse new presentation forms of this kind.

The Unicode stability policy is one thing; but more importantly, it has
little bearing on ISO/IEC 10646, for which interoperability with the
Chinese GB18030 standard can't be ignored. Of course there will be
negotiations and lots of arguments for or against such extension proposals
by the Chinese government. But if China ever wants to amend its GB18030
standard, it can do so. This will not be without impact on the computer
industry, if support for the new amendment becomes a requirement
for applications sold in China.

However I think that this does not require changing the complete
interoperability between GB18030 and ISO/IEC 10646. China could simply
define a further standard, which it would see as a natural extension
of GB18030, even if the compatibility mechanisms that allow interoperability
between GB18030 and its successor are not guaranteed to work without
change with 10646. One could define a new mapping between the new standard
and ISO/IEC 10646, but I don't see any compelling argument to do that:
if a new Chinese standard must interoperate with ISO/IEC 10646 and the
previous standard, it will need to map all existing codepoints already
supported in GB18030, which already cover the *whole* codepoint set
of ISO/IEC 10646. Such an extension would then not be at the codepoint
level, but at the level of sequences of codepoints defined in the new
standard. As such sequences are out of scope for ISO/IEC 10646, which
only works at the individual codepoint level, the impact would instead be
on the other standards that use ISO/IEC 10646, notably Unicode.

What this means is that the Unicode stability policy would have to be
amended to allow the interoperability required by the new Chinese
standard, but there would probably be no change in ISO/IEC 10646.
The affected areas would most probably be:
- the Unicode-defined canonical mappings for Han.
- the Unicode-defined compatibility mappings for Han.
- some new (normative?) character properties that Unicode would need
to define and include.
- the combining classes
- general categories (these are not impacted by the Unicode stability
requirement)
- the representative glyphs (Unicode can change them in case of
errors, or in case of ambiguity with new encoded characters, or if
this can confuse the implementation of interoperability with other
important standards like GB18030 or new ISO standards).

Such changes would not affect the conformance needed in the
putative "EU law" discussed here, related to the required
compliance with ISO/IEC 10646. Until now, nobody has shown
a definitive reference to such EU text (Law or Directive?)...
----------------------------------------------------------------------
>From: "Christopher Fynn"
> I suspect that support of at least the combinations in Group A will become
> a requirement for some levels of GB18030 compliance.
>
> There seems to be one defect - the charts I've seen seem to contain a
> pre-composed character equivalent to the combination U+0F68 U+0F7C
> U+0F7E - It appears they've assumed that U+0F00 can be used as the
> equivalent to that string. However in Unicode U+0F00 is *not* equivalent
> to U+0F68 U+0F7C U+0F7E (U+0F00 has no de-composition). I think this
> means that there would be no round-trip compatibility for this
> combination.

Yes, but as such additions will fall within the scope of PUA usage agreements,
simply stating that a text is GB18030-encoded will imply acceptance of
this PUA agreement.
So what is the problem? The fact that when the GB18030 text is
converted to Unicode, the trace of the PUA agreement is lost. This
could be solved by external markup or labelling that explicitly carries the
acceptance of the agreement required by the new GB18030.

With this information, a Unicode-compliant application could tailor its
internal processing of PUA codepoints to match the new GB18030 requirements
(including the ability to detect new equivalences that cannot be derived
from a reference to the Unicode standard alone, i.e. from the text
encoding label).

MIME allows encoding such PUA agreements for text labelling: you can specify
more than the MIME type or the charset, as other attributes can be appended.
For example,
    Content-Type: text/plain; charset=UTF-8; pua-agreement=GB18030
instead of just:
    Content-Type: text/plain; charset=UTF-8
would allow the safe conversion of a GB18030-encoded text carrying this label:
    Content-Type: text/plain; charset=GB18030
Note that the additional "pua-agreement" attribute is my invention; there may be other assigned attribute names in MIME that allow interoperating with the various PUA conventions used in plain text. I don't know whether Unicode has studied a way to standardise such explicit labelling with those in charge of the MIME content-types registry.
But as PUAs must be open to everyone, there's a need to allow specifying multiple usages:
- specifying the PUA convention with a URI (URN or URL)
- specifying the convention as a reference to a known standard (like GB18030 here)
- specifying multiple PUA conventions, using a comma-separated list of PUA convention identifiers (URI or standard name)
- adding some more fields to the attribute value for each convention to set its base encoding offset, if multiple PUA conventions whose PUAs would collide are used simultaneously in the same text

Another example: converting a MacRoman text to Unicode normally requires mapping the Apple logo to a PUA codepoint. However, the PUA assignment is specific to Apple, so this would be something like transforming this:
    Content-Type: text/plain; charset=MacRoman
into that (using a URL):
    Content-Type: text/plain; charset=UTF-8; pua-agreement=domain:www.apple.com
or that (using a reference to the Apple MacRoman charset, and its associated mapping to Unicode):
    Content-Type: text/plain; charset=UTF-8; pua-agreement=MacRoman

Now mixing such text with other text initially coded with GB18030 into a single UTF-8 encoded text would give:
    Content-Type: text/plain; charset=UTF-8; pua-agreement=MacRoman,GB18030
but only if there's no collision between the PUA codepoints used in each referenced standard. If there is such a collision, there should be a way to specify an alternate starting codepoint for the PUA allocated to each convention:
    Content-Type: text/plain; charset=UTF-8; pua-agreement=MacRoman@F800,GB18030@E000
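
To make the labelling idea above concrete, here is a minimal Python sketch of how a receiver might read such a label. It is only an illustration: the "pua-agreement" attribute is not a registered MIME parameter, and the function name parse_pua_agreement and the reading of "@XXXX" as a hexadecimal base codepoint are assumptions made for this example.

```python
def parse_pua_agreement(content_type):
    """Return (charset, [(convention, base_codepoint_or_None), ...]).

    "pua-agreement" is the hypothetical attribute proposed in this message;
    it is not a registered MIME parameter.
    """
    params = {}
    # Skip the media type itself, then read the "name=value" parameters.
    for part in content_type.split(";")[1:]:
        name, _, value = part.partition("=")
        params[name.strip().lower()] = value.strip().strip('"')

    conventions = []
    for item in filter(None, params.get("pua-agreement", "").split(",")):
        name, _, base = item.strip().partition("@")
        # Assumption: the optional "@XXXX" suffix is a hexadecimal base codepoint.
        conventions.append((name, int(base, 16) if base else None))
    return params.get("charset"), conventions


label = "text/plain; charset=UTF-8; pua-agreement=MacRoman@F800,GB18030@E000"
print(parse_pua_agreement(label))
# ('UTF-8', [('MacRoman', 63488), ('GB18030', 57344)])
```

A convention identifier that itself contains "," or "@" (as some URIs do) would need quoting or escaping rules that this sketch does not attempt to define.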

In addition, there's a need to describe more formally which PUA ranges are used in each PUA convention: is there, for example, an XML schema to specify them, which would give the default value of the base PUA codepoint, with the defined PUA ranges expressed as offsets relative to this base codepoint? How can such a schema be retrieved? Are there any attempts to create a repository, or a URN search scheme to locate these definition files?
Who has experimented with this? Are there ongoing standards for interoperability?

----------------------------------------------------------------------
>From: "Kenneth Whistler"
> O.k.
>
> Example 1:
>
> GB 18030-2000 defines a CJK component at FE90 and maps that
> component to U+E854, because that component is not encoded
> in Unicode 3.0 or ISO/IEC 10646-1:2000.

So it gives:
GB18030(FE90) <--> U+E854

> Because such PUA mappings for GB 18030-2000 have proven
> very problematical in implementations, the characters in
> question have been added to 10646 (under ballot currently
> in Amd 1 to ISO/IEC 10646:2003). This particular CJK component
> is to be encoded at U+9FBA.
>
> And this means that GB 18030 / Unicode mapping tables up
> to about March 31, 2005 will contain the mappings:
>
> FE90 <--> U+E854
> 82359133 <--> U+9FBA

Where does this second line magically come from? Has this already been
approved and added to ISO/IEC 10646?

> After that time, they will contain the mappings:
>
> ???? <--> U+E854
> FE90 <--> U+9FBA
> 82359133 <--> ???? (probably U+FFFD)

Here I don't understand: why would the first line be changed this way?
If GB18030 wants to keep maximum compatibility it should be:

  FE90 <--- U+E854
and also:
  82359133 --> U+9FBA
even if their reverse operation will now be:
  FE90 <--> U+9FBA
meaning that the updated GB18030 standard defines new custom
"pseudo-canonical" mappings:
  in GB18030: FE90 <--- 82359133
  in ISO10646: U+9FBA <--- U+E854
For Unicode/ISO/IEC 10646, only the second line above is interesting, but it
only depends on the private convention used by the new GB18030 standard,
where the U+E854 codepoint should be treated equivalently to the new
standardized code point U+9FBA.
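
As a rough illustration of the directional mappings I suggest above (not of what GB18030 actually specifies), the two directions can be kept as separate tables in Python. The names DECODE, ENCODE, decode and encode are invented for this sketch, and GB18030 codes are written as hex strings as in this message.

```python
# Sketch of the suggested mappings, for this single character only.
DECODE = {                 # GB18030 code -> ISO/IEC 10646 codepoint
    "FE90": 0x9FBA,        # FE90 <--> U+9FBA
    "82359133": 0x9FBA,    # 82359133 --> U+9FBA (accepted on input only)
}
ENCODE = {                 # ISO/IEC 10646 codepoint -> GB18030 code
    0x9FBA: "FE90",        # FE90 <--> U+9FBA
    0xE854: "FE90",        # FE90 <--- U+E854 (old PUA, accepted on input only)
}


def decode(gb_code):
    return DECODE[gb_code]


def encode(codepoint):
    return ENCODE[codepoint]


# The implied "pseudo-canonical" mappings fall out of a round trip:
print(encode(decode("82359133")))   # FE90    (in GB18030: FE90 <--- 82359133)
print(hex(decode(encode(0xE854))))  # 0x9fba  (in ISO 10646: U+9FBA <--- U+E854)
```

Keeping the one-way entries out of the opposite table is what makes the old codes readable without ever being generated again.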

In addition, if an application does not change its existing mapping between
GB18030 and ISO/IEC 10646, it will generate GB18030-encoded files with
GB<82359133> sequences. A Chinese application that complies with the
new version of the standard will then know that this GB<82359133> sequence
is an equivalent encoding of the now preferred shorter code GB<FE90>, which
other new GB applications may or may not recognize.

If they don't recognize it (for example fonts built with just GB code
mappings, ignoring Unicode codepoints), they will display a square box (but
the same font is unlikely to contain a glyph mapping for the previous code
GB<82359133>, as such glyph behavior would exist only if the font was made
in reference to the new standard that prefers GB<FE90>).

In another case, a renderer that uses fonts built only with ISO/IEC 10646
glyph mappings (TrueType and similar...) would only map a glyph at the new
codepoint U+9FBA. Support for the new version of GB18030 will include the
mappings:
  FE90 <--- U+E854
  82359133 --> U+9FBA
  FE90 <--> U+9FBA
but only the last two lines are needed to get the appropriate glyph (in
fact this step will not be performed by renderers, which will only use
the U+9FBA codepoint to get to that glyph, but by the updated
GB18030 decoder that maps GB18030 codes to ISO/IEC 10646 codepoints, using
only the last two lines...)

The bad thing is that a new application that loads a GB18030 text and
translates it internally to Unicode will generate new GB18030 text files
containing GB<FE90>, which older GB-enabled applications receiving them
will consider distinct from GB<82359133>, given that they map the two codes
to two distinct Unicode codepoints with no known equivalence.
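
The problem can be traced with a small Python sketch. The old table holds the two mappings quoted above for tables before the change; the new tables are the ones I suggest in this message, not an official mapping, and all names here are invented for the illustration.

```python
# Old application: mapping tables as quoted for the period before the change.
OLD_DECODE = {"FE90": 0xE854, "82359133": 0x9FBA}

# New application: the tables suggested in this message (an assumption).
NEW_DECODE = {"FE90": 0x9FBA, "82359133": 0x9FBA}
NEW_ENCODE = {0x9FBA: "FE90", 0xE854: "FE90"}

original = "82359133"                      # code written by an old application
resaved = NEW_ENCODE[NEW_DECODE[original]]  # same text re-saved by a new one
print(resaved)                              # FE90

# The old application now sees two unrelated codepoints for the same character:
print(hex(OLD_DECODE[original]), hex(OLD_DECODE[resaved]))  # 0x9fba 0xe854
```

Nothing in the old application's tables says that U+E854 and U+9FBA are the same character, so a search or comparison there will fail after the text has passed through the new application.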

All this is a nightmare for application providers, so it seems important to
explain what is mandatory in the Chinese rules: which version of GB18030 is
mandatory in products sold in China? If a new version becomes implicitly
mandatory, applications will be required to indicate correctly
which version of the GB18030 standard they support, even if they already
comply with ISO/IEC 10646 and its successors.

That's really bad news, if the Chinese regulation implies support of all
successors to the GB18030 standard (I really thought that the mapping was
closed as of 2000, but unfortunately this appears not to be the case).

Who's to blame? Only China if it introduces mandatory PUA characters in its
GB18030 charset, and requires that GB18030 applications must support them. I
hope that applications do not need the mandatory support of these PUAs (and
their implied pseudo-equivalences), and instead just need to support the
mapping of standard non-PUA ISO/IEC 10646 codepoints.

If an application just supports non PUA mappings for GB18030, it will remain
compatible with the new standard, even if it fails to recognize the
*temporary* PUA mappings that were introduced too early in GB18030:2000.

The tricky cases you expose above where PUAs can become non-PUA in GB18030
should be ignored. And I hope that China does not expect that applications
magically recognize these conversions (which are based on mappings that are
not defined by Unicode or ISO/IEC 10646 as canonical mappings!) If this is
the case, then GB18030-compliant applications (and their bundled
GB18030<>codepoint mappings) need not be changed, and the GB18030<>codepoint
mapping will appear and remain as already closed!

But if this is not the case, software makers and text producers are in a lot
of trouble, and there will be a need to make applications that only
work with GB18030, without going through any Unicode/ISO/IEC 10646 mapping!
(This will exclude XML, and even HTML or SGML for handling Chinese texts and
data... I doubt that China wants to create substitute standards for HTML,
XML, or SGML, and is ready to accept the huge cost implied by requiring
software providers to support many new parallel standards! This
really goes against the data interchangeability focus of ISO/IEC 10646 and
worldwide open standards!)


This archive was generated by hypermail 2.1.5 : Fri Jan 07 2005 - 11:45:57 CST