GB18030 and Unicode/ISO/IEC 10646 mappings; the case of PUAs

From: Philippe VERDY (verdy_p@wanadoo.fr)
Date: Fri Jan 07 2005 - 11:31:57 CST


    >From: "Peter Kirk"
    > It is surely possible for Unicode, within its stability policy, to add new
    > precomposed characters with canonical decompositions if these are also
    > defined as composition exceptions, alongside existing Hebrew, Arabic etc
    > presentation forms which are composition exceptions. The UTC might be
    > reluctant to do so, but if it comes under strong pressure to do this from
    > the Chinese standards body through WG2, I see no compelling reason why the
    > UTC should refuse this. There is nothing in the stability policy to force
    > it to refuse new presentation forms of this kind.
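
    The existing behaviour that the quoted text relies on can be observed with
    Python's standard unicodedata module. This is only a small sketch: it takes
    U+FB1D (HEBREW LETTER YOD WITH HIRIQ), one of the Hebrew presentation forms
    that has a canonical decomposition but is excluded from composition, and
    shows that normalization decomposes it and never recomposes it.

```python
import unicodedata

# U+FB1D has a canonical decomposition (U+05D9 U+05B4) but is a
# composition exception, so NFC does not recompose the sequence.
precomposed = "\uFB1D"
decomposed = "\u05D9\u05B4"  # YOD followed by HIRIQ

nfd = unicodedata.normalize("NFD", precomposed)
nfc = unicodedata.normalize("NFC", precomposed)

print(nfd == decomposed)  # the precomposed form decomposes under NFD
print(nfc == decomposed)  # and stays decomposed under NFC
```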

    The Unicode stability policy is one thing; but more importantly, it hardly
    concerns ISO/IEC 10646, for which interoperability with the Chinese
    GB18030 standard can't be ignored. Of course there will be negotiations
    and lots of arguments for or against such extension proposals by the
    Chinese government. But if China ever wants to amend its GB18030
    standard, it can do so. This will not be without impact on the computer
    industry if the new amendment must become part of the support required
    for applications sold in China.

    However, I think that this does not require changing the complete
    interoperability between GB18030 and ISO/IEC 10646. China could simply
    define a further standard, which it would see as a natural extension
    of GB18030, even if the compatibility mechanisms that allow interoperability
    between GB18030 and its successor are not guaranteed to work without
    change with 10646. One could define a new mapping between the new standard
    and ISO/IEC 10646, but I don't see any compelling argument to do that:
    if a new Chinese standard must interoperate with ISO/IEC 10646 and the
    previous standard, it will need to map all existing codepoints already
    supported in GB18030, which already cover the *whole* codepoint set
    of ISO/IEC 10646. Such an extension would then not be at the codepoint
    level, but at the level of sequences of codepoints defined in the new
    standard. As such sequences are out of scope for ISO/IEC 10646, which
    only works at the individual codepoint level, the impact would instead
    be on the other standards that use ISO/IEC 10646, notably Unicode.

    What this means is that the Unicode stability policy would have to be
    amended to allow the interoperability required by the new Chinese
    standard, but there would probably be no change in ISO/IEC 10646.
    The affected areas would most probably be:
    - the Unicode-defined canonical mappings for Han;
    - the Unicode-defined compatibility mappings for Han;
    - some new (normative?) character properties that Unicode would need
      to define and include;
    - the combining classes;
    - general categories (these are not impacted by the Unicode stability
      requirement);
    - the representative glyphs (Unicode can change them in case of
      errors, or in case of ambiguity with new encoded characters, or if
      this can confuse the implementation of interoperability with other
      important standards like GB18030 or new ISO standards).

    Such changes would not affect the conformance needed in the
    putative "EU law" discussed here, related to the required
    compliance with ISO/IEC 10646. Until now, nobody has shown
    a definitive reference to such EU text (Law or Directive?)...
    ----------------------------------------------------------------------
    >From: "Christopher Fynn"
    > I suspect that support of at least the combinations in Group A will become
    > a requirement for some levels of GB18030 compliance.
    >
    > There seems to be one defect - the charts I've seen seem to contain a
    > pre-composed character equivalent to the combination U+0F68 U+0F7C
    > U+0F7E - It appears they've assumed that U+0F00 can be used as the
    > equivalent to that string. However in Unicode U+0F00 is *not* equivalent
    > to U+0F68 U+0F7C U+0F7E (U+0F00 has no de-composition). I think this
    > means that there would be no round-trip compatibility for this
    > combination.

    Yes, but as such additions fall within the scope of PUA usage agreements,
    the simple fact of saying that a text is GB18030-encoded implies the
    acceptance of this PUA agreement.
    So what is the problem? The fact that when the GB18030 text is
    converted to Unicode, the trace of the PUA agreement is lost. This
    could be solved by external markup or labelling that explicitly carries
    the acceptance of the agreement required by the new GB18030.

    With this information, a Unicode-compliant application could tailor its
    internal processing of PUAs to match the new GB18030 requirements (including
    the possibility of detecting new equivalences that cannot be derived from a
    reference to the Unicode standard alone, i.e. from the text encoding label
    alone).
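
    To make this concrete, here is a minimal sketch of such tailoring, using the
    Tibetan case from the quoted message. The first function only asks Unicode
    (through Python's unicodedata) whether two strings are canonically
    equivalent. The second adds an extra folding table selected by an agreement
    label. The table TAILORED_FOLDS and the function names are hypothetical; the
    single entry merely encodes the equivalence that the quoted message suspects
    the charts assume, and it is not taken from any published GB18030 table.

```python
import unicodedata

OM = "\u0F00"                 # TIBETAN SYLLABLE OM, no decomposition in Unicode
SEQ = "\u0F68\u0F7C\u0F7E"    # the three-codepoint combination from the quote

def unicode_equivalent(a, b):
    """Canonical equivalence as defined by Unicode alone."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

# Hypothetical extra equivalences, keyed by the agreement named in the label.
TAILORED_FOLDS = {"GB18030": {SEQ: OM}}

def tailored_equivalent(a, b, agreement=None):
    """Apply the agreement's extra folds first, then compare as usual."""
    for source, target in TAILORED_FOLDS.get(agreement, {}).items():
        a = a.replace(source, target)
        b = b.replace(source, target)
    return unicode_equivalent(a, b)

print(unicode_equivalent(OM, SEQ))              # Unicode alone
print(tailored_equivalent(OM, SEQ, "GB18030"))  # with the labelled agreement
```

    Without the label the application has no way of knowing that the extra fold
    applies, which is the loss of information described above.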

    MIME allows encoding such PUA agreements for text labelling: you can specify
    more than the MIME type or the charset, since other attributes can be appended.
    For example,
        Content-Type: text/plain; charset=UTF-8; pua-agreement=GB18030
    instead of just:
        Content-Type: text/plain; charset=UTF-8
    would allow converting safely a GB18030-encoded text with this label:
        Content-Type: text/plain; charset=GB18030
    Note that the additional "pua-agreement" attribute is my invention; there may exist some other assigned attribute names in MIME to allow interoperating with various PUA conventions used in plain-text. I don't know if Unicode has studied a way to standardise such explicit labelling with those in charge of the MIME content-types registry.
    But as PUAs must be open to everyone, there's a need to allow specifying multiple usages:
    - specifying the PUA convention with a URI (URN or URL)
    - specifying the convention as a reference to a known standard (like here GB18030)
    - specifying multiple PUA conventions, using a comma-separated list of PUA convention identifiers (URI or standard name)
    - adding some more fields to the attribute value for each convention to set their base encoding offset, if multiple PUA conventions are used simultaneously in the same text, where those PUAs would collide
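
    As a rough sketch of how a receiver could read such a label, the following
    uses Python's standard email.message.Message to extract the parameter and
    split it into a list of convention identifiers. The "pua-agreement"
    parameter is still only my invention, and the function name is made up for
    this example. One detail to keep in mind: under the MIME syntax (RFC 2045) a
    parameter value that contains a comma has to be written as a quoted string,
    so the multi-valued example below quotes the list.

```python
from email.message import Message

def pua_agreements(content_type):
    """Return the list of PUA conventions named in a Content-Type value.

    'pua-agreement' is a hypothetical, unregistered parameter name.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    raw = msg.get_param("pua-agreement")  # None when the parameter is absent
    if raw is None:
        return []
    return [item.strip() for item in raw.split(",") if item.strip()]

print(pua_agreements("text/plain; charset=UTF-8; pua-agreement=GB18030"))
print(pua_agreements('text/plain; charset=UTF-8; pua-agreement="MacRoman,GB18030"'))
print(pua_agreements("text/plain; charset=UTF-8"))
```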

    Another example: converting a MacRoman text to Unicode normally requires mapping the Apple logo into a PUA codepoint. However, the PUA assignment is specific to Apple, so this would be something like transforming this:
        Content-Type: text/plain; charset=MacRoman
    into this (using a URL):
        Content-Type: text/plain; charset=UTF-8; pua-agreement=domain:www.apple.com
    or that (using a reference to the Apple MacRoman charset, and its associated mapping to Unicode):
        Content-Type: text/plain; charset=UTF-8; pua-agreement=MacRoman

    Now mixing such text with other text initially coded with GB18030 into a single UTF-8 encoded text would be:
        Content-Type: text/plain; charset=UTF-8; pua-agreement=MacRoman,GB18030
    only if there's no collision in the PUA codepoints used in each referenced standard. If there's such a collision, there should exist a way to specify an alternate starting codepoint for the PUA allocated in each convention:
        Content-Type: text/plain; charset=UTF-8; pua-agreement=MacRoman@F800,GB18030@E000
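
    The "name@base" form could be interpreted along the following lines. This is
    a sketch under stated assumptions: the DEFAULT_BASES values are placeholders
    chosen for the illustration (no registry of such defaults exists that I know
    of), and the function names are invented. The only real codepoints used are
    U+F8FF, which Apple's mapping uses for the Apple logo, and U+E854 from the
    GB18030 example discussed in this thread.

```python
# Assumed default base codepoint of each convention's PUA block (hypothetical).
DEFAULT_BASES = {"MacRoman": 0xF800, "GB18030": 0xE000}

def parse_agreements(value):
    """Parse 'MacRoman@F800,GB18030@E000' into {name: base codepoint}."""
    bases = {}
    for item in value.split(","):
        name, sep, base_hex = item.strip().partition("@")
        # Without an explicit '@base', fall back to the convention's default.
        bases[name] = int(base_hex, 16) if sep else DEFAULT_BASES[name]
    return bases

def relocate(name, native_cp, bases):
    """Move a convention's native PUA codepoint to the base declared in the label."""
    return native_cp - DEFAULT_BASES[name] + bases[name]

bases = parse_agreements("MacRoman@E900,GB18030")
print(hex(relocate("MacRoman", 0xF8FF, bases)))  # Apple logo, moved to the new base
print(hex(relocate("GB18030", 0xE854, bases)))   # unchanged, default base kept
```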

    In addition, there's a need to describe more formally which PUA ranges are used in each PUA convention. Is there, for example, an XML schema for this, one that would specify the default value of the base PUA codepoint and define the PUA ranges as offsets relative to this base codepoint? How can such a schema be retrieved? Are there any attempts to create a repository, or a URN search scheme, to locate these definition files?
    Who has experimented with this? Are there ongoing standards for interoperability?
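
    Whatever the file format would be, the information such a description has to
    carry is small. The sketch below models it as a default base plus ranges
    expressed as offsets from that base, and checks two conventions for
    collisions. The class, the function and both example conventions are
    entirely hypothetical; they do not describe any real standard.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PuaConvention:
    name: str
    default_base: int
    relative_ranges: tuple  # pairs of (first_offset, last_offset), inclusive

    def absolute_ranges(self, base=None):
        """Resolve the relative ranges against the default or a declared base."""
        start = self.default_base if base is None else base
        return [(start + lo, start + hi) for lo, hi in self.relative_ranges]

def collide(ranges_a, ranges_b):
    """True when any inclusive range in one list overlaps one in the other."""
    return any(a_lo <= b_hi and b_lo <= a_hi
               for a_lo, a_hi in ranges_a
               for b_lo, b_hi in ranges_b)

conv_a = PuaConvention("example-a", 0xE000, ((0x000, 0x0FF),))
conv_b = PuaConvention("example-b", 0xE000, ((0x080, 0x0BF),))

# At their default bases the two conventions collide; rebasing one avoids it.
print(collide(conv_a.absolute_ranges(), conv_b.absolute_ranges()))
print(collide(conv_a.absolute_ranges(), conv_b.absolute_ranges(base=0xF000)))
```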

    ----------------------------------------------------------------------
    >From: "Kenneth Whistler"
    > O.k.
    >
    > Example 1:
    >
    > GB 18030-2000 defines a CJK component at FE90 and maps that
    > component to U+E854, because that component is not encoded
    > in Unicode 3.0 or ISO/IEC 10646-1:2000.

    So it gives:
    GB18030(FE90) <--> U+E854

    > Because such PUA mappings for GB 18030-2000 have proven
    > very problematical in implementations, the characters in
    > question have been added to 10646 (under ballot currently
    > in Amd 1 to ISO/IEC 10646:2003). This particular CJK component
    > is to be encoded at U+9FBA.
    >
    > And this means that GB 18030 / Unicode mapping tables up
    > to about March 31, 2005 will contain the mappings:
    >
    > FE90 <--> U+E854
    > 82359133 <--> U+9FBA

    Where does this second line magically come from? Has this already been
    approved and added to ISO/IEC 10646?

    > After that time, they will contain the mappings:
    >
    > ???? <--> U+E854
    > FE90 <--> U+9FBA
    > 82359133 <--> ???? (probably U+FFFD)

    Here I don't understand: why would the first line be changed this way?
    If GB18030 wants to keep maximum compatibility, it should be:

      FE90 <--- U+E854
    and also:
      82359133 --> U+9FBA
    even if their reverse operation will now be:
      FE90 <--> U+9FBA
    meaning that the updated GB18030 standard defines new custom
    "pseudo-canonical" mappings:
      in GB18030: FE90 <--- 82359133
      in ISO10646: U+9FBA <--- U+E854
    For Unicode/ISO/IEC 10646, only the second line above is interesting, but it
    only depends on the private convention used by the new GB18030 standard,
    where the U+E854 codepoint should be treated equivalently to the new
    standardized code point U+9FBA.
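
    The asymmetric mapping I am suggesting can be written down as two separate
    tables, one per direction. This is a minimal sketch with only the codes from
    the example; the names are invented, and Python's built-in gb18030 codec is
    deliberately not used because its behaviour for these codes depends on the
    table version it ships with.

```python
# Byte sequences taken from the example above.
GB_SHORT = bytes.fromhex("FE90")
GB_LONG = bytes.fromhex("82359133")

# Decoding: both GB codes lead to the newly standardized codepoint.
DECODE = {GB_SHORT: "\u9FBA", GB_LONG: "\u9FBA"}

# Encoding: the new codepoint and the old PUA codepoint both give FE90.
ENCODE = {"\u9FBA": GB_SHORT, "\uE854": GB_SHORT}

def decode_unit(gb_bytes):
    return DECODE[gb_bytes]

def encode_char(char):
    return ENCODE[char]

# A file written with the old four-byte code is read as U+9FBA and
# written back with the preferred two-byte code.
print(encode_char(decode_unit(GB_LONG)).hex().upper())
# Text still carrying the old PUA codepoint is also written as FE90.
print(encode_char("\uE854").hex().upper())
```

    The two directions are no longer inverses of each other, which is exactly
    what the one-way arrows above express.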

    In addition, if an application does not change its existing mapping between
    GB18030 and ISO/IEC 10646, it will generate GB18030-encoded files with
    GB<82359133> sequences. A Chinese application that complies with the
    new version of the standard will then know that this GB<82359133> sequence
    is an equivalent encoding of the now preferred shorter code GB<FE90>, which
    other new GB applications may or may not recognize.

    If they don't recognize it (for example, fonts built with just GB code
    mappings, ignoring Unicode codepoints), they will display a square box (but
    the same font is unlikely to contain a glyph mapping for the previous code
    GB<82359133>, as such glyph behavior would exist only if the font was made
    in reference to the new standard that prefers GB<FE90>).

    In another case, a renderer that would use fonts built only for ISO/IEC 10646
    glyph mappings (TrueType and similar...) would only map a glyph at the new
    codepoint U+9FBA. Support for the new version of GB18030 will include the
    mappings:
      FE90 <--- U+E854
      82359133 --> U+9FBA
      FE90 <--> U+9FBA
    but renderers will only need to use the last two lines to get the
    appropriate glyph (in fact this will not be performed in renderers that will
    only use the U+9FBA codepoint to get to that glyph, but by the updated
    GB18030 decoder that maps GB18030 codes to ISO/IEC 10646 codepoints, using
    only the last two lines...)

    The bad thing is that a new application that loads a GB18030 text and
    translates it internally to Unicode will generate new GB18030 text files
    containing GB<FE90>, which older GB-enabled applications will consider
    distinct from GB<82359133>, given that they map these to two
    distinct Unicode codepoints with no known equivalence.
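
    This mismatch is easy to reproduce with the two table versions side by side.
    The sketch below assumes an old application that still uses the mappings
    from the quoted message (FE90 to U+E854, 82359133 to U+9FBA) and a new
    application that writes U+9FBA as FE90; the variable and function names are
    invented for the illustration.

```python
import unicodedata

FE90 = bytes.fromhex("FE90")
LONG = bytes.fromhex("82359133")

OLD_DECODE = {FE90: "\uE854", LONG: "\u9FBA"}  # tables before the change
NEW_ENCODE = {"\u9FBA": FE90}                   # updated encoder

# The new application saves the character with the preferred short code...
written_by_new_app = NEW_ENCODE["\u9FBA"]
# ...and the old application reads it back as the PUA codepoint.
seen_by_old_app = OLD_DECODE[written_by_new_app]
# The same character in an older file reaches it as U+9FBA.
from_old_file = OLD_DECODE[LONG]

def canonically_equal(a, b):
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

# Unicode normalization offers no way to relate the two codepoints.
same = canonically_equal(seen_by_old_app, from_old_file)
print(same)
```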

    All this is a nightmare for application providers, so it seems important to
    explain what is mandatory in the Chinese rules: which version of GB18030 is
    mandatory in products sold in China? If a new version becomes implicitly
    mandatory, applications will be required to correctly indicate
    which version of the GB18030 standard they support, even if they already
    comply with ISO/IEC 10646 and its successors.

    That's really bad news if the Chinese regulation implies support of all
    successors to the GB18030 standard (I really thought that the mapping was
    closed as of 2000, but unfortunately this appears not to be the case).

    Who's to blame? Only China if it introduces mandatory PUA characters in its
    GB18030 charset, and requires that GB18030 applications must support them. I
    hope that applications do not need the mandatory support of these PUAs (and
    their implied pseudo-equivalences), and instead just need to support the
    mapping of standard non-PUA ISO/IEC 10646 codepoints.

    If an application just supports non PUA mappings for GB18030, it will remain
    compatible with the new standard, even if it fails to recognize the
    *temporary* PUA mappings that were introduced too early in GB18030:2000.

    The tricky cases you describe above, where PUAs can become non-PUA in GB18030,
    should be ignored. And I hope that China does not expect applications
    to magically recognize these conversions (which are based on mappings that are
    not defined by Unicode or ISO/IEC 10646 as canonical mappings!). If China
    indeed does not expect this, then GB18030-compliant applications (and their
    bundled GB18030<>codepoint mappings) need not be changed, and the
    GB18030<>codepoint mapping will appear and remain as already closed!

    But if this is not the case, software makers and text producers are in a lot
    of trouble, and there will be a need to make applications that only
    work with GB18030, without going through any Unicode/ISO/IEC 10646 mapping!
    (This will exclude XML, and even HTML or SGML, for handling Chinese texts and
    data... I doubt that China wants to create substitute standards for HTML,
    XML, or SGML, and is ready to accept the huge cost implied by requiring
    software providers to support many new parallel standards! This
    really goes against the data interchangeability focus of ISO/IEC 10646 and
    worldwide open standards!)


