CJK Compatibility Gotchas (was: Re: Application that displays CJK text in Normalization Form D)

From: Kenneth Whistler (kenw@sybase.com)
Date: Mon Nov 15 2010 - 19:43:54 CST


    Asmus replied:

    > On 11/15/2010 2:24 PM, Kenneth Whistler wrote:
    > >> FA47 is a "compatibility character", and would have a
    > >> compatibility mapping.
    > > Faulty syllogism.
    >
    > Formally correct answer but only because of something of a design flaw
    > in Unicode. When the type of mapping was decided on, people didn't fully
    > expect that NFC might become widely used/enforced, making these
    > distinctions appear wherever text is normalized in a distributed
    > architecture.

    O.k., I'm gonna have to intervene again. *hehe* Yes, there is
    a design flaw here, but Asmus' explanation is also somewhat
    faulty, because it flattens out the history in a way that is
    liable to be misunderstood.

    There is a *reason* why "when the type of mapping was decided on"
    that "people didn't fully expect that NFC might become
    widely used/enforced" -- but it wasn't that they were goofing
    up in understanding the implications of normalization. Rather,
    at that point in Unicode history NFC didn't *exist* yet, nor
    had the normalization algorithm been designed.

    Here, for the benefit of the standards geeks out there, are the
    relevant highlights of the historical timeline involved.

    June, 1992.

      The canonical mappings for the CJK Compatibility characters
      were *printed* (with off-by-one errors for some of them!) in
      Unicode 1.0, volume 2 (= Unicode 1.0.1).
      
      Actually, at the time, we didn't know they were "canonical"
      mappings, because that concept hadn't formally been invented
      yet, but the intention was clear. They were the mappings
      from the "CJK compatibility ideographs" to the "real" unified
      Han ideographs in the standard. The CJK compatibility characters
      were all considered to be duplicates in the source standards
      that didn't follow the unification rules.
      
    July, 1996.

      The formal definitions of "canonical decomposition" and
      "compatibility decomposition" were first published in
      Unicode 2.0. There wasn't a data file for the CJK Compatibility
      Ideographs block, but the canonical mappings were *printed*
      (correctly, this time) on pp. 7-470 to 7-472 of the standard.
      
    August 4, 1998.

      The first published version of UnicodeData.txt that contained
      the canonical mappings for the CJK Compatibility Ideographs
      was UnicodeData-2.1.5.txt for Unicode 2.1.5. (Actually,
      they got into UnicodeData-2.1.4.txt on July 9, 1998, but that
      wasn't a published version of the data file.)
      
    July 23, 1999.

      This was the publication date of the first approved version
      of UAX #15 (Revision 15), and so it is the first published definition
      of NFC. (Of course UAX #15 had been in draft for some time earlier
      than that, so the term "NFC" can be traced back in the drafts
      to mid-1998.)
      
    September, 1999.

      Release of Unicode 3.0 -- the first release of Unicode formally
      tied to the Unicode Normalization Algorithm. (The revision
      of UAX #15 for the release was actually Revision 18, dated
      November 11, 1999.)
      
    March 23, 2001.

      UAX #15, Version 3.1.0. This was the version of the Unicode
      Normalization Algorithm that specified the composition version
      to be Version 3.1.0 and locked down normalization
      forever more.
      
    So essentially, there was a 9-year period between when the
    first mappings were defined for the CJK Compatibility Ideographs
    and the date beyond which it became impossible to reinterpret
    or change a canonical mapping because of the lockdown of
    normalization.

    The problems that normalization causes for CJK Compatibility
    Ideographs only started to become visible to people *after*
    the lockdown, when Unicode normalization started to become
    a regular feature of actual processing.

    And it wasn't because "people didn't fully expect that NFC might
    become widely used/enforced" -- or at least not the people in
    the UTC. The UAX #15 text published with Unicode 3.0 already
    stated: "The W3C Character Model for the World Wide Web requires
    the use of Normalization Form C for XML and related standards..."

    And it wasn't because of some oversight about the canonical
    mappings involving the CJK Compatibility Ideographs per se.
    That same UAX #15 for Unicode 3.0 also stated: "With *all*
    normalization forms singleton characters (those with singleton
    canonical mappings) are replaced." So the ground facts for
    the FA10 --> (NFC/NFD/NFKC/NFKD) 585C normalization pattern
    were well-established and explicitly stated in 1999.
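
    For anyone who wants to see that concretely, here is a minimal
    sketch using Python's unicodedata module (any conformant
    normalization implementation will give the same answer):

      import unicodedata

      # Singleton canonical mappings are applied by *all* four
      # normalization forms, so U+FA10 folds to U+585C every time.
      for form in ("NFC", "NFD", "NFKC", "NFKD"):
          result = unicodedata.normalize(form, "\uFA10")
          print(form, hex(ord(result)))   # --> 0x585c in every case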

    > > FA47 is a CJK Compatibility character, which means it was encoded
    > > for compatibility purposes -- in this case to cover the round-trip
    > > mapping needed for JIS X 0213.
    > >
    > > However, it has a *canonical* decomposition mapping to U+6F22.
    >
    > And that, of course, destroys the desired "round-trip" behavior if it is
    > inadvertently applied while the data are encoded in Unicode. Hence the
    > need to recreate a solution to the issue of variant forms with a
    > different mechanism, the ideographic variation sequence (and
    > corresponding database).

    Yes, that is basically correct. But this architectural "design flaw"
    actually results from two additional requirements that accrued
    to the Unicode Standard well after its initial design:

    1. The requirement to be able to carry "round-trip" behavior
       through distributed environments.
       
    In the original design, the notion of how one would deal with
    legacy data was conceived of primarily as a controlled and
    contained conversion issue. An application/system would convert
    legacy data to Unicode, and if it needed to convert back, it
    could use compatibility characters for round-trip conversion.
    The system would know how and when it could normalize, because
    it controlled the data and the conversion. (A sketch of what goes
    wrong when that control is lost follows after point 2 below.)

    2. The requirement to be able to maintain CJK variant glyph
       distinctions in plain text data.
       
    Again, that was not at all a part of the original Unicode
    Standard design.
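
    To make requirement 1 concrete, here is a minimal sketch of how an
    intervening normalization destroys a legacy round trip. It assumes
    that Python's shift_jis_2004 codec carries the JIS X 0213 mapping
    for U+FA47; the point itself is codec-independent:

      import unicodedata

      # A JIS X 0213 variant kanji, encoded in Unicode as a CJK
      # Compatibility Ideograph precisely to permit round-tripping
      # (assuming the codec covers this mapping).
      variant = "\uFA47"
      legacy_bytes = variant.encode("shift_jis_2004")

      # If any system along the way normalizes the text, the singleton
      # canonical mapping folds U+FA47 into U+6F22 ...
      normalized = unicodedata.normalize("NFC", variant)

      # ... and converting back yields the bytes for the unified
      # ideograph instead: the original byte sequence is unrecoverable.
      print(legacy_bytes == normalized.encode("shift_jis_2004"))   # False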

    So the essential nature of the problem is that these new requirements
    have mostly accrued to Unicode implementations *after* 2001,
    more or less at the point when the lockdown of Unicode normalization
    made it impossible for normalization to be adjusted in any way
    to account for them.

    Hence the need to construct an *alternative* approach involving
    variation selectors, which would be robust and invariant under
    normalization transformations.
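
    A quick sketch of that invariance, again with Python's unicodedata
    module (whether this particular base-plus-selector sequence is
    registered in the IVD is a separate question; the normalization
    behavior is the same either way):

      import unicodedata

      # Variation selectors have no decompositions and never recombine,
      # so a base ideograph + VS sequence passes through all four
      # normalization forms untouched.
      ivs = "\u6F22\U000E0100"   # U+6F22 + VARIATION SELECTOR-17
      for form in ("NFC", "NFD", "NFKC", "NFKD"):
          assert unicodedata.normalize(form, ivs) == ivs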
     
    > > The behavior in BabelPad is correct: U+6F22 is the NFC form of U+FA47.
    > >
    > > Easily verified, for example, by checking the FA47 entry in
    > > NormalizationTest.txt in the UCD.
    >
    > While correct, it's something that remains a bit of a gotcha.

    Yeah, well, the basic gotcha is that no matter how many times
    I say it or what the Unicode Standard says, people will continue
    to just assume "compatibility character" implies "compatibility
    decomposition". For everybody on the list, I recommend
    frequent re-reading of Section 2.3, Compatibility Characters,
    of the standard:

    http://www.unicode.org/versions/Unicode5.2.0/ch02.pdf

    whenever somebody mentions "compatibility" in discussion
    of Unicode. Yes, I suspect that people will find their
    heads hurting -- but this subject *is* complex, and generalizations
    that people make about "compatibility characters" are often
    wrong when they don't pay attention to the details.
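
    For what it's worth, the distinction is easy to check mechanically.
    In Python's unicodedata module, for instance, decomposition()
    returns a tag such as "<compat>" for a compatibility decomposition
    and a bare code point for a canonical one:

      import unicodedata

      # A "compatibility character" with a *canonical* decomposition:
      print(unicodedata.decomposition("\uFA47"))   # 6F22

      # A character with a true *compatibility* decomposition:
      print(unicodedata.decomposition("\u2F00"))   # <compat> 4E00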

    > Especially
    > now that Unicode has charts that go to great length showing the
    > different glyphs for these characters,

    Well, even there the issue is complicated, because there
    are CJK Compatibility Ideographs, and then there are
    CJK Compatibility Ideographs. They fall into at least
    3 important classes:

    1. Ones which really are *unified* ideographs, despite their
       names.
       
    2. Ones which are *pronunciation* variants from KS X 1001,
       and which are *not* intended to show different glyphs.
       
    3. Ones which are *graphical* variants from other legacy
       standards, and which *are* intended to show different
       glyphs.
       
    And even class 3 has subtypes, because some show variants
    that are distinguished only in one legacy standard, whereas
    some are themselves cross-mapped between more than one
    legacy standard -- putatively because each legacy standard
    shows the same variant glyph.

    It is class 3 that may be adversely affected *visually* by the
    application of normalization in a distributed environment.
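
    Class 1, by contrast, is untouched by normalization, and is easy to
    pick out mechanically, because those characters have no
    decomposition at all. A sketch with Python's unicodedata module:

      import unicodedata

      # Assigned characters in the CJK Compatibility Ideographs block
      # (U+F900..U+FAFF) with no decomposition mapping -- these are the
      # twelve that are really unified ideographs.
      unified = [cp for cp in range(0xF900, 0xFB00)
                 if unicodedata.category(chr(cp)) == "Lo"
                 and not unicodedata.decomposition(chr(cp))]
      print([hex(cp) for cp in unified])   # 0xfa0e ... 0xfa29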

    > I would suggest adding a note to
    > the charts that make clear that these distinctions are *removed* anytime
    > the text is normalized, which, in a distributed architecture may happen
    > anytime.

    The CJK Compatibility Ideographs already have warnings attached
    to them in the standard. They are repeatedly documented as "only
    for round-trip compatibility with XYZ" and "They should not
    be used for any other purpose."

    However, I think your point is a valid one. Now that the
    clear answer for maintaining legacy CJK glyph variant
    distinctions in a distributed environment is via ideographic
    variation sequences as registered in the IVD, it would make
    sense to beef up the CJK Compatibility Ideograph documentation
    with better pointers (and with accompanying rationale text)
    to UTS #37 and the IVD, and to post stronger warning labels
    in the code charts for CJK Compatibility Ideographs.

    Perhaps someone would like to make a detailed proposal to
    the UTC for how to fix the text and charts? ;-)

    --Ken


