Re: Comment on PRI 98: IVD Adobe-Japan1 (pt.2)

From: mpsuzuki@hiroshima-u.ac.jp
Date: Sun Mar 25 2007 - 19:15:14 CST


    Dear Sir,

    Sorry for the long delay in replying. I would like to
    ask whether UTS #37 could be updated with respect to
    CJK Compatibility Ideographs: an additional note in
    UTS #37 stating that the IVS mechanism has nothing
    to do with CJK Compatibility Ideographs, and
    prohibiting conversion between compatibility
    ideographs and IVS-qualified unified ideographs.
    # This separates <U+6674, U+E0100> from <U+FA12> etc.,
    # although both ideograph variants are displayed
    # by CID+8481. CID+19071 is already used twice,
    # at U+29FCE & U+29FD7, so I guess such a
    # separation is not a fatal change.
    So I have added Hideki Hiura, the other author of
    UTS #37, to the Cc list.

    On Tue, 20 Mar 2007 22:55:09 -0700
    Eric Muller <emuller@adobe.com> wrote:
    > mpsuzuki@hiroshima-u.ac.jp wrote:
    > > Comment 2: codepoints in CJK Compatibility Ideographs
    > > =====================================================
    > > I guess, the avoided
    > > codepoints are just "the out of scope" of IVD Adobe-Japan1
    > > (in fact, Unicode Technical Report #37 is written for CJK
    > > Unified Ideographs, no mention about CJK Compatibility
    > > Ideographs), and IVD Adobe-Japan1 does not concern the
    > > availability of ideographs at the avoided codepoints.
     
    > On the other hand, U+4FAE and U+FA30 have been made canonically
    > equivalent in Unicode. This is a priori a good choice, because those are
    > the same abstract character from Unicode's point of view (imagine they
    > are not encoded in JIS nor in Unicode, and you come to the IRG today
    > proposing to encode those two characters: you would get only one coded
    > character).

    I see; your reply is similar to what I expected while
    I was writing my comments. Your point is RIGHT
    in principle. There is a group of CJK Compatibility
    Ideographs that differ from the corresponding CJK
    Unified Ideographs only by small differences in glyph
    shape. The IBM kanji and the JIS X 0213 compatibility
    kanji are such cases. Although it is arguable whether
    Han Unification (ISO 10646 Annex S) is a well-defined
    rule, both ISO 10646 and Unicode seem to be against
    adding technical standards that make use of
    compatibility ideographs.

    > In fact, I would guess that if we had had the variation selectors
    > mechanism in place from the start, this mechanism would have been used
    > and the compatibility ideographs would not have been encoded.

    I AGREE. If the VS mechanism had existed from the start,
    Han Unification would have been more systematic, and the
    exceptional characters for source code separation could
    have been eliminated.

    > However, the canonical equivalence fundamentally negates the
    > round-tripping goal. Or more precisely: you can effectively round-trip
    > if and only if normalization is not applied to the Unicode data. With
    > today's larger and larger text and document processing systems, the
    > likelihood that none of the components will perform normalization is
    > getting lower and lower. So the effectiveness of the compatibility
    > ideographs is dubious at best.

    I'm sure that Adobe staff are far more familiar with this
    than I am, but please let me write in detail to explain
    my interest.

    One of the reasons why I am sticking to CJK Compatibility
    Ideographs is the clear statement of supported charset
    coverage.

    The following are clear statements:
    * only JIS X 0208-19xx kanji are supported
    * only JIS X 0208-19xx + JIS X 0212-1990 kanji are supported
    * Microsoft codepage 932 kanji are supported (slightly unclear?)

    The following are NOT clear statements:
    * Microsoft codepage 932 kanji are supported
        except for CJK Compatibility Ideographs
    * JIS X 0213:20xx kanji are supported
        except for CJK Compatibility Ideographs
    * JIS X 0213:20xx kanji are supported
        except for CJK Unified Ideographs Extension B

    There is no 7- or 8-bit encoding method for JIS X 0213
    that is interoperable with IBM or Microsoft codepage 932,
    and there are no popular legacy encodings for JIS X 0213
    (even if we restrict the scope to JIS X 0213 level 3)
    that are widely used for information interchange in Japan.
    The most popular encoding for interchanging the JIS X 0213
    charset would be Unicode (including CJK Compatibility
    Ideographs). So I think seamless handling of CJK
    Compatibility Ideographs is important for supporting
    JIS X 0213.

    If we cannot guarantee round-trip conversion of the
    CJK Compatibility Ideographs, whose Unicode representations
    differ between IVS-unaware and IVS-aware systems,
    we have to state the supported coverage of software
    as "JIS X 0213 without CJK Compatibility Ideographs".
    That is not a clear statement.
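
    # As a rough illustration of the round-trip problem, the
    # following Python sketch uses the standard unicodedata
    # module to show that every normalization form folds
    # U+FA30 and U+FA12 into their unified ideographs, while
    # the IVS sequence <U+6674, U+E0100> passes through
    # unchanged. The helper name survives_normalization is
    # only an illustrative choice, not part of any standard.

```python
import unicodedata

# Compatibility ideographs mentioned in this thread, mapped to the
# unified ideographs they are canonically equivalent to.
PAIRS = {"\uFA30": "\u4FAE", "\uFA12": "\u6674"}


def survives_normalization(text: str, form: str = "NFC") -> bool:
    """Return True if `text` is left unchanged by the given normalization form."""
    return unicodedata.normalize(form, text) == text


for compat, unified in PAIRS.items():
    for form in ("NFC", "NFD", "NFKC", "NFKD"):
        # The compatibility ideograph is replaced by the unified one,
        # so the original code point cannot be recovered afterwards.
        assert unicodedata.normalize(form, compat) == unified

# A unified ideograph followed by a variation selector is not altered.
ivs_sequence = "\u6674\U000E0100"
print(survives_normalization(ivs_sequence))  # IVS-qualified unified ideograph
print(survives_normalization("\uFA12"))      # bare compatibility ideograph
```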

    In my previous post I mentioned NFD: normalization
    to the JIS X 0208 + 0212 coverage, which may clarify
    the coverage of supported codepoints. But I have checked
    the list of characters that would have to be normalized,
    and I have reconsidered: such normalization would be
    hard work.
    # The JIS X 0213 compatibility ideographs, 81 kanji, are
    # only a small part of the new kanji in JIS X 0213.
    # The largest group is 396 kanji in CJK Unified Ideographs,
    # the second largest is 303 kanji in CJK Unified Ideographs Ext. B,
    # and the last is 80 kanji in CJK Unified Ideographs Ext. A.
    # Normalizing so many CJK Unified Ideographs may be a
    # high-handed approach, and the normalization rule may be
    # quite ad hoc and unintuitive.
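
    # To see which of these groups a given text actually
    # uses, a small Python sketch like the following could
    # count ideographs per Unicode block. The block ranges
    # are those of the Unicode code charts; the names
    # han_block and coverage_by_block are hypothetical
    # helpers introduced only for this illustration.

```python
from collections import Counter

# (block name, first code point, last code point)
HAN_BLOCKS = [
    ("CJK Unified Ideographs Extension A", 0x3400, 0x4DBF),
    ("CJK Unified Ideographs", 0x4E00, 0x9FFF),
    ("CJK Compatibility Ideographs", 0xF900, 0xFAFF),
    ("CJK Unified Ideographs Extension B", 0x20000, 0x2A6DF),
    ("CJK Compatibility Ideographs Supplement", 0x2F800, 0x2FA1F),
]


def han_block(ch):
    """Return the name of the Han block containing `ch`, or None."""
    cp = ord(ch)
    for name, first, last in HAN_BLOCKS:
        if first <= cp <= last:
            return name
    return None


def coverage_by_block(text):
    """Count the characters of `text` that fall in each Han block."""
    return Counter(b for b in map(han_block, text) if b is not None)


# Characters mentioned in this thread: U+6674, U+FA12, U+4FAE, U+29FCE.
sample = "\u6674\uFA12\u4FAE\U00029FCE"
for block, count in coverage_by_block(sample).items():
    print(f"{count:3d}  {block}")
```

    # Note that this classifies by block only. Block
    # membership alone does not say whether a character
    # has a canonical decomposition.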

    Another fix might be to separate the CIDs for CJK Unified
    Ideographs and CJK Compatibility Ideographs, even if their
    forms are exactly the same. But that would cause another
    issue: some kanji CIDs of Adobe-Japan1-6 would be
    unavailable in IVS. That is another case of unclear
    glyph set coverage.

    As neither fix is realistic, I wish UTS #37 were updated
    with an additional note prohibiting (not deprecating)
    codepoint conversion from CJK Compatibility Ideographs to
    CJK Unified Ideographs with IVS. What do you think?
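
    # For illustration only: one way a component could avoid
    # converting compatibility ideographs is to normalize
    # everything around them and copy them through untouched.
    # The Python sketch below is an assumption of mine, not
    # something UTS #37 defines, and its output is no longer
    # in a standard normalization form. Both function names
    # are hypothetical.

```python
import unicodedata


def is_compat_ideograph(ch):
    """True for characters in the compatibility ideograph blocks that
    have a canonical decomposition."""
    cp = ord(ch)
    in_block = 0xF900 <= cp <= 0xFAFF or 0x2F800 <= cp <= 0x2FA1F
    # The U+F900..U+FAFF block also contains a few unified ideographs
    # (for example U+FA0E) with no decomposition; those are not affected
    # by normalization and need no protection.
    return in_block and unicodedata.decomposition(ch) != ""


def normalize_keeping_compat(text, form="NFC"):
    """Normalize `text`, but pass compatibility ideographs through as-is."""
    out, run = [], []
    for ch in text:
        if is_compat_ideograph(ch):
            # Flush and normalize the run collected so far, then keep
            # the compatibility ideograph unchanged.
            out.append(unicodedata.normalize(form, "".join(run)))
            run = []
            out.append(ch)
        else:
            run.append(ch)
    out.append(unicodedata.normalize(form, "".join(run)))
    return "".join(out)


# "e" + combining acute is still composed; U+FA30 is left alone.
result = normalize_keeping_compat("e\u0301\uFA30")
print([f"U+{ord(c):04X}" for c in result])
```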

    Regards,
    mpsuzuki


