Re: Comment on PRI 98: IVD Adobe-Japan1 (pt.2)

From: Eric Muller (
Date: Tue Mar 20 2007 - 23:55:09 CST

    > Comment 2: codepoints in CJK Compatibility Ideographs
    > =====================================================

    > I guess, the avoided
    > codepoints are just "the out of scope" of IVD Adobe-Japan1
    > (in fact, Unicode Technical Report #37 is written for CJK
    > Unified Ideographs, no mention about CJK Compatibility
    > Ideographs), and IVD Adobe-Japan1 does not concern the
    > availability of ideographs at the avoided codepoints.
    The explicit requirement in UTS #37 that the first code point of an IVS
    be a character with the Unified_Ideograph property (as opposed to merely
    "being an ideograph", e.g. in one of the CJK ideograph blocks) is of
    course a deliberate choice. The fact is that the compatibility ideographs
    (those with a canonical decomposition, i.e. not including U+FA0E and its
    eleven friends) are awkward, and barely serve their purpose.
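
    As a quick way to see which characters in the compatibility block fall
    outside that description, here is a minimal Python sketch. It relies on
    the standard unicodedata module, so the result reflects whatever Unicode
    version that Python build ships; the function name is only illustrative.

```python
import unicodedata

def undecomposed_compat_ideographs():
    """Return code points in U+F900..U+FAFF that are named as CJK
    compatibility ideographs but carry no decomposition mapping."""
    found = []
    for cp in range(0xF900, 0xFB00):
        ch = chr(cp)
        name = unicodedata.name(ch, "")  # "" for unassigned code points
        if not name.startswith("CJK COMPATIBILITY IDEOGRAPH"):
            continue
        if not unicodedata.decomposition(ch):
            found.append(cp)
    return found

if __name__ == "__main__":
    for cp in undecomposed_compat_ideographs():
        print(f"U+{cp:04X}")
```

    The list should contain U+FA0E and eleven others, and should not contain
    U+FA30, which does decompose (to U+4FAE).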

    On the one hand, they were introduced in Unicode to facilitate
    round-tripping with other standards. For example, JIS X 0208 + JIS X
    0213 encodes both 41-78 and 1-14-24, and having correspondingly U+4FAE
    and U+FA30 in Unicode means that the distinction established by JIS can
    a priori be preserved when going through Unicode.

    On the other hand, U+4FAE and U+FA30 have been made canonically
    equivalent in Unicode. This is a priori a good choice, because those are
    the same abstract character from Unicode's point of view (imagine they
    are not encoded in JIS nor in Unicode, and you come to the IRG today
    proposing to encode those two characters: you would get only one coded
    character).

    However, the canonical equivalence fundamentally negates the
    round-tripping goal. Or more precisely: you can effectively round-trip
    if and only if normalization is not applied to the Unicode data. With
    today's larger and larger text and document processing systems, the
    likelihood that none of the components will perform normalization is
    getting lower and lower. So the effectiveness of the compatibility
    ideographs is dubious at best.
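
    The effect is easy to observe. The following is a small sketch using
    Python's standard unicodedata module; the helper name is made up for the
    illustration. It checks that every normalization form folds U+FA30 into
    U+4FAE, which is exactly what erases the JIS distinction.

```python
import unicodedata

UNIFIED = "\u4FAE"  # corresponds to JIS 41-78 in the example above
COMPAT = "\uFA30"   # corresponds to JIS 1-14-24 in the example above

def survives_normalization(text):
    """True if text is left unchanged by all four normalization forms."""
    forms = ("NFC", "NFD", "NFKC", "NFKD")
    return all(unicodedata.normalize(form, text) == text for form in forms)

if __name__ == "__main__":
    print(survives_normalization(UNIFIED))                    # unified ideograph is stable
    print(survives_normalization(COMPAT))                     # compatibility ideograph is not
    print(unicodedata.normalize("NFC", COMPAT) == UNIFIED)    # it becomes U+4FAE
```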

    In the IVD world, we can have our cake and eat it too: we can represent
    the difference between 41-78 and 1-14-24 by having two sequences based
    on U+4FAE. Those two sequences are not canonically equivalent so we are
    fine on that front; and the ignorable nature of the variation selectors
    means that we recognize the fundamental equivalence (from a pure Unicode
    point of view) of 41-78 and 1-14-24. Thus there is no need to define
    sequences using the compatibility ideographs, and we avoid the problems
    of normalization.
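
    Both halves of that claim can be checked with a short sketch, again
    using Python's unicodedata. The two sequences use the selectors U+E0100
    and U+E0101 on U+4FAE; the stripping helper is a hypothetical stand-in
    for a process that ignores variation selectors, not an API from any
    library.

```python
import unicodedata

IVS_A = "\u4FAE\U000E0100"  # <4FAE, E0100>
IVS_B = "\u4FAE\U000E0101"  # <4FAE, E0101>

def is_variation_selector(ch):
    """True for VS1..VS16 (U+FE00..U+FE0F) and VS17..VS256 (U+E0100..U+E01EF)."""
    cp = ord(ch)
    return 0xFE00 <= cp <= 0xFE0F or 0xE0100 <= cp <= 0xE01EF

def strip_variation_selectors(text):
    """Model a process that treats variation selectors as ignorable."""
    return "".join(ch for ch in text if not is_variation_selector(ch))

if __name__ == "__main__":
    for form in ("NFC", "NFD", "NFKC", "NFKD"):
        # Normalization leaves each sequence intact, so they stay distinct.
        assert unicodedata.normalize(form, IVS_A) == IVS_A
        assert unicodedata.normalize(form, IVS_B) == IVS_B
    # Ignoring the selectors yields the same base character for both.
    assert strip_variation_selectors(IVS_A) == strip_variation_selectors(IVS_B)
    print("distinct under normalization, identical once selectors are ignored")
```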

    In fact, I would guess that if we had had the variation selectors
    mechanism in place from the start, this mechanism would have been used
    and the compatibility ideographs would not have been encoded.

    > However, if we use IVD Adobe-Japan1 in ToUnicode mapping
    > tables in PDF using Adobe-Japan1 CID font, it can cause
    > a round-trip issue. For example, if I make a PDF from
    > JIS X 0213 text, with Adobe CID font, and insert ToUnicode
    > mapping tables including IVS of IVD Adobe-Japan1,
    > the receiver of PDF file can retrieve JIS X 0208 (and/or
    > 0212) text from the PDF, but cannot retrieve original JIS
    > X 0213 text.
    Start with the sequence of JIS code points <41-78, 1-14-24>. Turn that
    into a PDF using an AJ1 CID font; the PDF contains CIDs 3552 and 13382
    (and no direct trace of the JIS code points). Use the registered
    sequences <4FAE, E0100> and <4FAE, E0101> in the ToUnicode map. If you
    want to go to JIS, turn <4FAE, E0100> into 41-78, and <4FAE, E0101> into
    1-14-24.

    Compare with the current scenario: the ToUnicode map contains <4FAE> and
    <FA30>; any normalization on that reduces both to <4FAE>, and certainly
    you cannot recover your original JIS code points.

    Granted, you need new mappings from Unicode to JIS, but they are immune
    to normalization problems.
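
    To make the comparison concrete, here is a toy simulation of the two
    scenarios in Python. The tables hold only the one pair of characters
    discussed here. The pairing of CID 3552 with <4FAE, E0100> and CID 13382
    with <4FAE, E0101> is assumed from the order in which they are listed
    above, and the function and table names are invented for the sketch; a
    real ToUnicode CMap and a real Unicode-to-JIS table are of course much
    larger.

```python
import unicodedata

# JIS code point -> Adobe-Japan1 CID, for the example pair only.
JIS_TO_CID = {"41-78": 3552, "1-14-24": 13382}

# Two candidate ToUnicode maps (pairing of CIDs to sequences is assumed).
CID_TO_IVS = {3552: "\u4FAE\U000E0100", 13382: "\u4FAE\U000E0101"}
CID_TO_COMPAT = {3552: "\u4FAE", 13382: "\uFA30"}

def round_trip(jis_codes, cid_to_unicode, form="NFC"):
    """JIS -> CID -> Unicode -> normalize -> JIS, one character at a time."""
    # Reverse table a receiver would need to get back to JIS.
    unicode_to_jis = {cid_to_unicode[cid]: jis for jis, cid in JIS_TO_CID.items()}
    cids = [JIS_TO_CID[jis] for jis in jis_codes]        # making the PDF
    extracted = [cid_to_unicode[cid] for cid in cids]    # applying ToUnicode
    normalized = [unicodedata.normalize(form, u) for u in extracted]
    return [unicode_to_jis.get(u) for u in normalized]

if __name__ == "__main__":
    original = ["41-78", "1-14-24"]
    print(round_trip(original, CID_TO_IVS))     # distinction preserved
    print(round_trip(original, CID_TO_COMPAT))  # both collapse to 41-78
```

    With the IVS map the original sequence comes back unchanged; with the
    compatibility-ideograph map, normalization makes both entries come back
    as 41-78.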

