Re: CJK Unified Ideographs Range

From: Kenneth Whistler (kenw@sybase.com)
Date: Wed Feb 19 2003 - 17:37:03 EST


    Andrew asked:

    > I've asked this question before, but I've never had a satisfactory response, so
    > I'll ask it again now that Unicode 4 is due to be released soon.
    >
    > Section 10.1 of the Unicode Standard, as well as Blocks-4.0.0.txt, give the
    > range of the CJK Unified Ideographs block as U+4E00 through U+9FFF, whereas at
    > the top of the CJK Unified Ideographs code chart it clearly states "Range:
    > 4E00-9FAF", and does not show the columns 9FB0-9FBF, 9FC0-9FCF, 9FD0-9FDF,
    > 9FE0-9FEF and 9FF0-9FFF. Is there a reason for this discrepancy ?
    >
    > Given that new CJK unified ideographs are added to supplementary CJK blocks
    > (CJK-A, CJK-B and CJK-C), and I understand that no more characters are intended
    > to be added to the basic CJK block, why then are U+9FB0 through U+9FFF reserved
    > for the CJK Unified Ideographs block ? Surely these eighty code points would be
    > better utilised if freed for use by new scripts.

    The UTC dealt with this issue of block boundaries back in October, 2001,
    in the context of the review of Blocks.txt for Unicode 3.2. There
    is mention of this issue and the changes made in Article VII of
    UAX #28, Unicode 3.2.

    In particular, the inconsistency in block ending range handling for
    CJK Unified Ideographs versus the Hangul and Extension A and Extension B
    blocks was resolved in favor of ending each block on a "round" hex
    boundary, i.e. at XXXF, regardless of whether that was the last character
    in the block or not. The extra "space" of reserved code points in
    the CJK Unified Ideographs block is an artifact of block decisions made
    way back in 1992, well before the BMP looked as tight as it does now.
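
    To make the "round" boundary rule concrete, here is a minimal Python
    sketch. It is only an illustration: the helper names ends_on_xxxf
    and count_inclusive are hypothetical and do not come from any UTC
    document or tool. It checks whether a block's last code point has a
    low hex digit of F, and counts the code points in the range Andrew
    asked about.

```python
def ends_on_xxxf(last_code_point: int) -> bool:
    """Return True if the low hex digit of the code point is F."""
    return last_code_point & 0xF == 0xF


def count_inclusive(first: int, last: int) -> int:
    """Number of code points in the inclusive range first..last."""
    return last - first + 1


# Block end as given in Blocks.txt for CJK Unified Ideographs.
print(ends_on_xxxf(0x9FFF))             # True
# Range end as given in the NamesList.txt header.
print(ends_on_xxxf(0x9FA5))             # False
# The reserved tail Andrew mentions, U+9FB0 through U+9FFF.
print(count_inclusive(0x9FB0, 0x9FFF))  # 80
```

    The count of 80 agrees with the "eighty code points" in Andrew's
    question.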

    In case you are interested, the particular anomaly regarding the end of
    the CJK Unified Ideographs block versus the header printed in the code
    charts is just one of thirteen different types of anomalies that I
    analyzed and reported on for the 2001 UTC discussion. Below is the
    relevant excerpt.

    --Ken

    <quote from L2/01-412>

    Title: Response to L2/01-419 Block Boundary Fixes
    Author: Ken Whistler
    Date: October 30, 2001

    Mark Davis has suggested a number of fixes to Blocks.txt, to
    eliminate some inconsistencies and to try to establish an
    invariant that all block boundaries end on an XXXF boundary.
    As usual, in all things Unicode-related, there are some
    worms (I'm not sure whether they should be considered big
    wriggly earthworms or just nematodes) in this can.

    So as a response to The Great Innovator (Mark), The
    Great Disinnovator (me) has assembled the analysis below of
    *all* anomalies in block names. These fall into 13 distinct
    types, for each of which I give a separate analysis and
    a suggested disposition.

    In some instances, I think Mark's suggestions are fine, but
    in other cases, I'd rather we left well-enough alone and
    abandoned the quest for the invariant.
     
    </quote from L2/01-412>

    By the way, I lost that particular argument. The UTC *did*
    decide to end all the blocks on an XXXF boundary, and that
    change was made for Unicode 3.2. Anyone wanting to examine
    the resultant changes in detail can compare:

    http://www.unicode.org/Public/3.1-Update/Blocks-4.txt

    with

    http://www.unicode.org/Public/3.2-Update/Blocks-3.2.0.txt
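
    For anyone who would rather not compare the two files by eye, the
    comparison can be sketched in Python along the following lines. This
    is a rough sketch under stated assumptions, not a tested tool. It
    assumes the newer file uses the "start..end; name" layout shown in
    the excerpt below and that the older file uses "start; end; name"
    (the latter is an assumption; check the file itself). The function
    names and the two sample strings are hypothetical, and the sample
    lines are illustrative only, not copied from either file.

```python
import re

# Accepts "4E00..9FFF; Name" and (assumed older layout) "4E00; 9FFF; Name".
BLOCK_LINE = re.compile(
    r"^([0-9A-Fa-f]{4,6})\s*(?:\.\.|;)\s*([0-9A-Fa-f]{4,6})\s*;\s*(.+?)\s*$"
)


def parse_blocks(text):
    """Map each block's start code point to (end code point, block name)."""
    blocks = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        match = BLOCK_LINE.match(line)
        if match:
            start, end, name = match.groups()
            blocks[int(start, 16)] = (int(end, 16), name)
    return blocks


def changed_ends(old, new):
    """List (name, old_end, new_end) for blocks whose end point moved."""
    return [
        (new[start][1], old[start][0], new[start][0])
        for start in sorted(old)
        if start in new and old[start][0] != new[start][0]
    ]


# Illustrative input only; substitute the contents of the real files.
OLD_SAMPLE = "AC00; D7A3; Hangul Syllables\n4E00; 9FFF; CJK Unified Ideographs\n"
NEW_SAMPLE = "AC00..D7AF; Hangul Syllables\n4E00..9FFF; CJK Unified Ideographs\n"

for name, old_end, new_end in changed_ends(
    parse_blocks(OLD_SAMPLE), parse_blocks(NEW_SAMPLE)
):
    print(f"{name}: {old_end:04X} -> {new_end:04X}")
```

    Blocks are matched on their start code point rather than on their
    name, so a block that was renamed between versions would still be
    compared.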

    What follows is my assessment of Anomaly Type #11, the one
    Andrew was referring to. It describes the technical production
    reason for the way the header is constructed in NamesList.txt.

    <quote from L2/01-412>

    ================================================================

    TYPE 11: Block ranges match in Unicode and 10646, for
    blocks with generated character names, but NamesList.txt
    shows a mismatched range.

    4E00 CJK Unified Ideographs 9FA5
    4E00..9FFF; CJK Unified Ideographs
    CJK UNIFIED IDEOGRAPHS 4E00-9FFF

    Analysis: The range distinction in NamesList.txt is deliberate,
    to enable calculation of the cutoff point in the charts,
    where there are no actual character name entries in NamesList.txt
    to drive this.

    Suggested resolution: No action.

    ================================================================

    </quote from L2/01-412>
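
    One way to read the analysis above is that the chart cutoff is
    derived by rounding the NamesList.txt range out to whole chart
    columns. The Python sketch below works that arithmetic through. It
    assumes 16 code points per chart column (consistent with the columns
    9FB0-9FBF and so on in Andrew's question). The function name
    chart_range is hypothetical, and this is not a description of the
    actual chart production software.

```python
COLUMN = 16  # code points per chart column (assumption, see above)


def chart_range(first: int, last: int):
    """Round an inclusive range outward to whole chart columns."""
    start = first - first % COLUMN  # round down to the start of a column
    end = last | (COLUMN - 1)       # round up to the end of a column
    return start, end


# Range taken from the NamesList.txt header line quoted above.
start, end = chart_range(0x4E00, 0x9FA5)
print(f"{start:04X}-{end:04X}")  # 4E00-9FAF

# Whole columns between the chart cutoff and the block end in Blocks.txt.
omitted_columns = (0x9FFF - end) // COLUMN
print(omitted_columns)  # 5
```

    Under these assumptions the result reproduces the chart header
    Andrew quoted, and the five omitted columns are the five he lists,
    9FB0-9FBF through 9FF0-9FFF.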


