Re: CJK Unified Ideographs Range

From: Kenneth Whistler (kenw@sybase.com)
Date: Wed Feb 19 2003 - 17:37:03 EST

Next message: Werner LEMBERG: "Re: [OpenType] PS glyph `phi' vs `phi1'"

Previous message: Marion Gunn: "A new font called Gentium"
Maybe in reply to: Andrew C. West: "CJK Unified Ideographs Range"
Next in thread: Andrew C. West: "Re: CJK Unified Ideographs Range"

Andrew asked:

> I've asked this question before, but I've never had a satisfactory response, so
> I'll ask it again now that Unicode 4 is due to be released soon.
>
> Section 10.1 of the Unicode Standard, as well as Blocks-4.0.0.txt, give the
> range of the CJK Unified Ideographs block as U+4E00 through U+9FFF, whereas at
> the top of the CJK Unified Ideographs code chart it clearly states "Range:
> 4E00–9FAF", and does not show the columns 9FB0-9FBF, 9FC0-9FCF, 9FD0-9FDF,
> 9FE0-9FEF and 9FF0-9FFF. Is there a reason for this discrepancy ?
>
> Given that new CJK unified ideographs are added to supplementary CJK blocks
> (CJK-A, CJK-B and CJK-C), and I understand that no more characters are intended
> to be added to the basic CJK block, why then are U+9FB0 through U+9FFF reserved
> for the CJK Unified Ideographs block ? Surely these eighty code points would be
> better utilised if freed for use by new scripts.

The UTC dealt with this issue of block boundaries back in October, 2001,
in the context of the review of Blocks.txt for Unicode 3.2. There
is mention of this issue and the changes made in Article VII of
UAX #28, Unicode 3.2.

In particular, the inconsistency in block ending range handling for
CJK Unified Ideographs versus the Hangul and Extension A and Extension B
blocks was resolved in favor of ending each block on a "round" hex
boundary, i.e. at XXXF, regardless of whether that was the last character
in the block or not. The extra "space" of reserved code points in
the CJK Unified Ideographs block is an artifact of block decisions made
way back in 1992, well before the BMP looked as tight as it does now.
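As a rough illustration of the "round" boundary rule and of the eighty code points Andrew mentions, here is a minimal Python sketch. It assumes the "START..END; Name" line format of the Blocks.txt entry quoted further down; the helper names parse_block_line and ends_on_round_boundary are made up for this example and are not part of any Unicode tooling.

```python
def parse_block_line(line):
    """Parse one 'START..END; Name' line (the Blocks.txt style quoted in this message)."""
    range_part, name = line.split(";", 1)
    start_hex, end_hex = range_part.strip().split("..")
    return int(start_hex, 16), int(end_hex, 16), name.strip()


def ends_on_round_boundary(end):
    """True if the last code point of a block has the form XXXF."""
    return end & 0xF == 0xF


start, end, name = parse_block_line("4E00..9FFF; CJK Unified Ideographs")

# Code points from U+9FB0 (first column missing from the chart) to the block end.
reserved = end - 0x9FB0 + 1

print(name, ends_on_round_boundary(end), reserved)
```

A block end such as 9FA5 fails the check, while 9FFF passes, and the span 9FB0..9FFF works out to the eighty code points in question.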

In case you are interested, the particular anomaly regarding the end of
the CJK Unified Ideographs block versus the header printed in the code
charts is just one of thirteen different types of anomalies that I
analyzed and reported on for the 2001 UTC discussion. Below is the
relevant excerpt.

--Ken

<quote from L2/01-412>

Title: Response to L2/01-419 Block Boundary Fixes
Author: Ken Whistler
Date: October 30, 2001

Mark Davis has suggested a number of fixes to Blocks.txt, to
eliminate some inconsistencies and to try to establish an
invariant that all block boundaries end on an XXXF boundary.
As usual, in all things Unicode-related, there are some
worms (I'm not sure whether they should be considered big
wriggly earthworms or just nematodes) in this can.

So as a response to The Great Innovator (Mark), The
Great Disinnovator (me) has assembled the analysis below of
*all* anomalies in block names. These fall into 13 distinct
types, for each of which I give a separate analysis and
a suggested disposition.

In some instances, I think Mark's suggestions are fine, but
in other cases, I'd rather we left well-enough alone and
abandoned the quest for the invariant.

</quote from L2/01-412>

By the way, I lost that particular argument. The UTC *did*
decide to end all the blocks on an XXXF boundary, and that
change was made for Unicode 3.2. Anyone wanting to examine
the resultant changes in detail can compare:

http://www.unicode.org/Public/3.1-Update/Blocks-4.txt

with

http://www.unicode.org/Public/3.2-Update/Blocks-3.2.0.txt
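For anyone who would rather script that comparison than eyeball it, the following Python sketch lists the blocks whose ranges differ between two files. It assumes both files use "START..END; Name" lines with "#" comments; the older file may be laid out differently and need its own parser. The function names and the two sample texts are invented for illustration and are not taken from the real files.

```python
def parse_blocks(text):
    """Map block name -> (start, end) from 'START..END; Name' lines."""
    blocks = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        range_part, name = line.split(";", 1)
        start_hex, end_hex = range_part.strip().split("..")
        blocks[name.strip()] = (int(start_hex, 16), int(end_hex, 16))
    return blocks


def changed_ranges(old_text, new_text):
    """Return {name: (old_range, new_range)} for blocks present in both but different."""
    old, new = parse_blocks(old_text), parse_blocks(new_text)
    return {
        name: (old[name], new[name])
        for name in old
        if name in new and old[name] != new[name]
    }


# Made-up sample data; substitute the contents of the two real files.
OLD_SAMPLE = """# hypothetical older file
1000..10A5; Sample Block One
2000..20FF; Sample Block Two
"""
NEW_SAMPLE = """# hypothetical newer file
1000..10AF; Sample Block One
2000..20FF; Sample Block Two
"""

for name, (before, after) in changed_ranges(OLD_SAMPLE, NEW_SAMPLE).items():
    print(f"{name}: {before[0]:04X}..{before[1]:04X} -> {after[0]:04X}..{after[1]:04X}")
```

Blocks that exist in only one of the two files are ignored here; a fuller comparison would report those as well.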

What follows is my assessment of Anomaly Type #11, which
was the one Andrew was referring to, describing the technical
production reason for the way the header is constructed in
NamesList.txt.

<quote from L2/01-412>

================================================================

TYPE 11: Block ranges match in Unicode and 10646, for
blocks with generated character names, but NamesList.txt
shows a mismatched range.

4E00 CJK Unified Ideographs 9FA5
4E00..9FFF; CJK Unified Ideographs
CJK UNIFIED IDEOGRAPHS 4E00-9FFF

Analysis: The range distinction in NamesList.txt is deliberate,
to enable calculation of the cutoff point in the charts,
where there are no actual character name entries in NamesList.txt
to drive this.

Suggested resolution: No action.

================================================================

</quote from L2/01-412>
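
The arithmetic behind the chart cutoff can be sketched as follows. This is only a guess at how a header end value could drive the number of 16-code-point chart columns; it is not a description of the actual chart production tool, and the function name chart_columns is hypothetical. The values 4E00, 9FA5 and 9FFF are the ones quoted in the Type 11 entry above.

```python
def chart_columns(start, end):
    """Starting code points of the 16-wide chart columns covering start..end."""
    first = start & ~0xF
    last = end & ~0xF
    return list(range(first, last + 1, 0x10))


header_end = 0x9FA5  # end value on the NamesList.txt header line quoted above
block_end = 0x9FFF   # end of the block per Blocks.txt and 10646

shown = chart_columns(0x4E00, header_end)
shown_set = set(shown)
omitted = [c for c in chart_columns(0x4E00, block_end) if c not in shown_set]

# The last column shown runs to header_end rounded up to XXXF.
printed_range_end = header_end | 0xF

print(f"chart range ends at {printed_range_end:04X}")
print("omitted columns:", [f"{c:04X}" for c in omitted])
```

Under that assumption the chart would stop at 9FAF and leave out the columns starting at 9FB0, 9FC0, 9FD0, 9FE0 and 9FF0, which matches what Andrew reports seeing in the code chart.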


This archive was generated by hypermail 2.1.5 : Wed Feb 19 2003 - 18:25:27 EST