Re: How many characters?

From: Mark Davis (mark.davis@icu-project.org)
Date: Wed Nov 23 2005 - 19:04:04 CST


    >This is incorrect, and is making the same mistake that Peter made
    >first, and then Asmus.
    >
    >Format controls as defined in Chapter 2 are the union of
    >gc=Cf and gc=Zl and gc=Zp.
    >
    It looks like I didn't make my main point clear. It is that if we have
    any table with labeled numbers, then we need to make sure that the
    meaning of each of the labels is precisely defined in terms of the UCD
    properties (and easily accessed from that table). If so many reasonable
    people make the same mistake, that tells us that there is a problem,
    Houston ;-)
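
As an illustration of tying a label to UCD properties, here is a minimal sketch that stores each label as a set of General_Category values and counts code points against it. The "format_controls" entry follows the definition quoted above (gc=Cf, gc=Zl, gc=Zp); the `LABELS` table, the "gc_format_only" label, and the `count_label` helper are hypothetical names used only for this example. Python's `unicodedata` module reflects the UCD version bundled with the interpreter, so the counts will not match the U4.1 figures in this thread.

```python
import sys
import unicodedata

# Label -> set of General_Category values (hypothetical table for illustration).
# "format_controls" uses the Chapter 2 definition quoted in this thread;
# "gc_format_only" is the narrower reading that several people assumed.
LABELS = {
    "format_controls": {"Cf", "Zl", "Zp"},
    "gc_format_only": {"Cf"},
}


def count_label(label):
    """Count code points whose General_Category belongs to the label's set."""
    categories = LABELS[label]
    return sum(
        1
        for cp in range(sys.maxunicode + 1)
        if unicodedata.category(chr(cp)) in categories
    )


if __name__ == "__main__":
    print("UCD version:", unicodedata.unidata_version)
    for name in LABELS:
        print(name, count_label(name))
```

The two labels differ by exactly two code points, U+2028 (gc=Zl) and U+2029 (gc=Zp), which is the same gap as between the 138 and 140 format-character figures cited in this message.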

    >>BTW, we really should dispense with the now-artificial distinction
    >>between SMP and BMP for the figures nowadays.
    >
    >
    >I disagree.
    >
    >The distinction is a real, not artificial one.

    The distinction is real -- and so are a thousand other distinctions we
    could make. The question is whether it is the most useful distinction to
    draw -- and I don't think it is any more, particularly. I think people'd
    find it more useful to break down the figures among the Alphabetics and
    Symbols -- eg, that there are (in U4.1) a total of 1,009 combining marks
    and a total of 3,958 symbols; that there are a total of 270 decimal
    digits and 425 other gc=Number characters -- those to me mean more than
    the fact that of the 140 format characters, 35 happen to be below U+FFFF
    and 105 happen to be above it.
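
A breakdown of this kind can be sketched with Python's standard `unicodedata` module, assuming the major General_Category classes are the grouping of interest. The `MAJOR` mapping and the `breakdown` function are hypothetical names for this example, and the module uses whatever UCD version ships with the interpreter, so the output will differ from the U4.1 numbers quoted in this thread. The BMP versus supplementary split is kept only to show that both views come from the same pass.

```python
import sys
import unicodedata
from collections import Counter

# First letter of General_Category -> readable class name (illustrative choice).
MAJOR = {
    "L": "letter",
    "N": "number",
    "S": "symbol",
    "M": "mark",
    "P": "punctuation",
    "Z": "separator",
}


def breakdown():
    """Return {class_name: (bmp_count, supplementary_count)}."""
    bmp, supplementary = Counter(), Counter()
    for cp in range(sys.maxunicode + 1):
        major = unicodedata.category(chr(cp))[0]
        if major in MAJOR:
            # Code points up to U+FFFF are in the BMP; the rest are supplementary.
            target = bmp if cp <= 0xFFFF else supplementary
            target[MAJOR[major]] += 1
    return {name: (bmp[name], supplementary[name]) for name in MAJOR.values()}


if __name__ == "__main__":
    print("UCD version:", unicodedata.unidata_version)
    for name, (b, s) in breakdown().items():
        print(f"{name:<12} {b:>8,} {s:>8,} {b + s:>8,}")
```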

    But you and I are perhaps not the typical audience; it would be
    productive to see what other people on this list would find the most useful.

    Mark

    Kenneth Whistler wrote:

    >Mark said:
    >
    >
    >
    >>These figures depend on what precisely is meant by the label.
    >>
    >>
    >
    >Of course, but the labels have been intended to have precise
    >meanings since we first started publishing historical lists of the
    >"Number of Assigned Characters" in Unicode 3.0 back in 2000.
    >It isn't as if we can just wave our arms around, say the
    >labels mean whatever somebody might decide they mean, and
    >then change the statistics every year based on that.
    >
    >
    >
    >>For
    >>example, if Han Compatibility is taken as meaning:
    >>
    >>Ideographic=True and Decomposition_Type!=None
    >>
    >>Then divided by BMP or SMP, that gives (U4.1):
    >>
    >>[[:ideographic:]&[:^decomposition_type=none:]&[\u0000-\uFFFF]]
    >>399 Code Points
    >>
    >>[[:ideographic:]&[:^decomposition_type=none:]&[^\u0000-\uFFFF]]
    >>542 Code Points
    >>
    >>for a total of 941 Code Points. However, that includes 3 characters not
    >>called CJK compatibility in their names. Or it could be going by the
    >>block name (and then excluding unassigned code points).
    >>
    >>
    >
    >But of course, "Han Compatibility" in the stats doesn't mean the former.
    >It means and always has meant the number of assigned characters in
    >the following two blocks:
    >
    >F900..FAFF; CJK Compatibility Ideographs
    >2F800..2FA1F; CJK Compatibility Ideographs Supplement
    >
    >despite the fact that 12 of the ideographs in the CJK Compatibility
    >Ideographs block have the Unified_Ideograph property, and
    >
    >despite the fact that there are a number of characters outside
    >of those blocks which have the Ideographic property, some of
    >which also have decompositions.
    >
    >
    >
    >>Similarly, the label Alphabetics and Symbols is not actually Alphabetic
    >>union Symbol: it is really (I guess) for the BMP
    >>
    >>
    > ^^^^^^^^^
    >
    >No guessing necessary, since "Graphic" in Tables D-2 and D-3
    >of Unicode 4.0 was quite consciously and deliberately aligned with
    >Table 2-2. And the "Alphabetic, Symbols" line is part of the summation
    >of values that leads to Graphic characters as a subtotal.
    >
    >
    >
    >>
    >>
    >>[[:gc=letter:][:gc=number:][:gc=symbol:][:gc=mark:][:gc=punctuation:][:gc=separator:]&[\u0000-\uFFFF]]
    >
    >
    >>minus the other listed stuff:
    >>Han (URO), Han Extension A, Han Compatibility, Hangul Syllables.
    >>
    >>Here is the breakdown I get for 4.1, using the main properties that we
    >>list in Chapter 2.
    >>
    >>
    >>                          BMP      SMP      All
    >>[:gc=letter:]          46,618   44,777   91,395
    >>[:gc=number:]             514      181      695
    >>[:gc=symbol:]           3,339      619    3,958
    >>[:gc=mark:]               723      286    1,009
    >>[:gc=punctuation:]        428       12      440
    >>[:gc=separator:]           20        0       20
    >>*Subtotal:*            51,642   45,875   97,517
    >>
    >>
    >
    >This subtotal is incorrect (as a count of Graphic
    >characters), because it doesn't distinguish
    >those [:gc=separator:] which *are* Graphic characters from
    >those which are not.
    >
    >
    >
    >>[:gc=control:]             65        0       65
    >>[:gc=format:]              33      105      138
    >>
    >>
    >
    >This is incorrect, and is making the same mistake that Peter made
    >first, and then Asmus.
    >
    >Format controls as defined in Chapter 2 are the union of
    >gc=Cf and gc=Zl and gc=Zp.
    >
    >
    >
    >
    >>Mark
    >>
    >>BTW, we really should dispense with the now-artificial distinction
    >>between SMP and BMP for the figures nowadays.
    >>
    >>
    >
    >I disagree.
    >
    >The distinction is a real, not artificial one. And while there
    >are rhetorical reasons for deemphasizing it and encouraging everyone
    >to implement all code points equally, I am not in favor of
    >scrubbing the stats of the differences.
    >
    >--Ken


