Re: How many characters?

From: Mark Davis (mark.davis@icu-project.org)
Date: Wed Nov 23 2005 - 16:41:37 CST


    These figures depend on what precisely is meant by the label. For
    example, if Han Compatibility is taken as meaning:

    Ideographic=True and Decomposition_Type!=None

    then, split between the BMP and the supplementary planes ("SMP" below),
    that gives for Unicode 4.1:

    [[:ideographic:]&[:^decomposition_type=none:]&[\u0000-\uFFFF]]
    399 Code Points

    [[:ideographic:]&[:^decomposition_type=none:]&[^\u0000-\uFFFF]]
    542 Code Points

    for a total of 941 Code Points. However, that includes 3 characters
    whose names do not call them CJK compatibility ideographs. Alternatively,
    the label could be going by the block name (and then excluding
    unassigned code points).

    Similarly, the label Alphabetics and Symbols is not actually Alphabetic
    union Symbol: it is really (I guess) for the BMP

    [[:gc=letter:][:gc=number:][:gc=symbol:][:gc=mark:][:gc=punctuation:][:gc=separator:]&[\u0000-\uFFFF]]
    minus the other listed stuff:
    Han (URO), Han Extension A, Han Compatibility, Hangul Syllables.
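
As a rough check of that reading, the subtraction can be done with the BMP figures quoted further down in this thread. This is only a sketch: the variable names are illustrative, and it assumes the label really means "all graphic characters minus the separately listed groups", which is a guess in the message above.

```python
# Unicode 4.1 BMP figures quoted later in this thread.
bmp_graphic_total = 51642  # subtotal of the six gc classes listed above

listed_elsewhere = {
    "Han (URO)": 20924,
    "Han Extension A": 6582,
    "Han Compatibility": 467,  # Asmus Freytag's count; Peter Constable lists 457
    "Hangul Syllables": 11172,
}

# What is left over after removing the separately listed groups.
alphabetics_and_symbols = bmp_graphic_total - sum(listed_elsewhere.values())
print(alphabetics_and_symbols)
```

With 467 for Han Compatibility the remainder matches the 12,497 reported below; with 457 the parts would not add up to 51,642.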

    Here is the breakdown I get for 4.1, using the main properties that we
    list in Chapter 2.

                                                      BMP        SMP        All
    [:gc=letter:]                                  46,618     44,777     91,395
    [:gc=number:]                                     514        181        695
    [:gc=symbol:]                                   3,339        619      3,958
    [:gc=mark:]                                       723        286      1,009
    [:gc=punctuation:]                                428         12        440
    [:gc=separator:]                                   20          0         20
    *Subtotal:*                                    51,642     45,875     97,517
    [:gc=control:]                                     65          0         65
    [:gc=format:]                                      33        105        138
    [:gc=private-use:]                              6,400    131,068    137,468
    [:gc=surrogate:]                                2,048          0      2,048
    [:noncharacter_code_point:]                        34         32         66
    [[:unassigned:]-[:noncharacter_code_point:]]    5,314    871,496    876,810
    *Total*                                        65,536  1,048,576  1,114,112
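
The column sums in this table can be checked mechanically. The following is a minimal sketch, not part of the original message: the dictionary and function names are illustrative, and the figures are simply copied from the table.

```python
# Figures copied from the Unicode 4.1 table: (BMP, supplementary planes).
GRAPHIC = {
    "letter": (46618, 44777),
    "number": (514, 181),
    "symbol": (3339, 619),
    "mark": (723, 286),
    "punctuation": (428, 12),
    "separator": (20, 0),
}
OTHER = {
    "control": (65, 0),
    "format": (33, 105),
    "private-use": (6400, 131068),
    "surrogate": (2048, 0),
    "noncharacter": (34, 32),
    "unassigned (reserved)": (5314, 871496),
}


def column_sums(rows):
    """Return (BMP sum, supplementary sum) over a dict of row tuples."""
    bmp = sum(b for b, _ in rows.values())
    smp = sum(s for _, s in rows.values())
    return bmp, smp


subtotal = column_sums(GRAPHIC)
other = column_sums(OTHER)
total = (subtotal[0] + other[0], subtotal[1] + other[1])

print("subtotal:", subtotal, sum(subtotal))
print("total:   ", total, sum(total))
```

The totals should equal one plane (65,536 code points) for the BMP and sixteen planes (1,048,576) for the rest, which is a useful consistency test for any proposed set of figures.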

    Mark

    BTW, we really should dispense with the now-artificial distinction
    between SMP and BMP in these figures.

    Asmus Freytag wrote:

    > On 11/23/2005 7:43 AM, Peter Constable wrote:
    >
    >> By my calculations, both you and Ken have errors in your 4.1
    >> statistics.
    >> Re the BMP: Doing a hand count of Cf characters in TUS4.1, I come up
    >> with 33. Not 31, not 35.
    >>
    > I also find 33, as follows:
    >
    > C:\UniDev\data\UNIDATA-4.1.0>findstr /B /R [A-F0-9][A-F0-9][A-F0-9][A-F0-9]; UnicodeData.txt | findstr ;Cf; | wc
    > 33 96 1555
    >
    > The first 'findstr' limits the search to BMP (4 digit code points),
    > the second searches for Cf in the result, and the 'wc' counts the
    > lines found.
    >
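
For readers without `findstr` and `wc`, the same kind of count can be sketched in Python. This is an illustration rather than the command used above: the function name is hypothetical, and it compares the General_Category field exactly instead of searching the whole line for `;Cf;`, which is slightly stricter than the original pipeline.

```python
import re

# Lines whose first field is exactly four hex digits are BMP code points,
# mirroring the `findstr /B /R` pattern in the command above.
BMP_LINE = re.compile(r"^[0-9A-F]{4};")


def count_bmp_category(lines, category="Cf"):
    """Count BMP entries in UnicodeData.txt-style lines with the given gc."""
    count = 0
    for line in lines:
        if not BMP_LINE.match(line):
            continue  # supplementary-plane entry (5 or 6 hex digits) or junk
        fields = line.rstrip("\n").split(";")
        # Field 2 (zero-based) of UnicodeData.txt is General_Category.
        if len(fields) > 2 and fields[2] == category:
            count += 1
    return count


# Usage, assuming a local copy of the 4.1.0 data file (path is hypothetical):
# with open("UnicodeData.txt", encoding="ascii") as f:
#     print(count_bmp_category(f))
```

Passing an open file works because the function only needs an iterable of lines; a different `category` argument gives the count for another General_Category value.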
    >> And I came up with the following counts for graphic characters in
    >> Unicode 4.1:
    >>
    >> Alphabetics, Symbols: 12,497
    >>
    >>
    > I find 12,964 characters that are [LMNPSZ][a-z], which, given 467 Han
    > Compat (not 457), gives 12,497
    >
    >> Han (URO): 20,924
    >> Han Extension A: 6,582
    >> Han Compatibility: 457
    >>
    >>
    > I find 467 as follows:
    >
    > C:\UniDev\data\UNIDATA-4.1.0>findstr /B /R [A-F0-9][A-F0-9][A-F0-9][A-F0-9]; UnicodeData.txt | findstr IDEOGRAPH- | wc
    > 467 1401 27980
    >
    > These are all the characters with "IDEOGRAPH-" in their name
    >
    >> Hangul Syllables: 11,172
    >> Total Graphic characters: 51,642
    >>
    >>
    >> Re the supplementary planes: My numbers agree with yours.
    >>
    >> Overall, then, I believe the correct numbers for TUS4.1 are as follows:
    >>
    >> Unicode 4.1:
    >>
    >> 51642 graphic characters assigned (BMP)
    >> 33 format control characters assigned (BMP)
    >> 65 control characters assigned (BMP)
    >> 6400 private use characters assigned (BMP)
    >> 2048 surrogate code points designated (BMP)
    >> 34 noncharacter code points designated (BMP)
    >> 5314 reserved code points (BMP)
    >> 45875 graphic characters assigned (supplementary planes)
    >> 105 format characters assigned (supplementary planes)
    >> 131068 private use characters assigned (supplementary planes)
    >> 32 noncharacter code points designated (supplementary planes)
    >> 871496 reserved code points (supplementary planes)
    >> ------------------------------------------------------------------
    >> 1114112 code points altogether
    >>
    >>
    >> I haven't looked at 5.0 numbers; let's see if we can agree on 4.1
    >> numbers, though.
    >>
    >>
    >> Peter Constable
    >>
    >>
    >>
    >>
    >>> -----Original Message-----
    >>> From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org] On
    >>> Behalf Of Andrew West
    >>> Sent: Wednesday, November 23, 2005 4:26 AM
    >>> To: unicode@unicode.org
    >>> Subject: Re: How many characters?
    >>>
    >>> On 22/11/05, Kenneth Whistler <kenw@sybase.com> wrote:
    >>>
    >>>
    >>>> Unicode 4.1:
    >>>>
    >>>> 51644 graphic characters assigned (BMP)
    >>>> 31 format control characters assigned (BMP)
    >>>> 65 control characters assigned (BMP)
    >>>> 6400 private use characters assigned (BMP)
    >>>> 2048 surrogate code points designated (BMP)
    >>>> 34 noncharacter code points designated (BMP)
    >>>> 5314 reserved code points (BMP)
    >>>> 45980 graphic characters assigned (supplementary planes)
    >>>> 131068 private use characters assigned (supplementary planes)
    >>>> 32 noncharacter code points designated (supplementary planes)
    >>>> 871496 reserved code points (supplementary planes)
    >>>> ------------------------------------------------------------------
    >>>> 1114112 code points altogether
    >>>>
    >>>> Unicode 5.0:
    >>>>
    >>>> 51986 graphic characters assigned (BMP)
    >>>> 31 format control characters assigned (BMP)
    >>>> 65 control characters assigned (BMP)
    >>>> 6400 private use characters assigned (BMP)
    >>>> 2048 surrogate code points designated (BMP)
    >>>> 34 noncharacter code points designated (BMP)
    >>>> 4972 reserved code points (BMP)
    >>>> 47007 graphic characters assigned (supplementary planes)
    >>>> 131068 private use characters assigned (supplementary planes)
    >>>> 32 noncharacter code points designated (supplementary planes)
    >>>> 870469 reserved code points (supplementary planes)
    >>>> ------------------------------------------------------------------
    >>>> 1114112 code points altogether
    >>>>
    >>>>
    >>>
    >>> Ken may perhaps have forgotten that the 4.0 figures wrongly count five
    >>> format characters as graphic characters, and so after adjusting for
    >>> the longstanding out by two error the 4.1 figures for format
    >>> characters are still out by four due to the change in GC of U+200B to
    >>> Cf in 4.0.1. By my calculations the correct values for 4.1 are:
    >>>
    >>> Unicode 4.1:
    >>>
    >>> 51640 graphic characters assigned (BMP)
    >>> 35 format control characters assigned (BMP)
    >>> 65 control characters assigned (BMP)
    >>> 6400 private use characters assigned (BMP)
    >>> 2048 surrogate code points designated (BMP)
    >>> 34 noncharacter code points designated (BMP)
    >>> 5314 reserved code points (BMP)
    >>> 45875 graphic characters assigned (supplementary planes)
    >>> 105 format characters assigned (supplementary planes)
    >>> 131068 private use characters assigned (supplementary planes)
    >>> 32 noncharacter code points designated (supplementary planes)
    >>> 871496 reserved code points (supplementary planes)
    >>> ------------------------------------------------------------------
    >>> 1114112 code points altogether
    >>>
    >>> Based on the latest publicly available version of the 5.0 UCD data, I
    >>> get the following figures for 5.0. My figures have two less BMP and
    >>> two more SMP characters than Ken's figures, but I haven't
    >>> cross-checked with N2991 yet (N2991 states there are 1,359 new
    >>> characters, but this must be a typo for 1,369), so I'm not sure who's
    >>> correct.
    >>>
    >>> Unicode 5.0:
    >>>
    >>> 51980 graphic characters assigned (BMP)
    >>> 35 format control characters assigned (BMP)
    >>> 65 control characters assigned (BMP)
    >>> 6400 private use characters assigned (BMP)
    >>> 2048 surrogate code points designated (BMP)
    >>> 34 noncharacter code points designated (BMP)
    >>> 4974 reserved code points (BMP)
    >>> 46904 graphic characters assigned (supplementary planes)
    >>> 105 format characters assigned (supplementary planes)
    >>> 131068 private use characters assigned (supplementary planes)
    >>> 32 noncharacter code points designated (supplementary planes)
    >>> 870467 reserved code points (supplementary planes)
    >>> ------------------------------------------------------------------
    >>> 1114112 code points altogether
    >>>
    >>> Andrew
    >>>



    This archive was generated by hypermail 2.1.5 : Wed Nov 23 2005 - 16:42:33 CST