Re: How many characters?

From: Asmus Freytag (asmusf@ix.netcom.com)
Date: Wed Nov 23 2005 - 14:01:57 CST

    On 11/23/2005 7:43 AM, Peter Constable wrote:

    >By my calculations, both you and Ken have errors in your 4.1 statistics.
    >
    >Re the BMP: Doing a hand count of Cf characters in TUS4.1, I come up with 33. Not 31, not 35.
    >
    I also find 33, as follows:

    C:\UniDev\data\UNIDATA-4.1.0>findstr /B /R [A-F0-9][A-F0-9][A-F0-9][A-F0-9]; UnicodeData.txt | findstr ;Cf; | wc
             33 96 1555

    The first 'findstr' limits the search to the BMP (four-digit code points), the
    second searches for Cf in the result, and the 'wc' counts the lines found.
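
    For readers without findstr, the same count can be sketched in Python. This is a minimal sketch, not the method used above: the function name count_bmp_category and the sample lines are illustrative assumptions. It also compares the General Category field exactly, whereas the findstr pipeline matches ";Cf;" anywhere in the line.

```python
import re

# Four hex digits followed by the field separator: a BMP code point,
# mirroring the anchored findstr pattern used in the message.
BMP_LINE = re.compile(r"^[0-9A-F]{4};")


def count_bmp_category(lines, category):
    """Count BMP entries whose General Category field equals `category`."""
    count = 0
    for line in lines:
        if not BMP_LINE.match(line):
            continue  # skip 5- and 6-digit (supplementary) code points
        fields = line.rstrip("\n").split(";")
        # Field 2 of a UnicodeData.txt record is the General Category.
        if len(fields) > 2 and fields[2] == category:
            count += 1
    return count


# Illustrative sample lines only, not the full data file.
SAMPLE = [
    "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;",
    "00AD;SOFT HYPHEN;Cf;0;BN;;;;;N;;;;;",
    "200B;ZERO WIDTH SPACE;Cf;0;BN;;;;;N;;;;;",
    "E0001;LANGUAGE TAG;Cf;0;BN;;;;;N;;;;;",
]

print(count_bmp_category(SAMPLE, "Cf"))  # 2
```

    To run it on the real data, pass an open file instead of SAMPLE, for example count_bmp_category(open("UnicodeData.txt"), "Cf"). Against the 4.1.0 file this would be expected to match the 33 found above.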

    > And I came up with the following counts for graphic characters in Unicode 4.1:
    >
    >Alphabetics, Symbols: 12,497
    >
    >
    I find 12,964 characters whose General Category matches [LMNPSZ][a-z];
    subtracting the 467 Han Compatibility ideographs (not 457) gives 12,497.

    >Han (URO): 20,924
    >Han Extension A: 6,582
    >Han Compatibility: 457
    >
    >
    I find 467 as follows:

    C:\UniDev\data\UNIDATA-4.1.0>findstr /B /R [A-F0-9][A-F0-9][A-F0-9][A-F0-9]; UnicodeData.txt | findstr IDEOGRAPH- | wc
            467 1401 27980

    These are all the BMP characters with "IDEOGRAPH-" in their name.

    >Hangul Syllables: 11,172
    >Total Graphic characters: 51,642
    >
    >
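
    As a cross-check on 467 versus 457, the quoted per-block counts can simply be added up. A minimal sketch using only numbers from this message (the variable names are illustrative): with 467 the blocks sum to the quoted total of 51,642, while with 457 they fall ten short.

```python
# Per-block BMP counts quoted in the message, with Han Compatibility
# taken as 467 (the findstr count) rather than the 457 in the quoted list.
bmp_graphic = {
    "Alphabetics, Symbols": 12497,
    "Han (URO)": 20924,
    "Han Extension A": 6582,
    "Han Compatibility": 467,
    "Hangul Syllables": 11172,
}

total = sum(bmp_graphic.values())
total_with_457 = total - 467 + 457

print(total)           # 51642, the quoted "Total Graphic characters"
print(total_with_457)  # 51632, ten short of the quoted total
print(12964 - 467)     # 12497, the "Alphabetics, Symbols" figure
```

    So the quoted total is consistent with 467, which suggests the 457 in the list is a typo rather than a counting difference.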
    >Re the supplementary planes: My numbers agree with yours.
    >
    >Overall, then, I believe the correct numbers for TUS4.1 are as follows:
    >
    >Unicode 4.1:
    >
    > 51642 graphic characters assigned (BMP)
    > 33 format control characters assigned (BMP)
    > 65 control characters assigned (BMP)
    > 6400 private use characters assigned (BMP)
    > 2048 surrogate code points designated (BMP)
    > 34 noncharacter code points designated (BMP)
    > 5314 reserved code points (BMP)
    > 45875 graphic characters assigned (supplementary planes)
    > 105 format characters assigned (supplementary planes)
    > 131068 private use characters assigned (supplementary planes)
    > 32 noncharacter code points designated (supplementary planes)
    > 871496 reserved code points (supplementary planes)
    >------------------------------------------------------------------
    > 1114112 code points altogether
    >
    >
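
    Any table of this shape can be sanity-checked against the size of the code space: the BMP rows must sum to 65,536 and the supplementary rows to 16 x 65,536. A minimal sketch with the 4.1 figures from the table just quoted (the dictionary keys are illustrative labels):

```python
BMP_SIZE = 0x10000                  # 65,536 code points in plane 0
SUPPLEMENTARY_SIZE = 16 * 0x10000   # planes 1 through 16

# Unicode 4.1 figures from the quoted table.
bmp = {
    "graphic": 51642,
    "format": 33,
    "control": 65,
    "private use": 6400,
    "surrogate": 2048,
    "noncharacter": 34,
    "reserved": 5314,
}
supplementary = {
    "graphic": 45875,
    "format": 105,
    "private use": 131068,
    "noncharacter": 32,
    "reserved": 871496,
}

bmp_total = sum(bmp.values())
supp_total = sum(supplementary.values())

print(bmp_total == BMP_SIZE)             # True
print(supp_total == SUPPLEMENTARY_SIZE)  # True
print(bmp_total + supp_total)            # 1114112
```

    This only shows that the rows are mutually consistent. A miscount that moves characters between two rows, such as graphic versus format, leaves the totals unchanged, which is why the per-category counts above still have to be checked against the data file.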
    >I haven't looked at 5.0 numbers; let's see if we can agree on 4.1 numbers, though.
    >
    >
    >Peter Constable
    >
    >
    >
    >
    >>-----Original Message-----
    >>From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org] On
    >>Behalf Of Andrew West
    >>Sent: Wednesday, November 23, 2005 4:26 AM
    >>To: unicode@unicode.org
    >>Subject: Re: How many characters?
    >>
    >>On 22/11/05, Kenneth Whistler <kenw@sybase.com> wrote:
    >>
    >>
    >>>Unicode 4.1:
    >>>
    >>> 51644 graphic characters assigned (BMP)
    >>> 31 format control characters assigned (BMP)
    >>> 65 control characters assigned (BMP)
    >>> 6400 private use characters assigned (BMP)
    >>> 2048 surrogate code points designated (BMP)
    >>> 34 noncharacter code points designated (BMP)
    >>> 5314 reserved code points (BMP)
    >>> 45980 graphic characters assigned (supplementary planes)
    >>> 131068 private use characters assigned (supplementary planes)
    >>> 32 noncharacter code points designated (supplementary planes)
    >>> 871496 reserved code points (supplementary planes)
    >>>------------------------------------------------------------------
    >>>1114112 code points altogether
    >>>
    >>>Unicode 5.0:
    >>>
    >>> 51986 graphic characters assigned (BMP)
    >>> 31 format control characters assigned (BMP)
    >>> 65 control characters assigned (BMP)
    >>> 6400 private use characters assigned (BMP)
    >>> 2048 surrogate code points designated (BMP)
    >>> 34 noncharacter code points designated (BMP)
    >>> 4972 reserved code points (BMP)
    >>> 47007 graphic characters assigned (supplementary planes)
    >>> 131068 private use characters assigned (supplementary planes)
    >>> 32 noncharacter code points designated (supplementary planes)
    >>> 870469 reserved code points (supplementary planes)
    >>>------------------------------------------------------------------
    >>>1114112 code points altogether
    >>>
    >>>
    >>>
    >>Ken may perhaps have forgotten that the 4.0 figures wrongly count five
    >>format characters as graphic characters, and so after adjusting for
    >>the longstanding out-by-two error the 4.1 figures for format
    >>characters are still out by four due to the change in GC of U+200B to
    >>Cf in 4.0.1. By my calculations the correct values for 4.1 are:
    >>
    >>Unicode 4.1:
    >>
    >> 51640 graphic characters assigned (BMP)
    >> 35 format control characters assigned (BMP)
    >> 65 control characters assigned (BMP)
    >> 6400 private use characters assigned (BMP)
    >> 2048 surrogate code points designated (BMP)
    >> 34 noncharacter code points designated (BMP)
    >> 5314 reserved code points (BMP)
    >> 45875 graphic characters assigned (supplementary planes)
    >> 105 format characters assigned (supplementary planes)
    >>131068 private use characters assigned (supplementary planes)
    >> 32 noncharacter code points designated (supplementary planes)
    >>871496 reserved code points (supplementary planes)
    >>------------------------------------------------------------------
    >>1114112 code points altogether
    >>
    >>Based on the latest publicly available version of the 5.0 UCD data, I
    >>get the following figures for 5.0. My figures have two fewer BMP and
    >>two more SMP characters than Ken's figures, but I haven't
    >>cross-checked with N2991 yet (N2991 states there are 1,359 new
    >>characters, but this must be a typo for 1,369), so I'm not sure who's
    >>correct.
    >>
    >>Unicode 5.0:
    >>
    >> 51980 graphic characters assigned (BMP)
    >> 35 format control characters assigned (BMP)
    >> 65 control characters assigned (BMP)
    >> 6400 private use characters assigned (BMP)
    >> 2048 surrogate code points designated (BMP)
    >> 34 noncharacter code points designated (BMP)
    >> 4974 reserved code points (BMP)
    >> 46904 graphic characters assigned (supplementary planes)
    >> 105 format characters assigned (supplementary planes)
    >>131068 private use characters assigned (supplementary planes)
    >> 32 noncharacter code points designated (supplementary planes)
    >>870467 reserved code points (supplementary planes)
    >>------------------------------------------------------------------
    >>1114112 code points altogether
    >>
    >>Andrew
    >>
    >>
    >>
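
    The stated difference between the two 5.0 tables (two fewer BMP and two more supplementary characters in Andrew's figures than in Ken's) can be checked by comparing assigned graphic plus format characters per region. A minimal sketch using the quoted figures; Ken's table has no separate supplementary format row, so that entry is set to 0 here as an assumption.

```python
# Unicode 5.0 figures as quoted: (graphic, format) per region.
ken = {"bmp": (51986, 31), "supplementary": (47007, 0)}
andrew = {"bmp": (51980, 35), "supplementary": (46904, 105)}


def assigned(table, region):
    """Graphic plus format characters assigned in the given region."""
    graphic, fmt = table[region]
    return graphic + fmt


bmp_diff = assigned(andrew, "bmp") - assigned(ken, "bmp")
supp_diff = assigned(andrew, "supplementary") - assigned(ken, "supplementary")

print(bmp_diff)   # -2
print(supp_diff)  # 2
```

    The reserved rows differ by the same amounts in the opposite direction (4974 versus 4972 in the BMP, 870467 versus 870469 in the supplementary planes), so both tables still sum to the full code space.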