Unicode 5.1 Character Count Statistics Projection

From: Kenneth Whistler (kenw@sybase.com)
Date: Tue Jun 05 2007 - 20:09:22 CDT

    A propos the discussion about 17 planes, UTF-16, and
    extraterrestrial characters, I have gone ahead and done
    the preliminary calculations on what we can expect in
    terms of numbers of characters for Unicode 5.1, now
    due sometime next spring, based on the current contents
    of Amendments 3 and 4 to 10646:2003.

    Comparing Unicode 5.0 and 5.1 for the main figures
    of concern:
                          5.0       5.1

    BMP characters      52013     53439
    SMP+ characters     47007     47315

    Total characters    99020    100754

    Total designated   238667    240401

    Total reserved     875445    873711

    "Characters" here refers to the sum of regular graphic
    characters and Unicode format controls, the "traditional"
    Unicode count.

    "Designated" also includes ISO control codes, noncharacters,
    private use characters, and the surrogate code points.

    "Reserved" is everything else -- the totally unassigned
    code points still available for encoding characters.
    As you can see, we have hardly made a dent in that figure.
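
The figures in the table can be cross-checked against the size of the 17-plane codespace. The sketch below is only a sanity check and is not part of the original post. It combines the table's numbers with the fixed counts for control codes, noncharacters, private use code points, and surrogates; those four counts are standard values that the post itself does not spell out, so treat them as assumptions of this example.

```python
# Sanity check of the table's figures against the 17-plane codespace.
CODESPACE = 17 * 0x10000  # 17 planes of 65,536 code points = 1,114,112

# Figures from the table: (BMP, supplementary, total, designated, reserved)
TABLE = {
    "5.0": (52013, 47007, 99020, 238667, 875445),
    "5.1": (53439, 47315, 100754, 240401, 873711),
}

# Fixed "designated" extras. These counts are NOT stated in the post.
CONTROLS = 65                   # C0 (32) + DEL (1) + C1 (32)
NONCHARACTERS = 66              # U+FDD0..U+FDEF plus the last two code points of each plane
PRIVATE_USE = 6400 + 2 * 65534  # BMP private use area + planes 15 and 16
SURROGATES = 2048               # U+D800..U+DFFF
EXTRAS = CONTROLS + NONCHARACTERS + PRIVATE_USE + SURROGATES


def consistent(bmp, supplementary, total, designated, reserved):
    """Return True if one table column adds up at every level."""
    return (
        bmp + supplementary == total
        and total + EXTRAS == designated
        and designated + reserved == CODESPACE
    )


for version, row in TABLE.items():
    print(version, consistent(*row))
```

Both columns pass all three checks, so the "reserved" figure really is what is left of the codespace once every designated code point is subtracted.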

    Also, to give you a concrete idea of the current character
    encoding "velocity", if you take the number of characters
    added since the last big anomalous jump in content
    (Extension B in 2001), and average it over the time from
    2001 to the anticipated release of Unicode 5.1 in 2008,
    the per annum character encoding rate for WG2 and the UTC
    is 944 characters/year (and trending down).
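
The rate quoted above can be approximated as follows. This is a hedged reconstruction, not the author's actual worksheet: the post does not give the starting count, so the sketch assumes the published Unicode 3.1 total of 94,140 characters (the release that added Extension B) as the 2001 baseline.

```python
# Rough reconstruction of the "velocity" figure.
# ASSUMPTION: 94,140 is the Unicode 3.1 (2001) character count; the post
# does not state which baseline was used.
CHARS_2001_BASELINE = 94140
CHARS_UNICODE_5_1 = 100754   # "Total characters" for 5.1, from the table
YEARS = 2008 - 2001          # span named in the post

rate = (CHARS_UNICODE_5_1 - CHARS_2001_BASELINE) / YEARS
print(f"{rate:.1f} characters/year")
```

Under that assumption the result lands between 944 and 945 characters per year, in line with the figure in the post.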

    Now we know that some large collections are still to
    go, particularly for the various East Asian ideographic
    collections. In addition to CJK Extensions C and D, there
    are also Old Hanzi (seal script, etc.), Tangut, and Khitan.
    And there are more Egyptian hieroglyphs and Sumerian
    cuneiform to go. Let's take some worst case scenarios
    and assume those all get done in 2008 and all come in
    on the large side:

    CJK Extension C: 4213
    CJK Extension D: 8000
    Old Hanzi: 8000
    Tangut: 5910
    Khitan: 5000
    Yi ideographs: 7000
    Egyptian basic: 1063
    Egyptian ext: 8000
    Cuneiform: 1000

    O.k., that's another 48,186 characters. Let's assign all these
    heavy hitters to allocations, and *then* assume that the
    WG2 and UTC committees will still find enough left over to
    keep plugging away at 1000 characters per year, indefinitely.
    How long have we got?

    (873,711 - 48,186) / 1000 = 825 years

    Oh dear, it looks like I underestimated before when I said it
    would take 800 years to fill the 17 planes.
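
The worst-case arithmetic above can be reproduced with a few lines. This is a minimal sketch using only the numbers given in the post; the variable names are illustrative.

```python
# Worst-case sizes for the pending large collections, as listed in the post.
pending = {
    "CJK Extension C": 4213,
    "CJK Extension D": 8000,
    "Old Hanzi": 8000,
    "Tangut": 5910,
    "Khitan": 5000,
    "Yi ideographs": 7000,
    "Egyptian basic": 1063,
    "Egyptian ext": 8000,
    "Cuneiform": 1000,
}

RESERVED_AFTER_5_1 = 873711  # "Total reserved" for Unicode 5.1
RATE_PER_YEAR = 1000         # assumed steady encoding rate from the post

total_pending = sum(pending.values())
years_left = (RESERVED_AFTER_5_1 - total_pending) / RATE_PER_YEAR

print(total_pending)      # characters in the hypothetical big batch
print(int(years_left))    # whole years until the reserved space is used up
```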

    Quick, someone get busy on contacting the Orionids!

