Re: Codespace Anxiety Redux (was: Re: Level of Unicode support required ...)

Date: Thu Nov 01 2007 - 22:40:46 CST

    Quoting Kenneth Whistler:

    > Once again, just in time for the holidays, the Unicode list
    > has come around again to one of its perennial favorite topics:
    > how 17 planes isn't enough codespace, how software will
    > break when we "inevitably" run out of codes for characters,
    > and what a shame it is to be stuck with such a limited
    > and architecturally flawed construct, given all the 30 bezillion
    > unencoded characters waiting to be encoded.
    >> <vunzndi at vfemail dot net> wrote:
    >> >>> There are advantages to UTF-8
    >> >>
    >> >> And many many more advantages to not breaking working code.
    >> >
    >> > And even more to making code hard to break, Y2K, et al.
    >> The 17-plane limit was determined on the basis that the scope of
    >> 10646/Unicode, to encode abstract text characters rather than specific
    >> instances of glyphs, would safely fit within such a limit. To this
    >> date, this has not been proven false.

    Actually, the original comments started with a discussion of the PUA
    being too small. It is taken as read that, at the speed Unicode works,
    we are safe for a few centuries; the point, however, was that setting
    such limits in the first place is not the best computing practice.

    Of course, being 'perennial', the topic has to come up from time to time.

    > Doug has this right, in my opinion.
    > Just yesterday, I posted the first full version of the Unicode
    > names list for early review of Unicode 5.1. My tools report
    > that as having 100,713 graphic and control characters -- including
    > the unlisted but obviously massive numbers of Han characters in
    > the standard.
    > So that's where we stand after 18 *years* of concerted effort,
    > by literally hundreds of people in the character encoding
    > field, to encode every reasonable character that anyone could
    > lay their hands on documentation for.

    > 18 years on, Egyptian hieroglyphs are in their last round
    > of ballotting and are close to getting into the standard.
    > That's 1071 characters, accounting for the basic Gardiner
    > set, some Gardiner extensions, and elements for numerals.
    > Sure there are more Egyptian hieroglyphs out there, but
    > at the rate the Egyptological community is going to move
    > on this, we are unlikely to see more than small extensions
    > of a few dozen here and there for some time to come. And
    > talk of needing a whole plane for Egyptian hieroglyphics
    > is basically Halloween harum-scarum talk.

    The Egyptian hieroglyphs block encodes components - a good way to
    avoid having too many characters.

    > CJK Extension C is also in its last round of ballotting.
    > That now includes 4149 characters -- which *is* a lot of
    > characters compared to most scripts. But the last big
    > chunk of Han that went in was CJK Extension B, 42,711
    > Han characters in March, 2001. What that means is that
    > it has taken the IRG and WG2 7 years to prepare the
    > next 4000 or so Han characters for encoding after
    > Extension B -- which had picked all the low-hanging
    > fruit from the big dictionaries. CJK Extension D will
    > probably show up in less time than Extension C did,
    > given IRG's use of better tools for cross-checking
    > submissions now, but still we are dealing with the difficult
    > long tail of CJK submissions, rather than lots and lots
    > of obvious missing characters.

    Extension C is a safe subset, a quarter of the characters submitted
    to the IRG in 2002. There are over 20 thousand characters in the IRG's
    pipeline, in Extension D and the yet-to-be-named Extension E.

    > Even after CJK Extension C is added to the
    > standard, there are still 16,694 code points on Plane 1
    > and the BMP reserved for CJK unified ideographs.
    > (4DB6..4DBF, 9FC6..9FFF, 2A6D7..2A6FF, and the big
    > chunk for new extensions: 2B735..2F7FF). I don't think
    > I'm going too far out on a limb to suggest that prospective
    > Extensions D and E will fit comfortably in the existing
    > space. It won't be until somebody gets the submissions
    > together for Zhuang sawndip that WG2 will need to crack
    > open the until now unused Plane 3 for Han characters.
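The reserved space quoted above can be tallied directly from the listed ranges. This is only a rough sketch: the range endpoints are copied as they appear in the quoted message, the helper name `range_size` is made up for illustration, and the total can land a code point or two away from the quoted 16,694 depending on exactly which endpoints are taken as the first free positions.

```python
# Reserved CJK ranges exactly as listed in the quoted message (inclusive).
RESERVED_RANGES = [
    (0x4DB6, 0x4DBF),
    (0x9FC6, 0x9FFF),
    (0x2A6D7, 0x2A6FF),
    (0x2B735, 0x2F7FF),
]


def range_size(start, end):
    """Number of code points in an inclusive range (hypothetical helper)."""
    return end - start + 1


sizes = [range_size(start, end) for start, end in RESERVED_RANGES]
total_reserved = sum(sizes)

for (start, end), size in zip(RESERVED_RANGES, sizes):
    print(f"U+{start:04X}..U+{end:04X}: {size} code points")
print("total reserved:", total_reserved)
```

Almost all of the space sits in the last range, which is why a pipeline of 20 thousand or so candidate characters is the figure to compare it against.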

    Glad you remembered that the Zhuang sawndip project is going forward.
    I expect it will get as far as the IRG stage in 2010, with somewhere
    between 5 and 10 thousand characters, based only on publication within
    China between 1989 and 2006.

    Before that, the China national library may get their IRG submission
    together, also in the 5-10 thousand range, which would be enough to
    start block F and put Zhuang sawndip into block G.

    Filling up the first half of Plane 3 will not be difficult; enough
    characters will be submitted to the IRG by 2015.

    > The other big historic ideographic scripts (Tangut, Jurchen, Khitan)
    > all fit comfortably within Plane 1, with plenty of room
    > to spare. We don't have an accurate count yet for old Yi
    > ideographs, but the unified character encoding for it
    > is likely to be a few 1000's, not in the 10's of thousands --
    > which is the number associated with the paleographic glyph
    > count, not actually distinct characters.

    Agreed. In the same way that it took 18 years to fill up most of the
    BMP, it will take about 18 years to fill up Plane 1.

    Which means that by the time Unicode gets to 40, the BMP and Planes 1, 2
    and 3 will be nearly full! Planes 15 and 16 are PUA, so that means 6 of
    17 planes in the first 40 years. Not bad.

    John Knightley

    > 0.0.
    >> Code that uses UTF-16, SCSU, or other encoding forms that assume the
    >> 17-plane limit are not broken, or break-prone, in the same sense as code
    >> written under the assumption it would be replaced or upgraded before the
    >> turn of the century.
    > Yep.
    > --Ken
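For readers wondering where the 17-plane figure in the quoted text comes from, here is a minimal sketch of the UTF-16 surrogate arithmetic. The surrogate ranges and the decoding formula are the standard ones; the function names `supplementary_count` and `decode_pair` are just illustrative.

```python
# UTF-16 surrogate ranges (inclusive) and the size of one plane.
HIGH_START, HIGH_END = 0xD800, 0xDBFF
LOW_START, LOW_END = 0xDC00, 0xDFFF
PLANE_SIZE = 0x10000


def supplementary_count():
    """Code points reachable through one high + one low surrogate."""
    highs = HIGH_END - HIGH_START + 1   # 1024 high surrogates
    lows = LOW_END - LOW_START + 1      # 1024 low surrogates
    return highs * lows


def decode_pair(high, low):
    """Map a surrogate pair to its supplementary code point."""
    return 0x10000 + ((high - HIGH_START) << 10) + (low - LOW_START)


# The BMP plus however many planes the surrogate pairs can address.
total_planes = 1 + supplementary_count() // PLANE_SIZE
max_code_point = decode_pair(HIGH_END, LOW_END)

print("planes:", total_planes)
print("highest code point:", hex(max_code_point))
```

The pairs address exactly 16 supplementary planes on top of the BMP, so code that assumes UTF-16 is relying on a ceiling that is part of the encoding form itself.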