Wasting Planes (was: RE: What is the principle?)

From: Kenneth Whistler (kenw@sybase.com)
Date: Wed Mar 31 2004 - 18:16:03 EST

    > Surely Unicode didn't waste two planes for something that
    > no one can practically use.

    Plane 15 and Plane 16 private use characters weren't the
    invention of the UTC, by the way. They derive from the
    original specification of ISO/IEC 10646-1. From
    ISO/IEC 10646-1: 1993:

    "The code positions of 32 planes from Plane E0 to Plane FF
    of Group 00 shall be for Private Use.

    "The code positions of the 32 groups from Group 60 to Group 7F
    shall be for Private Use."

    That would have been:

       U-00E00000..U-00FFFFFD
       U-60000000..U-7FFFFFFD
       
    That was 8224 *planes* of private use code positions.
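
    As a quick check of that plane count, here is a minimal Python
    sketch. The helper name planes_in_range is just illustrative;
    it assumes 256 planes per group and 65,536 code positions per
    plane, which is what the two ranges above imply.

```python
# Illustrative helper (not from the standard): count planes in a range.
PLANE_SIZE = 0x10000  # 65,536 code positions per plane

def planes_in_range(first: int, last: int) -> int:
    """Count the planes touched by the inclusive range first..last."""
    return (last // PLANE_SIZE) - (first // PLANE_SIZE) + 1

# Planes E0..FF of Group 00
group00_planes = planes_in_range(0x00E00000, 0x00FFFFFD)
# Groups 60..7F, at 256 planes per group
private_group_planes = planes_in_range(0x60000000, 0x7FFFFFFD)

total_1993 = group00_planes + private_group_planes
print(group00_planes, private_group_planes, total_1993)  # 32 8192 8224
```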

    Amendment 1 (the one that defined UTF-16) amended that to
    read:

    "The code positions of the 32 groups from Group 60 to
    Group 7F shall be for private use.

    "The code positions of Plane 0F and Plane 10, and of the
    32 planes from Plane E0 to Plane FF, of Group 00 shall
    be for private use.

    "The 6400 code positions E000 to F8FF of the Basic
    Multilingual Plane shall be for private use."

    That was 8226 *planes* of private use code positions,
    besides the 6400 code positions on the BMP (which had
    been defined earlier, but not spelled out in the same
    clause with the rest of the private use allocation).
    The addition of Plane 0F and Plane 10 was so there were
    some private use planes accessible via UTF-16.
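
    To illustrate why Plane 0F and Plane 10 are reachable via
    UTF-16 while Planes E0..FF are not, here is a small sketch of
    the standard surrogate-pair arithmetic. The function name
    to_surrogates is made up for this example.

```python
def to_surrogates(cp: int) -> tuple:
    """Return the (high, low) UTF-16 surrogate pair for a supplementary code point."""
    # UTF-16 surrogate pairs only cover U+10000..U+10FFFF (Planes 01..10).
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not representable as a UTF-16 surrogate pair")
    offset = cp - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return high, low

# First code position of Plane 0F and last private use position of Plane 10.
print([hex(u) for u in to_surrogates(0xF0000)])   # ['0xdb80', '0xdc00']
print([hex(u) for u in to_surrogates(0x10FFFD)])  # ['0xdbff', '0xdffd']

# A code position in Plane E0 of Group 00 is out of reach for UTF-16.
try:
    to_surrogates(0x00E00000)
except ValueError as err:
    print("Plane E0:", err)
```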

    In that grand proliferation of "wastage", 10646 allowed for
    539,089,084 private use code positions. That was a wee
    tad more than anyone actually needed to use, by the way.

    More recent amendments to 10646 have simply settled on
    the principle that *all* code positions beyond U-0010FFFF
    are reserved, leaving the 6400 private use code positions
    on the BMP, plus Plane 0F and Plane 10. In the grand scheme
    of things, that seems to be the Goldilocks solution -- not
    too small (6400) and not too big (539,089,084) -- but juuuust
    right (137,468).
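
    The two totals can be reproduced with a short Python sketch.
    It assumes, consistent with the ranges ending in FFFD, that
    the last two code positions of each plane (xxFFFE and xxFFFF)
    are not counted as private use; the variable names are only
    illustrative.

```python
PLANE_SIZE = 0x10000
# Assumption: xxFFFE and xxFFFF are excluded from every private use plane.
USABLE_PER_PLANE = PLANE_SIZE - 2          # 65,534
BMP_PUA = 0xF8FF - 0xE000 + 1              # 6400 positions, E000..F8FF

# Amendment 1: 32 groups of 256 planes, Planes E0..FF, plus Planes 0F and 10.
amendment1_planes = 32 * 256 + 32 + 2      # 8226
amendment1_total = amendment1_planes * USABLE_PER_PLANE + BMP_PUA

# Later amendments: only the BMP PUA plus Planes 0F and 10 remain.
current_total = 2 * USABLE_PER_PLANE + BMP_PUA

print(f"{amendment1_total:,}")  # 539,089,084
print(f"{current_total:,}")     # 137,468
```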

    There are people who have valid reasons for making use
    of Plane 0F or Plane 10 private use characters, by the
    way, but most of those reasons have to do with CJK. And
    the reason for that should be pretty obvious -- only
    CJK deals with the kind of character counts (multiple
    tens of thousands) that make the 6400 code points of
    the BMP PUA seem cramped. *Any* other unencoded script,
    with the possible exceptions of Egyptian hieroglyphs
    or Tangut ideographs, would fit into the BMP PUA with
    plenty of room to spare.

    --Ken


