Surrogate points

From: Hans Aberg (haberg@math.su.se)
Date: Tue Jan 25 2005 - 13:36:39 CST

    Should not the effectively empty Unicode code points, U+D800 to U+DFFF, as
    well as U+FFFE and U+FFFF, be filled with characters? The current
    construction gives the misleading impression that the Unicode character set
    and its character numbering have something to do with the UTF-16 encoding.

    From the point of view of UTF-16, such a change would mean that those
    numbers cannot be mapped to the same 16-bit numbers in the UTF-16 encoding.
    (One might put them at the top of the UTF-16 range.)

    One might also design a new set of encodings for k-bit words, following the
    scheme in UTF-8: A leading bit 0 indicates a single word, mapped to the
    identity. A leading bit 1 indicates a word that is part of a multiword. The
    leading word starts with a row of 1's followed by a 0, a unary value
    indicating the number of words in the multiword. The trailing words start
    with 10. The other available bits in the multiword are arbitrary;
    concatenated, they form a binary number, which is the encoded number,
    provided this is the shortest possible multiword expressing that number.
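
    The scheme above can be sketched in a few lines of Python. This is only an
    illustration under assumptions: the names payload_bits, ue_encode and
    ue_decode are made up for this sketch, words are represented as plain
    integers in a list, and a multiword is limited to at most k-1 words so
    that the leading word still has room for the terminating 0 bit.

```python
def payload_bits(k, n):
    """Value bits carried by an n-word sequence of k-bit words."""
    if n == 1:
        return k - 1                      # single word: 0 followed by k-1 bits
    # leading word: n ones, one zero, k-n-1 value bits;
    # each of the n-1 trailing words: prefix 10, then k-2 value bits
    return (k - n - 1) + (n - 1) * (k - 2)


def ue_encode(value, k):
    """Encode a non-negative integer as the shortest list of k-bit words."""
    if k < 4 or value < 0:
        raise ValueError("need k >= 4 and value >= 0")
    n = 1
    while value.bit_length() > payload_bits(k, n):
        n += 1
        if n > k - 1:
            raise ValueError("value too large for this word size")
    if n == 1:
        return [value]
    trail_bits = k - 2
    trail_mask = (1 << trail_bits) - 1
    words = []
    for _ in range(n - 1):                # fill trailing words, lowest bits first
        words.append((0b10 << trail_bits) | (value & trail_mask))
        value >>= trail_bits
    lead_prefix = ((1 << n) - 1) << (k - n)   # n ones, then a zero bit
    words.append(lead_prefix | value)
    words.reverse()
    return words


def ue_decode(words, k):
    """Decode one complete sequence; reject malformed or non-shortest forms."""
    first = words[0]
    n = 0
    mask = 1 << (k - 1)
    while mask and first & mask:          # count the leading 1 bits
        n += 1
        mask >>= 1
    if n == 0:
        n, value = 1, first               # single word, mapped to the identity
    elif n == 1 or n > k - 1:
        raise ValueError("not a valid leading word")
    else:
        value = first & ((1 << (k - n - 1)) - 1)
    if len(words) != n:
        raise ValueError("wrong number of words")
    trail_bits = k - 2
    for w in words[1:]:
        if w >> trail_bits != 0b10:
            raise ValueError("bad trailing word")
        value = (value << trail_bits) | (w & ((1 << trail_bits) - 1))
    if ue_encode(value, k) != list(words):
        raise ValueError("not the shortest form")
    return value
```

    With k = 8, ue_encode(0x20AC, 8) gives [0xE2, 0x82, 0xAC], the same bytes
    as UTF-8 for that number. With k = 16, ue_encode(0xD800, 16) gives
    [0xC003, 0x9800]. Note that the sketch applies none of the extra
    restrictions of present-day UTF-8 (the four-byte limit and the exclusion
    of U+D800 to U+DFFF).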

    Call, ad hoc, this encoding UE-k. Then UE-16 can hold 27 bits in a two-word
    multiword (13 in the leading word, 14 in the trailing word). UE-8 is the
    same as UTF-8. And UTF-32 is the same as UE-32.
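
    The capacity figures can be checked with a little arithmetic. The helper
    name payload_bits is an assumption of this sketch; it simply counts the
    bits left over after the prefixes described above.

```python
def payload_bits(k, n):
    """Value bits in an n-word multiword of k-bit words."""
    if n == 1:
        return k - 1                      # leading 0, then k-1 value bits
    # leading word keeps k-n-1 bits; each trailing word keeps k-2 bits
    return (k - n - 1) + (n - 1) * (k - 2)


ue16_two_word = payload_bits(16, 2)                       # 13 + 14
ue8_sizes = [payload_bits(8, n) for n in (1, 2, 3, 4)]    # UTF-8 sizes
ue32_single = payload_bits(32, 1)

print(ue16_two_word)   # 27
print(ue8_sizes)       # [7, 11, 16, 21]
print(ue32_single)     # 31
```

    A single UE-32 word thus covers every number below 2^31, stored as
    itself, which is why it coincides with UTF-32 over that range.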

    It would suffice to indicate the range currently in use, and a predicted
    limit for, say, the next hundred years. Then it really does not matter
    whether Unicode would ever use any points outside these ranges, as the
    encodings can easily handle that. This would also be the official range.
    Private characters can use whatever numbers their owners want outside these
    ranges, and Unicode need not worry about that.

      Hans Aberg



    This archive was generated by hypermail 2.1.5 : Tue Jan 25 2005 - 14:03:40 CST