Character numbers and encodings

From: Hans Aberg (haberg@math.su.se)
Date: Wed Feb 02 2005 - 11:57:39 CST

    Since some have expressed difficulty seeing how the Unicode code points
    (character numberings) can be separated from the encodings, I will indicate
    a way to extend UTF-16 to cover all non-negative integer values. The
    intellectual spin-off is to separate characters, character numberings, and
    encodings into independent structures. The intent is not to provide an
    efficient or practical implementation, or a suggestion for Unicode, even
    though one can see from this description how easily such changes could be
    made. (The procedure is similar to that of mangling identifiers in computer
    languages.)

    So first focus on the set of characters. Assume, to have something concrete
    to focus on, that each character has a unique identifier, freed from its
    current Unicode code point. It could be, say, its defining string, even
    though that would not entirely work for current Unicode. Then assign to each
    character a non-negative integer. It is irrelevant whether there is more
    than one numbering scheme, but focus on one at a time. It is also irrelevant
    which numbers are covered. So if we work with UTF-16, the surrogate points
    could be covered as well. Assume that, if it helps the thinking.
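
    As a minimal sketch of this separation, assume characters are identified by
    name strings and a numbering is just a mapping from identifier to
    non-negative integer. The names CHARACTERS, make_numbering and renumber are
    hypothetical, chosen only for illustration, and the second numbering
    deliberately lands in the UTF-16 surrogate range to show that the choice of
    numbers is arbitrary at this level.

```python
# Characters are identified by a defining string, not by a number.
# (Illustrative identifiers; any unique strings would do.)
CHARACTERS = [
    "LATIN CAPITAL LETTER A",
    "GREEK SMALL LETTER ALPHA",
    "CYRILLIC SMALL LETTER A",
]


def make_numbering(identifiers, start=0, step=1):
    """Assign each identifier a distinct non-negative integer."""
    return {name: start + i * step for i, name in enumerate(identifiers)}


# Two independent numberings of the same character set.
numbering_a = make_numbering(CHARACTERS)                        # 0, 1, 2
numbering_b = make_numbering(CHARACTERS, start=0xD800, step=0x100)


def renumber(numbers, src, dst):
    """Translate a sequence of character numbers from one numbering to another."""
    inverse = {number: name for name, number in src.items()}
    return [dst[inverse[number]] for number in numbers]
```

    Under these assumptions, renumber([0, 2], numbering_a, numbering_b) maps the
    same two characters to their numbers in the second scheme, without any
    encoding being involved.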

    Reserve one empty slot in UTF-16, and let it indicate that the multiword
    that follows is a UE-16, which I defined elsewhere (in the thread
    "Surrogate points"). UE-k covers (k-1)(k-2) bits. If one wants to retain
    the resynchronization property, instead use a modified UE'-16 where the
    leading word starts with "10..." instead of "11...". Then modify this
    encoding a bit further into a recursively defined UE''-k by saying that the
    highest encoded number indicates that the multiword that follows is a
    UE''-2k, where the 2k-bit words are written out as (say) two big-endian
    k-bit words. This process can clearly be extended to cover any non-negative
    number. Each non-negative number is encoded as the shortest multiword
    that represents it.
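
    The recursive step can be illustrated with a much simpler stand-in. The
    sketch below is not UE-16, UE'-16 or UE''-k (their definitions are in the
    other thread), it does not embed into UTF-16, and it does not preserve
    resynchronization. It only assumes plain k-bit words in which the all-ones
    value is the escape meaning "what follows uses words of twice the width,
    written as big-endian k-bit words". The function names are hypothetical.

```python
def split_words(value, width, k):
    """Write a width-bit value as width // k big-endian k-bit words."""
    mask = (1 << k) - 1
    count = width // k
    return [(value >> (k * (count - 1 - i))) & mask for i in range(count)]


def encode(n, k=16):
    """Encode a non-negative integer using the smallest word width that fits."""
    if n < 0:
        raise ValueError("only non-negative integers can be encoded")
    words = []
    width = k
    # The all-ones value at each width is reserved as the escape, so a width
    # can hold n directly only if n is strictly below it.
    while n >= (1 << width) - 1:
        words.extend(split_words((1 << width) - 1, width, k))  # emit escape
        width *= 2                                             # double the width
    words.extend(split_words(n, width, k))
    return words


def decode(words, k=16):
    """Decode one number; return (value, number of k-bit words consumed)."""
    width = k
    pos = 0
    while True:
        count = width // k
        if pos + count > len(words):
            raise ValueError("truncated input")
        value = 0
        for w in words[pos:pos + count]:
            value = (value << k) | w
        pos += count
        if value != (1 << width) - 1:
            return value, pos
        width *= 2  # escape seen: the next value uses doubled width
```

    Because the encoder always stops at the first width that fits, each number
    gets its shortest form under this scheme, and since the width can double
    indefinitely, every non-negative integer has an encoding.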

    So, from this theoretical point of view, it does not matter how the Unicode
    characters are assigned to non-negative integers. From the practical point
    of view, it might make implementation of programs easier if these numbers
    are kept close together. It might also help implementation to have the
    character numbers logically grouped, so that common character classes can
    easily be identified as, say, a small number of intervals. One could also
    easily use more than one character numbering, as encodings and numberings
    can freely be combined.
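
    To make the remark about intervals concrete, here is a small sketch that
    collapses the numbers assigned to a character class into closed intervals.
    The helper names and the sample numbers are made up for illustration and do
    not correspond to any actual Unicode assignment.

```python
def to_intervals(numbers):
    """Collapse a set of integers into sorted, closed (low, high) intervals."""
    intervals = []
    for n in sorted(set(numbers)):
        if intervals and n == intervals[-1][1] + 1:
            intervals[-1][1] = n          # extend the current run
        else:
            intervals.append([n, n])      # start a new run
    return [tuple(iv) for iv in intervals]


def in_class(n, intervals):
    """Test class membership by checking each interval in turn."""
    return any(low <= n <= high for low, high in intervals)


# The same four-member class under two hypothetical numberings.
grouped = to_intervals([10, 11, 12, 13])      # one interval
scattered = to_intervals([3, 40, 41, 900])    # three intervals
```

    With the grouped numbering the class test is a single range check, while
    the scattered numbering needs one check per interval.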

      Hans Aberg


