Character numbers and encodings

From: Hans Aberg (haberg@math.su.se)
Date: Wed Feb 02 2005 - 11:57:39 CST

Next message: Peter Kirk: "Re: Character numbers and encodings"

Previous message: D. Starner: "RE: Surrogate points"
Next in thread: Peter Kirk: "Re: Character numbers and encodings"
Reply: Peter Kirk: "Re: Character numbers and encodings"
Maybe reply: Hans Aberg: "Re: Character numbers and encodings"

Since some have expressed difficulty in seeing how the Unicode code points
(character numberings) could be separated from the encodings, I will indicate
a way to extend UTF-16 to cover all non-negative integer values. The
intellectual spin-off is to separate characters, character numberings and
encodings into independent structures. The intent is not to provide an
efficient or practical implementation or suggestion for Unicode, even though
one can see from this description how easily such changes could be made. (The
procedure is similar to that of mangling identifiers in computer languages.)

So first focus on the set of characters. Assume, to have something concrete
to work with, that each character has a unique identifier, freed from its
current Unicode code point. It could be, say, its defining string, even though
that would not entirely work for current Unicode. Then assign to each
character a non-negative integer. It is irrelevant if there is more than one
numbering scheme, but focus on one at a time. It is also irrelevant which
numbers are covered. So if we work with UTF-16, the surrogate points could be
covered as well. Assume that, if it helps the thinking.

Reserve one empty slot in UTF-16, and let it indicate that the multiword
that follows is a UE-16, which I defined elsewhere (in the thread
"Surrogate points"). UE-k covers (k-1)(k-2) bits, that is, 210 bits for
k = 16. If one wants to retain the resynchronization property, instead use a
modified UE'-16 where the leading word starts with "10..." instead of
"11...". Then modify this encoding a bit further into a recursively defined
UE''-k by saying that the highest encoded number indicates that the multiword
that follows is a UE''-2k, where the 2k-bit words are written out as (say)
two big-endian k-bit words. This process can clearly be extended to cover any
non-negative number. The encoding of each non-negative number is the shortest
such multiword.
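
The recursive escape step can be illustrated with a deliberately simplified
sketch. It is not UE''-k itself: the UE-k multiword format is defined in the
other thread and is not reproduced here, so this stand-in uses a single
fixed-width word at each level, and it does not have the resynchronization
property. What it keeps is the idea described above: the highest value at
width w is reserved to mean "a value of width 2w follows", each wide word is
written as big-endian k-bit words, and the encoder always picks the shortest
form. The function names and the default k = 16 are assumptions made for the
illustration.

```python
def split_words(value, width, k):
    """Write a width-bit value as width/k big-endian k-bit words."""
    mask = (1 << k) - 1
    return [(value >> shift) & mask for shift in range(width - k, -1, -k)]


def encode(n, k=16):
    """Encode a non-negative integer as a list of k-bit words.

    Simplified stand-in for the recursive scheme: at each width, the
    all-ones value is an escape meaning "retry at double the width".
    """
    if n < 0:
        raise ValueError("only non-negative integers can be encoded")
    out = []
    width = k
    while n >= (1 << width) - 1:
        # n does not fit below the escape value: emit the escape, widen.
        out += split_words((1 << width) - 1, width, k)
        width *= 2
    out += split_words(n, width, k)
    return out


def decode(words, k=16):
    """Decode one integer; return (value, number of k-bit words consumed)."""
    pos = 0
    width = k
    while True:
        count = width // k
        chunk = words[pos:pos + count]
        if len(chunk) < count:
            raise ValueError("truncated input")
        value = 0
        for w in chunk:
            value = (value << k) | w
        pos += count
        if value != (1 << width) - 1:
            return value, pos
        width *= 2  # escape seen: the next value is twice as wide


if __name__ == "__main__":
    for n in (0x41, 0xFFFF, 0x10FFFF, 2**40):
        print(hex(n), [hex(w) for w in encode(n)])
```

Small numbers cost one word, and every escape doubles the width, so no
non-negative integer is out of reach. Values that would have fitted at a
narrower width are simply never produced at a wider one, which is how the
"shortest multiword" rule shows up in this sketch.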

So, from this theoretical point of view, it does not matter how the Unicode
characters are assigned to non-negative integers. From the practical point
of view, it might make implementation of programs easier if these numbers
are kept close together. But it might also help implementation to have the
character numbers logically grouped, so that common character classes can
easily be identified as, say, a small number of intervals. One could also
easily use more than one character numbering, as encodings and numberings
can freely be combined.
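
As a small illustration of that last point, here is a sketch in which
characters, numberings and encodings are three independent pieces. Everything
in it is hypothetical: the identifier strings, the two numbering tables
(NUMBERING_A happens to reuse familiar code point values, NUMBERING_B is made
up), and the two integer encoders, which are ordinary fixed-width and
variable-length byte encodings rather than anything proposed in this thread.

```python
import struct

# Characters are known only by an identifier string (hypothetical names).
A = "LATIN CAPITAL LETTER A"
ALPHA = "GREEK SMALL LETTER ALPHA"

# Two independent numberings of the same characters.
NUMBERING_A = {A: 0x41, ALPHA: 0x3B1}
NUMBERING_B = {A: 0, ALPHA: 1}  # arbitrary, made up for the example


def encode_fixed32(n):
    """Encode an integer as four big-endian bytes."""
    return struct.pack(">I", n)


def encode_varint(n):
    """Encode an integer in 7-bit groups, low group first (LEB128 style)."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more groups follow
        else:
            out.append(byte)
            return bytes(out)


def encode_text(characters, numbering, encoder):
    """Map characters to numbers, then numbers to bytes."""
    return b"".join(encoder(numbering[c]) for c in characters)


if __name__ == "__main__":
    text = [A, ALPHA]
    for numbering in (NUMBERING_A, NUMBERING_B):
        for encoder in (encode_fixed32, encode_varint):
            print(encoder.__name__, encode_text(text, numbering, encoder).hex())
```

Any numbering can be paired with any encoder, because the encoder only ever
sees non-negative integers and the numbering only ever sees identifiers.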

Hans Aberg


This archive was generated by hypermail 2.1.5 : Wed Feb 02 2005 - 12:02:00 CST