Surrogate points

From: Hans Aberg (haberg@math.su.se)
Date: Tue Jan 25 2005 - 13:36:39 CST

Next message: Hans Aberg: "Re: Actually, this wasn't rhetorical"

Previous message: Lars Kristan: "RE: wchar_t (was RE: 32'nd bit & UTF-8)"
Next in thread: Markus Scherer: "Re: Surrogate points"
Reply: Markus Scherer: "Re: Surrogate points"
Maybe reply: Hans Aberg: "Re: Surrogate points"
Maybe reply: Rick McGowan: "Re: Surrogate points"
Maybe reply: Hans Aberg: "Re: Surrogate points"
Maybe reply: Hans Aberg: "RE: Surrogate points"
Maybe reply: Rick McGowan: "RE: Surrogate points"
Maybe reply: Hans Aberg: "Re: Surrogate points"
Maybe reply: Jon Hanna: "RE: Surrogate points"
Maybe reply: Hans Aberg: "Re: Surrogate points"
Maybe reply: Hans Aberg: "Re: Surrogate points"
Maybe reply: Peter Constable: "RE: Surrogate points"
Maybe reply: Hans Aberg: "Re: Surrogate points"
Maybe reply: Hans Aberg: "Re: Surrogate points"
Maybe reply: Hans Aberg: "Re: Surrogate points"
Maybe reply: Peter Constable: "RE: Surrogate points"

Should the effectively empty Unicode code points, U+D800 to U+DFFF, as well
as U+FFFE and U+FFFF, not be filled with characters? The current construction
gives the misleading impression that the Unicode character set and its
character numbering have something to do with the UTF-16 encoding.

From the point of view of UTF-16, such a change would mean that those numbers
can no longer be mapped to the same 16-bit numbers in the UTF-16 encoding.
(One might put them at the top of the UTF-16 range.)
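
For reference, the coupling in question can be seen in how UTF-16 is defined
today. The following is a minimal sketch of standard UTF-16 encoding of a
single code point; the function name utf16_encode is only an illustrative
choice. It shows that the 16-bit values D800 to DFFF are consumed as halves of
surrogate pairs, which is why the code points with the same numbers cannot
currently map to themselves.

```python
def utf16_encode(code_point):
    """Encode one code point as a list of 16-bit UTF-16 code units."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the range UTF-16 can represent")
    if 0xD800 <= code_point <= 0xDFFF:
        # These numbers are reserved: as 16-bit units they mean
        # "half of a surrogate pair", so they cannot also be characters.
        raise ValueError("surrogate code points have no UTF-16 encoding")
    if code_point < 0x10000:
        # Everything else in the first 64K maps to the identical 16-bit number.
        return [code_point]
    # Above 0xFFFF: split the 20-bit offset into two 10-bit halves.
    offset = code_point - 0x10000
    high = 0xD800 | (offset >> 10)
    low = 0xDC00 | (offset & 0x3FF)
    return [high, low]


if __name__ == "__main__":
    print([hex(u) for u in utf16_encode(0x1F600)])
```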

One might also design a new set of encodings for k-bit words, following the
scheme in UTF-8: A leading bit 0 indicates a single word, mapped to the
identity. A leading bit 1 indicates a word that is part of a multiword. The
leading word starts with a row of 1's followed by a 0, a unary value giving
the number of words in the multiword. The trailing words start with 10. The
other available bits in the multiword are arbitrary; concatenated, they form
a binary number, which is the encoded number, provided this is the shortest
possible multiword expressing that number.

Call this encoding, ad hoc, UE-k. Then UE-16 has the capacity to hold 27
bits in a two-word multiword. UE-8 is the same as UTF-8. And UE-32 is the
same as UTF-32.
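
The scheme can be sketched in a few lines of code. This is only an
illustration under my reading of the description above; the function names
payload_bits and ue_encode are hypothetical, and no range restrictions (such
as the upper limit U+10FFFF) are applied.

```python
def payload_bits(k, n):
    """Number of free bits in an n-word multiword of k-bit words."""
    if n == 1:
        return k - 1                      # leading 0, then k-1 payload bits
    # Leading word: n ones and a 0, leaving k-n-1 bits.
    # Each of the n-1 trailing words: prefix 10, leaving k-2 bits.
    return (k - n - 1) + (n - 1) * (k - 2)


def ue_encode(value, k):
    """Encode a non-negative integer as a list of k-bit words (shortest form)."""
    if k < 4:
        raise ValueError("word size too small for this sketch")
    if value < 0:
        raise ValueError("value must be non-negative")
    # Pick the smallest word count whose payload can hold the value.
    for n in range(1, k):
        if value < (1 << payload_bits(k, n)):
            break
    else:
        raise ValueError("value too large for UE-%d" % k)
    if n == 1:
        return [value]                    # single word, mapped to the identity
    cont_bits = k - 2
    cont_mask = (1 << cont_bits) - 1
    words = []
    # Fill trailing words from the least significant end.
    for _ in range(n - 1):
        words.append((0b10 << cont_bits) | (value & cont_mask))
        value >>= cont_bits
    # Leading word: n ones at the top, a 0, then the remaining payload bits.
    lead_prefix = ((1 << n) - 1) << (k - n)
    words.append(lead_prefix | value)
    words.reverse()
    return words


if __name__ == "__main__":
    print(payload_bits(16, 2))                       # two-word capacity of UE-16
    print([hex(w) for w in ue_encode(0x20AC, 8)])    # compare with UTF-8
    print([hex(w) for w in ue_encode(0x10FFFF, 16)])
```

With k = 8 the output words coincide with the UTF-8 bytes of the same code
point, and with k = 32 every value below 2^31 is a single word equal to itself.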

It would suffice to indicate the range currently in use, and a predicted limit
for, say, the next hundred years. Then it really does not matter whether
Unicode ever uses any points outside these ranges, as the encodings are
easily able to handle that. This would also be the official range. Private
characters can use whatever numbers their owners want outside these ranges,
and Unicode need not worry about that.

Hans Aberg

This archive was generated by hypermail 2.1.5 : Tue Jan 25 2005 - 14:03:40 CST