From: Philippe Verdy (email@example.com)
Date: Sat May 17 2003 - 14:29:19 EDT
> Philippe Verdy wrote on 05/15/2003 05:15:08 PM:
> My turn to beat up -- err, correct -- Philippe:
> > > One Unicode character, i.e., one Unicode Scalar Value.
> > More exactly one 21-bit codepoint.
> Strictly speaking, Unicode codepoints do not have a size in terms of bits.
> They are simply non-negative integers.
I don't think so: the agreement between Unicode and ISO/IEC 10646 makes a clear statement that only 17 planes will ever be allocated, and definitively fixes the range of allowed values for code points, so only 21 bits are needed to encode one. (An application using the UTF-32 encoding form could store characters internally in 3 bytes instead of 4 without breaking the model, and it's a shame that no UTF-24BE or UTF-24LE encoding scheme was clearly defined, because the high byte in the UTF-32 encoding schemes will always be 0!)
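No UTF-24 scheme was ever standardized, so the sketch below is purely illustrative (the names utf24be_encode/utf24be_decode are mine): it shows that, since every Unicode scalar value fits in 21 bits, the high byte of UTF-32BE is always 0x00 and can simply be dropped, giving 3 bytes per code point.

```python
# Hypothetical "UTF-24BE" encoding scheme: drop the always-zero high
# byte of UTF-32BE and serialize each code point as 3 big-endian bytes.
def utf24be_encode(cp: int) -> bytes:
    # Reject values outside the Unicode range and surrogate code points,
    # which are not scalar values.
    if not (0 <= cp <= 0x10FFFF) or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    return cp.to_bytes(3, "big")

def utf24be_decode(data: bytes) -> list[int]:
    if len(data) % 3:
        raise ValueError("truncated UTF-24BE stream")
    return [int.from_bytes(data[i:i + 3], "big")
            for i in range(0, len(data), 3)]
```

As a sanity check, `"A".encode("utf-32-be")` is `b'\x00\x00\x00A'`: the leading byte really is always zero.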
Future systems could as well use 21-bit storage code units directly, for text processing or in serial protocols, to save the extra unnecessary bits used in the UCS-4 or UTF-32 encoding forms.
Also, the encoding schemes are quite artificial: there are many cases where byte order is not really significant, because only bit order matters and there is no concrete "byte" grouping for transfers (look, for example, at the format of synchronous networking frames); transferred bits are only regrouped into bytes at an upper level of the networking spec, simply to map bits onto the minimum addressable memory units. Some systems have memory units that are NOT a multiple of 8 bits, and a 21-bit specification would allow a 7-bit system to use exactly three 7-bit memory cells to store a single code point; there could even exist machine instructions that handle three addressable cells in one operation. To store one 8-bit byte, such a system would need two memory cells anyway.
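The 7-bit-cell idea above can be sketched in a few lines (the machine is hypothetical; the helper names are mine): a 21-bit code point splits exactly into three 7-bit groups, with no wasted bits.

```python
# Pack one 21-bit code point into three 7-bit "memory cells" of a
# hypothetical 7-bit-addressable machine, most significant cell first.
def to_7bit_cells(cp: int) -> list[int]:
    assert 0 <= cp <= 0x10FFFF
    return [(cp >> 14) & 0x7F, (cp >> 7) & 0x7F, cp & 0x7F]

def from_7bit_cells(cells: list[int]) -> int:
    hi, mid, lo = cells
    return (hi << 14) | (mid << 7) | lo
```

Three 7-bit cells give 21 bits exactly, whereas storing a single 8-bit byte on the same machine would already need two cells.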
So what Unicode defines as "code units" assumes a processing architecture, which may not map easily onto other possible processing models in the future (notably if a new processing technology uses ternary logic).
I do agree that saying Unicode only uses 21 bits is too restrictive, and assumes a binary processing model too. So it may be clearer to say that Unicode uses code points in a well-defined range [0..0x10FFFF], and nothing more. For compatibility reasons, this model will not be extensible without defining a new concept distinct from code points.
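To make the distinction concrete: the range is defined in terms of integers, and the "21 bits" figure is merely a consequence of it on binary hardware.

```python
# The Unicode code point range is [0..0x10FFFF], i.e. 1,114,112 values;
# 21 bits happen to suffice on a binary machine, but the definition
# itself is purely a range of integers.
assert 0x10FFFF == 1_114_111
assert (0x10FFFF).bit_length() == 21
```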
For example, "code points" could be considered (in a later specification) as a canonically decomposed representation of "abstract characters", which could be better described by a unique integer in a larger range, with these integers named differently, such as "abstract code points". The current mapping between abstract characters and code points that defines Unicode could then evolve to a better model with an additional abstraction level.
This archive was generated by hypermail 2.1.5 : Sat May 17 2003 - 15:09:40 EDT