Re: An Absurdly Brief Introduction to Unicode (was Re: Perception ...)

From: Paul Keinanen
Date: Fri Feb 23 2001 - 02:59:01 EST

On Thu, 22 Feb 2001 11:51:31 -0800 (GMT-0800), Markus Scherer wrote:

>Tom Lord wrote:
>> Two code points represent non-characters. These are U+FFFE and
>> U+FFFF. Programs are free to give these values special meaning
>> internally.
>Unicode (2.0 and up?) has 34 non-characters at U+xxFFFE and U+xxFFFF where xx is 00, 01, .., 0F, 10.
>Unicode 3.1 is adding another 32 non-characters on the BMP. See UTR 27 for details.
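
The counts quoted above can be checked with a short Python sketch.
The quoted text does not name the 32 additional BMP code points; the
range U+FDD0 .. U+FDEF used here is an assumption taken from the
published standard, and the helper name is_noncharacter is
hypothetical.

```python
# Noncharacters at the end of every plane: U+xxFFFE and U+xxFFFF,
# where xx runs over the 17 planes 00 .. 10 (hex).
plane_end = [(plane << 16) | low
             for plane in range(0x11)
             for low in (0xFFFE, 0xFFFF)]

# The 32 additional BMP noncharacters. The range is an assumption
# (not stated in the quoted message): U+FDD0 .. U+FDEF.
bmp_block = list(range(0xFDD0, 0xFDF0))


def is_noncharacter(cp):
    """Return True if cp is one of the noncharacter code points."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    return (cp & 0xFFFE) == 0xFFFE or 0xFDD0 <= cp <= 0xFDEF


print(len(plane_end), len(bmp_block))  # 34 32
```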

I think it is time to declare the range U+D800 .. U+DFFF as
reserved non-characters and to strictly limit any references to
surrogate pairs to the context of the UTF-16 transformation format.
This would reduce the confusion surrounding surrogates.
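
To illustrate why surrogates only have meaning inside UTF-16, here is
a minimal Python sketch of the pair arithmetic. The function names
are hypothetical; the constants are the usual UTF-16 ones.

```python
def to_surrogate_pair(cp):
    """Split a supplementary code point into UTF-16 code units."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only U+10000 .. U+10FFFF use surrogate pairs")
    offset = cp - 0x10000             # a 20-bit value
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return high, low


def from_surrogate_pair(high, low):
    """Recombine a high/low surrogate pair into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogate_pair(0x10400)
print(hex(high), hex(low))  # 0xd801 0xdc00
```

Outside this transformation the values D800 .. DFFF never stand for
characters on their own, which is the point of treating the range as
reserved.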

If someone chose UTF-16 as their internal representation, then that is
their problem, and the UTF-16 peculiarities should not clutter
discussions elsewhere. In my opinion, the situation is completely
analogous to using UTF-8 as the internal representation. UTF-8 also
has illegal sequences and gaps in its byte ranges (due to the rule
forbidding non-shortest sequences), just as UTF-16 has its surrogate
pairs.
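
The analogy can be seen with Python's built-in strict decoders. This
is only a sketch: is_well_formed is a hypothetical helper, and the
byte strings are hand-picked examples.

```python
def is_well_formed(data, encoding):
    """Return True if data decodes under strict error handling."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


# UTF-8: C0 AF is a non-shortest (overlong) encoding of U+002F '/'.
print(is_well_formed(b"\xc0\xaf", "utf-8"))                  # False
print(is_well_formed(b"/", "utf-8"))                         # True

# UTF-8: ED A0 80 would encode the surrogate U+D800 directly.
print(is_well_formed(b"\xed\xa0\x80", "utf-8"))              # False

# UTF-16 (big-endian): a high surrogate followed by 'A', not by a
# low surrogate, versus a properly paired U+10400.
print(is_well_formed(b"\xd8\x00\x00\x41", "utf-16-be"))      # False
print(is_well_formed(b"\xd8\x01\xdc\x00", "utf-16-be"))      # True
```

In both encoding forms the ill-formed cases are a property of the
encoding, not of the character set itself.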

As for how to describe Unicode to the public, I think it is best to
say that it can encode more than a million characters, of which about
100000 (in 3.1) are in use. It is better to defer any discussion of
transformation formats to a much later stage.
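
The "more than a million" figure follows from simple arithmetic,
sketched here in Python. The value 100000 is the rough figure from
the paragraph above, not an exact count.

```python
PLANES = 17                                # planes 00 .. 10 (hex)
CODE_POINTS = PLANES * 0x10000             # U+0000 .. U+10FFFF
SURROGATES = 0xDFFF - 0xD800 + 1           # reserved for UTF-16
ENCODABLE = CODE_POINTS - SURROGATES       # usable for characters

APPROX_USED = 100_000                      # rough figure, not exact

print(CODE_POINTS)                         # 1114112
print(SURROGATES)                          # 2048
print(ENCODABLE)                           # 1112064
print(f"{APPROX_USED / ENCODABLE:.1%}")    # 9.0%
```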

