Re: An Aburdly Brief Introduction to Unicode (was Re: Perception ...)

From: Paul Keinanen (keinanen@sci.fi)
Date: Fri Feb 23 2001 - 02:59:01 EST

Next message: Marco Cimarosti: "RE: An Aburdly Brief Introduction to Unicode (was Re: Perception ...)"
Previous message: Joel Rees: "Re: fictional scripts revisited"
Maybe in reply to: Tom Lord: "An Aburdly Brief Introduction to Unicode (was Re: Perception ...)"
Next in thread: Mark Davis: "Re: An Aburdly Brief Introduction to Unicode (was Re: Perception ...)"

On Thu, 22 Feb 2001 11:51:31 -0800 (GMT-0800), Markus Scherer
<markus.scherer@jtcsv.com> wrote:

>Tom Lord wrote:
>> Two code points represent non-characters. These are U+FFFE and
>> U+FFFF. Programs are free to give these values special meaning
>> internally.
>
>Unicode (2.0 and up?) has 34 non-characters at U+xxFFFE and U+xxFFFF where xx is 00, 01, .., 0F, 10.
>Unicode 3.1 is adding another 32 non-characters on the BMP. See UTR 27 for details.

I think it is time to declare the range U+D800 .. U+DFFF as
reserved non-characters and to strictly limit any reference to surrogate
pairs to the context of the UTF-16 transformation format. This would
reduce the confusion surrounding surrogates.
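
To illustrate the separation I have in mind, here is a minimal Python
sketch. All function names are made up for illustration, and the
U+FDD0 .. U+FDEF range for the 32 BMP additions comes from the Unicode
3.1 text rather than from this thread. Surrogate values only ever
appear as output of the UTF-16 step, never as valid input code points.

```python
# Hypothetical helpers for illustration only; not part of any standard API.

MAX_CODE_POINT = 0x10FFFF


def is_surrogate_code_point(cp: int) -> bool:
    """True for values in the range U+D800 .. U+DFFF."""
    return 0xD800 <= cp <= 0xDFFF


def is_noncharacter(cp: int) -> bool:
    """True for U+xxFFFE / U+xxFFFF (xx = 00 .. 10) and U+FDD0 .. U+FDEF."""
    if not 0 <= cp <= MAX_CODE_POINT:
        return False
    if (cp & 0xFFFE) == 0xFFFE:   # last two code points of every plane
        return True
    return 0xFDD0 <= cp <= 0xFDEF  # assumed range of the 3.1 additions


def to_utf16_units(cp: int) -> list[int]:
    """Encode one code point as UTF-16 code units.

    Surrogates are produced here and nowhere else; they are rejected
    as input because they do not stand for characters themselves.
    """
    if not 0 <= cp <= MAX_CODE_POINT or is_surrogate_code_point(cp):
        raise ValueError(f"not an encodable code point: {cp:#x}")
    if cp < 0x10000:
        return [cp]
    v = cp - 0x10000               # 20 bits, split into two halves of 10
    return [0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)]
```

With this split, code that works on code points never has to mention
surrogates at all; only the UTF-16 encoder and decoder do.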

If someone chooses UTF-16 as their internal representation, then that
is their problem, and the UTF-16 peculiarities should not clutter the
discussion elsewhere. In my opinion, the situation is completely
analogous to using UTF-8 as the internal representation. UTF-8 also
has illegal sequences and gaps in its byte ranges (due to the
non-shortest sequence rule), just as UTF-16 has its surrogate pairs.
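
The UTF-8 side of the analogy can be checked with a strict decoder. The
sketch below uses the "utf-8" codec of present-day Python 3, so it shows
how one modern implementation behaves and says nothing about any
particular decoder from 2001. The helper name decodes_as_utf8 is made up
for illustration.

```python
def decodes_as_utf8(data: bytes) -> bool:
    """Return True if a strict UTF-8 decoder accepts the byte sequence."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Shortest form of U+002F is accepted.
print(decodes_as_utf8(b"\x2f"))

# Two-byte and three-byte non-shortest forms of U+002F are rejected.
print(decodes_as_utf8(b"\xc0\xaf"))
print(decodes_as_utf8(b"\xe0\x80\xaf"))

# A surrogate value (U+D800) pushed through the three-byte pattern
# is rejected as well.
print(decodes_as_utf8(b"\xed\xa0\x80"))

# U+10000, which needs a surrogate pair in UTF-16, is a plain
# four-byte sequence in UTF-8.
print(decodes_as_utf8(b"\xf0\x90\x80\x80"))
```

In both encoding forms the forbidden patterns are a property of the
byte or code unit layer, not of the character set.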

As for how to describe Unicode to the public, I think it is best to
say that it can encode more than a million characters, of which about
100000 (in 3.1) are used. It is better to defer any discussion of
transformation formats to a much later stage.
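
For anyone who wants to check the "more than a million" figure, the
arithmetic can be sketched in a few lines of Python. The variable names
are only illustrative, and the calculation assumes the 17 planes
(00 .. 10) mentioned in the quoted text.

```python
PLANES = 17                      # planes 00 .. 10
CODE_POINTS_PER_PLANE = 0x10000  # 65,536 code points in each plane

total_code_points = PLANES * CODE_POINTS_PER_PLANE

# U+D800 .. U+DFFF cannot stand for characters on their own.
surrogate_code_points = 0xDFFF - 0xD800 + 1

encodable = total_code_points - surrogate_code_points

print(f"{total_code_points:,} code points in total")
print(f"{surrogate_code_points:,} surrogate code points")
print(f"{encodable:,} code points left for everything else")
```

Either total is comfortably "more than a million", which is all the
public needs to hear.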

Paul

This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:21:19 EDT