Re: Surrogate points

From: Peter Kirk (peterkirk@qaya.org)
Date: Mon Jan 31 2005 - 05:09:47 CST

Next message: Hans Aberg: "Re: Surrogate points"

Previous message: Eric Carwardine: "Arabic and HTML"
In reply to: Doug Ewell: "Re: Surrogate points"
Next in thread: Peter Constable: "RE: Surrogate points"

On 30/01/2005 22:18, Doug Ewell wrote:

>Hans Aberg <haberg at math dot su dot se> wrote:
>
>>The numbers 0xD800-0xDFFF, 0xFFFE-0xFFFF are not associated with
>>character, but included as place holders, never to be used, because
>>one has failed to give the encoding UTF-16 a proper design. So an
>>unrelated problem, choice of character encoding, is allowed to
>>influence the logical core, the character set description.
>
>...
>
>In any case, it is incorrect to state that the choice of this block was
>due to "failure to given UTF-16 a proper design." Other blocks, such as
>the "obvious" 0xF800 through 0xFFFF, were already occupied.
>
Doug, I think you have missed Hans' point, which is surely that if
Unicode had been designed from the start as a 21-bit space or whatever,
it is unlikely that this surrogate pair mechanism would have been used
to encode characters beyond the first 64K, and there would not have
been a need to reserve this large block of code points. Instead, I would
guess that a mechanism more like UTF-8 would have been introduced. In
such an alternative to UTF-16, perhaps every character from U+C000
upwards would have been represented as a pair of 16-bit code units, the
first with the top three bits 110 and the second with the top three bits
111, leaving 13 bits in each unit, 26 bits in all, for indicating a
character. But this kind of mechanism could not be introduced after the
fact, following the late decision to extend Unicode from 16 bits to 21
bits, because of the need or decision to remain compatible with the
existing UCS-2 encodings of some characters in your "R-zone".
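
To make the hypothetical scheme concrete, here is a minimal sketch in
Python. It is an illustration only, not anything that was ever proposed
or standardized. The function names are invented here, and putting the
high 13 bits in the first unit and the low 13 bits in the second is an
assumption; the description above only fixes the 110 and 111 tags.

```python
# Hypothetical 16-bit encoding form (NOT real UTF-16): an illustration of
# the "110 lead / 111 trail" idea. All names here are made up for the sketch.

LEAD_TAG = 0xC000      # top three bits 110
TRAIL_TAG = 0xE000     # top three bits 111
PAYLOAD_MASK = 0x1FFF  # low 13 bits of a code unit
DIRECT_LIMIT = 0xC000  # values below this fit in a single untagged unit


def encode_hypothetical(cp: int) -> list[int]:
    """Encode one code point as one or two 16-bit code units."""
    if not 0 <= cp < (1 << 26):
        raise ValueError("only 26 payload bits are available")
    if cp < DIRECT_LIMIT:
        return [cp]
    # Assumption: high 13 bits go in the lead unit, low 13 in the trail.
    return [LEAD_TAG | (cp >> 13), TRAIL_TAG | (cp & PAYLOAD_MASK)]


def decode_hypothetical(units: list[int]) -> list[int]:
    """Decode a sequence of 16-bit code units back to code points."""
    out = []
    i = 0
    while i < len(units):
        unit = units[i]
        if unit < DIRECT_LIMIT:
            out.append(unit)
            i += 1
        elif unit >> 13 == 0b110:
            if i + 1 >= len(units) or units[i + 1] >> 13 != 0b111:
                raise ValueError("lead unit without a trail unit")
            cp = ((unit & PAYLOAD_MASK) << 13) | (units[i + 1] & PAYLOAD_MASK)
            if cp < DIRECT_LIMIT:
                raise ValueError("overlong pair")
            out.append(cp)
            i += 2
        else:
            raise ValueError("trail unit without a lead unit")
    return out
```

In this sketch no code points need to be set aside, because the tags
live in the code units and not in the character set. The price is that
U+C000 through U+FFFF no longer encode as themselves, which is exactly
the incompatibility with existing 16-bit data mentioned above.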

So, Hans, all of this is theoretical as Doug has made clear. Even if we
can all agree post facto on an improved encoding, there is far too much
investment in UTF-16 for it ever to be changed. And UTF-16, which cannot
be deprecated, requires these code points to be reserved. But there is
no shortage of code points, so what's the problem?
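
For comparison, the surrogate arithmetic that UTF-16 actually uses can
be sketched as follows. The arithmetic is the standard one; the function
name is just a label chosen for this example.

```python
def utf16_units(cp: int) -> list[int]:
    """Return the UTF-16 code units for one Unicode code point."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("outside the Unicode code space")
    if 0xD800 <= cp <= 0xDFFF:
        # These values are needed as code units for pairs, so they
        # cannot also stand for characters: hence the reserved block.
        raise ValueError("surrogate code points cannot be encoded")
    if cp < 0x10000:
        return [cp]
    v = cp - 0x10000  # 20 bits remain to be split across two units
    return [0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF)]
```

Because every 16-bit value outside 0xD800-0xDFFF keeps its original
meaning, existing 16-bit data stays valid, and the cost is the 2048
reserved code points.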

-- 
Peter Kirk
peter@qaya.org (personal)
peterkirk@qaya.org (work)
http://www.qaya.org/

