Re: Surrogate points

From: Peter Kirk (peterkirk@qaya.org)
Date: Mon Jan 31 2005 - 05:09:47 CST

    On 30/01/2005 22:18, Doug Ewell wrote:

    >Hans Aberg <haberg at math dot su dot se> wrote:
    >
    >>The numbers 0xD800-0xDFFF, 0xFFFE-0xFFFF are not associated with
    >>character, but included as place holders, never to be used, because
    >>one has failed to give the encoding UTF-16 a proper design. So an
    >>unrelated problem, choice of character encoding, is allowed to
    >>influence the logical core, the character set description.
    >>
    >
    >...
    >
    >In any case, it is incorrect to state that the choice of this block was
    >due to "failure to give UTF-16 a proper design." Other blocks, such as
    >the "obvious" 0xF800 through 0xFFFF, were already occupied.
    >
    Doug, I think you have missed Hans' point, which is surely that if
    Unicode had been designed from the start as a 21-bit space or whatever,
    it is unlikely that this surrogate pair mechanism would have been used
    to encode characters beyond the first 64K, and there would not have
    been a need to reserve this large block of code points. Instead I would
    guess that a mechanism more like UTF-8 would have been introduced, in
    which perhaps every character at or above U+C000 would have been
    represented in an alternative to UTF-16 as a pair of 16-bit code units,
    the first with the top three bits 110 and the second with the top three
    bits 111 - leaving 13 bits in each unit, 26 bits in all, for indicating
    a character. But this kind of mechanism could not
    be introduced after the fact, after a late decision to extend Unicode
    from 16 bits to 21 bits, because of the need or decision to remain
    compatible with existing UCS-2 encodings of some characters in your
    "R-zone".
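
    To make the guess above concrete, here is a minimal sketch of such a
    two-unit scheme. It is purely hypothetical: it is not UTF-16 or any
    real standard, the names encode, decode, LEAD_TAG and TRAIL_TAG are
    invented for illustration, and treating U+C000 itself as needing a
    pair is an assumption (its 16-bit value would otherwise collide with
    a lead unit).

```python
# Hypothetical 16-bit encoding, for illustration only (not UTF-16).
LEAD_TAG = 0b110 << 13    # 0xC000: lead units occupy 0xC000-0xDFFF
TRAIL_TAG = 0b111 << 13   # 0xE000: trail units occupy 0xE000-0xFFFF
PAYLOAD_MASK = 0x1FFF     # 13 payload bits per unit, 26 bits per pair


def encode(cp):
    """Return the list of 16-bit units for one code point."""
    if cp < 0 or cp >= 1 << 26:
        raise ValueError("code point outside the 26-bit range")
    if cp < 0xC000:
        return [cp]                      # single unit, value unchanged
    return [LEAD_TAG | (cp >> 13), TRAIL_TAG | (cp & PAYLOAD_MASK)]


def decode(units):
    """Return the list of code points for a sequence of 16-bit units."""
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        if u < 0xC000:                   # single-unit character
            out.append(u)
            i += 1
        elif u < 0xE000:                 # lead unit: a trail must follow
            if i + 1 >= len(units) or units[i + 1] < 0xE000:
                raise ValueError("lead unit without trail unit")
            out.append(((u & PAYLOAD_MASK) << 13) | (units[i + 1] & PAYLOAD_MASK))
            i += 2
        else:
            raise ValueError("unpaired trail unit")
    return out
```

    Under these assumptions U+0041 stays a single unit, while U+10FFFF
    becomes the pair 0xC087 0xFFFF. Note that a quarter of the 16-bit
    unit values (0xC000-0xFFFF) are given over to leads and trails, which
    is exactly the part that could not be retrofitted.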

    So, Hans, all of this is theoretical as Doug has made clear. Even if we
    can all agree post facto on an improved encoding, there is far too much
    investment in UTF-16 for it ever to be changed. And UTF-16, which cannot
    be deprecated, requires these code points to be reserved. But there is
    no shortage of code points, so what's the problem?
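
    For comparison, a rough sketch of what UTF-16 actually does with the
    reserved block, and of how small that block is relative to the whole
    code space. The function name to_surrogates and the constant names
    are made up for this example; the arithmetic is the standard
    surrogate-pair calculation.

```python
# UTF-16 surrogate pairs: lead units 0xD800-0xDBFF, trail units 0xDC00-0xDFFF.
def to_surrogates(cp):
    """Split a supplementary code point into (lead, trail) surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF use surrogate pairs")
    v = cp - 0x10000                     # 20-bit value
    return 0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF)


TOTAL_CODE_POINTS = 0x110000             # U+0000..U+10FFFF
RESERVED_SURROGATES = 0xDFFF - 0xD800 + 1

print(RESERVED_SURROGATES, TOTAL_CODE_POINTS)
print(f"{RESERVED_SURROGATES / TOTAL_CODE_POINTS:.2%} of the code space")
```

    That is 2048 reserved values out of 1114112, well under one percent.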

    -- 
    Peter Kirk
    peter@qaya.org (personal)
    peterkirk@qaya.org (work)
    http://www.qaya.org/
    


    This archive was generated by hypermail 2.1.5 : Mon Jan 31 2005 - 10:45:28 CST