Re: Surrogate points

From: Hans Aberg (haberg@math.su.se)
Date: Fri Jan 28 2005 - 11:53:08 CST


    At 20:21 +0100 2005/01/27, Philippe Verdy wrote:
    >> Their semantic interpretation is the same as that of empty slots,
    >> though promised not to be filled.
    >
    >Wrong. "Empty slots" (unallocated character) are not illegal in
    >conforming documents (because these documents may have been created in
    >a later version of Unicode/ISO/IEC 10646, where these characters will
    >have been allocated).
    >
    >So an application should not reject unallocated characters as if they
    >were invalid (the application can still display a "missing" glyph to
    >warn the user that the character is not known).
    >
    >But an application that "sees" a surrogate code point should treat it
    >as an error in the stream of code units or the stream of bytes in any
    >encoding scheme that would represent this codepoint.

    You have a point in that these slots should currently be treated as
    though not belonging to the Unicode domain. So the suggestion is then to
    (semantically) move them into the valid Unicode range, though promised to
    remain empty until it is pragmatic to fill them.

    Otherwise programs should do whatever they find prudent. For example, a
    debugging tool might find it important to display these as non-errors,
    relative to the debugging program, that is. The character part of Unicode
    should only focus on defining the notion of well-formed characters and
    character strings.
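
    To make the distinction under discussion concrete, here is a minimal
    sketch in Python. The names classify and is_well_formed are hypothetical,
    chosen only for illustration. It assumes the Unicode database bundled with
    the Python interpreter, so which code points count as unassigned depends
    on that version. It follows the current rules quoted above, not the
    proposed change: unassigned code points are tolerated, surrogates are not.

```python
import unicodedata

SURROGATE_FIRST, SURROGATE_LAST = 0xD800, 0xDFFF
MAX_CODE_POINT = 0x10FFFF

def classify(cp):
    """Classify an integer code point (hypothetical helper, for illustration)."""
    if not 0 <= cp <= MAX_CODE_POINT:
        return "out-of-range"
    if SURROGATE_FIRST <= cp <= SURROGATE_LAST:
        return "surrogate"
    # General category "Cn" covers code points with no assigned character
    # in the interpreter's Unicode database (it also includes noncharacters).
    if unicodedata.category(chr(cp)) == "Cn":
        return "unassigned"
    return "assigned"

def is_well_formed(code_points):
    """Accept unassigned code points, reject surrogates and out-of-range values."""
    return all(classify(cp) not in ("surrogate", "out-of-range")
               for cp in code_points)
```

    A debugging tool could call classify and simply display the result,
    while a strict consumer would call is_well_formed and reject the input.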

    >Note that UTF-16 (encoding form), or UTF-16/UTF-16BE/UTF-16LE (encoding
    >schemes) do not allow encoding surrogate codepoints. They only allow
    >using pairs of surrogate code units, to encode another non-surrogate
    >codepoint.

    That is so in their current form; they would need to be formally altered.
    There is a conceptual problem in tying UTF-16 to the Unicode code point
    range. A change clearly separating these two distinct entities, character
    set and encoding, might pragmatically call for leaving these slots empty
    for a designated period of time. But it would lessen the confusion.
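
    The coupling between UTF-16 and the code point range can be seen in a
    short sketch of the encoding form as it stands today. The function name
    to_utf16_units is hypothetical; the arithmetic is the standard surrogate
    pair computation. A supplementary code point is split into two surrogate
    code units, which is exactly why a surrogate code point itself has no
    representation.

```python
def to_utf16_units(cp):
    """Return the UTF-16 code units for one code point (illustrative sketch)."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if 0xD800 <= cp <= 0xDFFF:
        # The values D800..DFFF are reserved for use as code units in pairs,
        # so a code point in this range cannot itself be encoded.
        raise ValueError("surrogate code points cannot be encoded in UTF-16")
    if cp < 0x10000:
        return [cp]                      # BMP: one code unit
    v = cp - 0x10000                     # 20-bit offset into the supplementary planes
    high = 0xD800 + (v >> 10)            # top 10 bits go into the high surrogate
    low = 0xDC00 + (v & 0x3FF)           # bottom 10 bits go into the low surrogate
    return [high, low]
```

    For example, to_utf16_units(0x1D11E) gives the pair 0xD834, 0xDD1E, while
    to_utf16_units(0xD800) raises an error. Python's own codec behaves the
    same way: "\ud800".encode("utf-16-le") raises UnicodeEncodeError.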

      Hans Aberg



    This archive was generated by hypermail 2.1.5 : Fri Jan 28 2005 - 12:01:08 CST