Re: Surrogate points

From: Hans Aberg (haberg@math.su.se)
Date: Fri Jan 28 2005 - 11:53:08 CST

Next message: Jörg Knappen: "Re: [africa] Re: The Yoruba under-diacritic"

Previous message: Jon Hanna: "RE: [Humor] Hey, now here's a silly thought"
Maybe in reply to: Hans Aberg: "Surrogate points"
Next in thread: Jon Hanna: "RE: Surrogate points"
Reply: Jon Hanna: "RE: Surrogate points"

At 20:21 +0100 2005/01/27, Philippe Verdy wrote:
>> Their semantic interpretation is the same as that of empty slots,
>> though promised not to be filled.
>
>Wrong. "Empty slots" (unallocated character) are not illegal in
>conforming documents (because these documents may have been created in
>a later version of Unicode/ISO/IEC 10646, where these characters will
>have been allocated).
>
>So an application should not reject unallocated characters as if they
>were invalid (the application can still display a "missing" glyph to
>warn the user that the character is not known).
>
>But an application that "sees" a surrogate code point should treat it
>as an error in the stream of code units or the stream of bytes in any
>encoding scheme that would represent this codepoint.

You have a point in that these slots should currently be treated as
though not belonging to the Unicode domain. So the suggestion is then to
(semantically) move them into the valid Unicode range, though promised to
remain empty until it is pragmatic to fill them.

Otherwise programs should do whatever they find prudent. For example, a
debugging tool might find it important to display these as non-errors,
relative to the debugging program, that is. The character part of Unicode
should only focus on defining the notion of well-formed characters and
character strings.
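
As a rough illustration of the distinction under discussion (unassigned
code points are acceptable, surrogate code points are errors, and a
debugging tool may still want to look at them), here is a minimal Python
sketch. The helper names classify, strict_utf8 and debug_utf8 are
hypothetical and chosen only for this example; the result for unassigned
code points depends on the Unicode version shipped with the interpreter.

```python
import unicodedata

SURROGATE_FIRST, SURROGATE_LAST = 0xD800, 0xDFFF


def classify(cp: int) -> str:
    """Classify a code point as 'surrogate', 'unassigned' or 'assigned'."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError(f"outside the Unicode code space: {cp:#x}")
    if SURROGATE_FIRST <= cp <= SURROGATE_LAST:
        return "surrogate"
    # General category Cn covers unassigned code points (and noncharacters)
    # in the Unicode version known to this Python build.
    if unicodedata.category(chr(cp)) == "Cn":
        return "unassigned"
    return "assigned"


def strict_utf8(cp: int):
    """Return the UTF-8 bytes, or None if the strict codec rejects cp."""
    try:
        return chr(cp).encode("utf-8")
    except UnicodeEncodeError:
        return None


def debug_utf8(cp: int) -> bytes:
    """Lenient variant for inspection tools: lets lone surrogates through."""
    return chr(cp).encode("utf-8", errors="surrogatepass")


if __name__ == "__main__":
    for cp in (0x41, 0x50000, 0xD800):
        print(f"U+{cp:04X}", classify(cp), strict_utf8(cp), debug_utf8(cp))
```

The strict codec accepts the unassigned U+50000 but rejects U+D800,
while the "surrogatepass" error handler plays the role of the debugging
tool that displays surrogates as non-errors.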

>Note that UTF-16 (encoding form), or UTF-16/UTF-16BE/UTF-16LE (encoding
>schemes) do not allow encoding surrogate codepoints. They only allow
>using pairs of surrogate code units, to encode another non-surrogate
>codepoint.

That holds in their current form; they would need to be formally altered.
There is a conceptual problem in tying UTF-16 to the Unicode code point
range. A change clearly separating these two distinct entities, character
set and encoding, might pragmatically call for leaving these slots empty
for a designated period of time. But it would lessen the confusion.
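
The coupling between UTF-16 and the code point range can be seen in the
surrogate pair arithmetic itself. The following Python sketch (the
function name to_surrogate_pair is hypothetical) encodes a supplementary
code point as a pair of surrogate code units, and shows that the code
units D800..DFFF are consumed by that mechanism, so a code point in that
range has no UTF-16 representation of its own.

```python
def to_surrogate_pair(cp: int) -> tuple:
    """Split a supplementary code point into (high, low) UTF-16 code units."""
    if 0xD800 <= cp <= 0xDFFF:
        # These values are reserved as code units for pairs, so a code
        # point in this range cannot itself be encoded in UTF-16.
        raise ValueError(f"surrogate code point U+{cp:04X} is not encodable")
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only supplementary code points need a pair")
    v = cp - 0x10000              # 20-bit offset into the supplementary planes
    high = 0xD800 + (v >> 10)     # top 10 bits    -> D800..DBFF
    low = 0xDC00 + (v & 0x3FF)    # bottom 10 bits -> DC00..DFFF
    return high, low


if __name__ == "__main__":
    high, low = to_surrogate_pair(0x1D11E)
    print(f"U+1D11E -> {high:04X} {low:04X}")
    # Cross-check against Python's own UTF-16BE codec.
    assert chr(0x1D11E).encode("utf-16-be") == bytes(
        [high >> 8, high & 0xFF, low >> 8, low & 0xFF]
    )
```

The pair scheme covers exactly 1024 * 1024 code points above U+FFFF,
which is where the upper limit U+10FFFF comes from.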

Hans Aberg

This archive was generated by hypermail 2.1.5 : Fri Jan 28 2005 - 12:01:08 CST