Re: Stateful encoding mechanisms

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Fri May 20 2005 - 10:04:26 CDT


    From: "Dean Snyder" <dean.snyder@jhu.edu>
    > By the way, can you indeed tell us what the "unique status" of the code
    > unit 0xDF02 is? And if it has one, why it is not spelled out in the
    > standard?

    It is in the standard:
    * the code unit 0xDF02 is a surrogate (a trail, or "low", surrogate).
    * the codepoint U+DF02 is permanently reserved as a surrogate code point,
    so it is not a character;
    * there's no assigned character at U+DF02, and none will ever be assigned
    there by ISO/IEC 10646-1 or Unicode, because that codepoint is permanently
    reserved for the UTF-16 surrogate mechanism.
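
    As a small illustration of the first point, the surrogate ranges can be
    checked directly on a 16-bit code unit. This is only a sketch; the
    function names are hypothetical and the ranges are the ones fixed by
    UTF-16 (0xD800-0xDBFF for lead surrogates, 0xDC00-0xDFFF for trail
    surrogates):

```python
def is_lead_surrogate(unit: int) -> bool:
    """True if a 16-bit code unit is a lead ("high") surrogate."""
    return 0xD800 <= unit <= 0xDBFF


def is_trail_surrogate(unit: int) -> bool:
    """True if a 16-bit code unit is a trail ("low") surrogate."""
    return 0xDC00 <= unit <= 0xDFFF


def is_surrogate(unit: int) -> bool:
    """True if a 16-bit code unit falls anywhere in the surrogate range."""
    return is_lead_surrogate(unit) or is_trail_surrogate(unit)


# 0xDF02 lies in 0xDC00-0xDFFF, so it is a trail surrogate.
print(is_surrogate(0xDF02), is_trail_surrogate(0xDF02))
```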

    Unicode works at the character level only, and only for plain text. Code
    units are only part of the serialization mechanisms used to interchange
    text data in memory or across systems. Code units are not plain text, and
    even a file encoded with UTF-16 code units is not necessarily plain text,
    as it may decode into a stream of codepoints not assigned to characters
    (i.e. <reserved> until further assignment, or permanently excluded, like
    the surrogate codepoints or the noncharacters U+FFFE and U+FFFF).

    An application handling plain-text at the codepoint level will then never
    see any codepoint whose value is 0xDF02. If this happens, there's a serious
    bug in the (de)serialization routines that perform I/O over streams of code
    units or of bytes (with encoding schemes): these routines are then
    non-conforming.
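
    To make the point about deserialization concrete, here is a minimal
    sketch of a strict decoder from UTF-16 code units to codepoints. The
    function name and the choice to raise ValueError are assumptions made for
    the example, not something prescribed by the standard. Well-formed pairs
    are combined into one supplementary codepoint, and an unpaired surrogate
    is rejected, so a value such as 0xDF02 never reaches the caller as a
    codepoint:

```python
def decode_utf16_units(units):
    """Decode a sequence of 16-bit code units into a list of codepoints.

    Raises ValueError on any unpaired surrogate instead of passing it on.
    """
    codepoints = []
    i = 0
    n = len(units)
    while i < n:
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # Lead surrogate: it must be followed by a trail surrogate.
            if i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
                trail = units[i + 1]
                codepoints.append(
                    0x10000 + ((unit - 0xD800) << 10) + (trail - 0xDC00)
                )
                i += 2
                continue
            raise ValueError(f"unpaired lead surrogate 0x{unit:04X} at index {i}")
        if 0xDC00 <= unit <= 0xDFFF:
            # Trail surrogate with no preceding lead surrogate.
            raise ValueError(f"unpaired trail surrogate 0x{unit:04X} at index {i}")
        codepoints.append(unit)
        i += 1
    return codepoints


# A well-formed pair yields one supplementary codepoint, not two surrogates.
print([hex(cp) for cp in decode_utf16_units([0x0041, 0xD800, 0xDF02])])
```

    With this approach the pair <0xD800, 0xDF02> decodes to the single
    codepoint U+10302, while a lone 0xDF02 is reported as an error at the
    I/O boundary.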

    (By contrast, PUA codepoints are assigned as Unicode characters.)
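
    One way to see the difference in status is to look at the
    General_Category values in the Unicode Character Database, here through
    Python's standard unicodedata module. This is only an illustrative
    sketch; the helper name is hypothetical. The categories shown are Cs
    (surrogate), Co (private use) and Cn (unassigned, which is also what
    noncharacters such as U+FFFE report):

```python
import unicodedata


def general_category(codepoint: int) -> str:
    """Return the General_Category of a codepoint, e.g. 'Lu', 'Cs', 'Co', 'Cn'."""
    return unicodedata.category(chr(codepoint))


print(general_category(0xDF02))  # Cs: surrogate codepoint
print(general_category(0xE000))  # Co: private use, assigned as a character
print(general_category(0xFFFE))  # Cn: noncharacter, reported as unassigned
print(general_category(0x0041))  # Lu: ordinary assigned letter 'A'
```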


