Re: Unicode 4.0 BETA available for review

From: Kenneth Whistler (
Date: Thu Feb 27 2003 - 14:38:36 EST


    Stefan Persson suggested:

    > >Unicode 3.0 defined non-shortest UTF-8 as *irregular* code value
    > >sequences. There were two types:
    > >
    > > a. 0xC0 0x80 for U+0000 (instead of 0x00)
    > > b. 0xED 0xA0 0x80 0xED 0xB0 0x80 for U+10000 (instead of 0xF0 0x90 0x80
    > >    0x80)
    > >
    > >
    > Ah, but encoding NULL as a surrogate character and then encoding those
    > two surrogates as three bytes, making totally 6 bytes a character, would
    > also be technically possible (though not legal), right?

    I'm not sure what you are talking about, here.

    First of all, there is no such thing as a "surrogate character",
    under the terminology currently adopted by the standard.

    There are surrogate code points: U+D800..U+DFFF. Those can
    *never* be assigned to any abstract character.

    Then there are surrogate code units: 0xD800..0xDFFF. Those are
    used in pairs in the UTF-16 encoding form to represent a single
    supplementary character (one encoded off the BMP).
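
    As a rough illustration of that pairing, here is a minimal Python
    sketch. The helper names (to_surrogate_pair, is_surrogate_code_point)
    are made up for this example and are not taken from the standard; the
    arithmetic is the usual UTF-16 mapping for supplementary code points.

```python
def is_surrogate_code_point(cp):
    """True for code points in the surrogate range U+D800..U+DFFF."""
    return 0xD800 <= cp <= 0xDFFF


def to_surrogate_pair(cp):
    """Split a supplementary code point into (high, low) UTF-16 code units."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    v = cp - 0x10000                 # 20-bit offset past the BMP
    high = 0xD800 + (v >> 10)        # top 10 bits go in the high surrogate
    low = 0xDC00 + (v & 0x3FF)       # bottom 10 bits go in the low surrogate
    return high, low


# U+10000 is represented in UTF-16 by the pair <0xD800 0xDC00>.
pair = to_surrogate_pair(0x10000)
assert pair == (0xD800, 0xDC00)

# Cross-check against Python's own UTF-16 encoder (big-endian, no BOM).
assert chr(0x10000).encode("utf-16-be") == bytes([0xD8, 0x00, 0xDC, 0x00])
```

    Both members of the pair are themselves in the surrogate range, which
    is why that range cannot also be used for abstract characters.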

    NULL is U+0000.
      Its representation in UTF-32 is <0x00000000>.
      Its representation in UTF-16 is <0x0000>.
      Its representation in UTF-8 is <0x00>.
    Period. End of story. Anything else is nonconformant to the standard.
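
    For anyone who wants to check this against an implementation, the
    following Python sketch compares the three representations of U+0000
    and tests the alternative byte sequences discussed above. It relies on
    Python 3's strict built-in codecs, which is an assumption about one
    particular implementation and not a statement about every decoder;
    is_well_formed_utf8 is a hypothetical helper name.

```python
NUL = "\u0000"

# The three representations listed above (big-endian, no BOM).
assert NUL.encode("utf-32-be") == b"\x00\x00\x00\x00"
assert NUL.encode("utf-16-be") == b"\x00\x00"
assert NUL.encode("utf-8") == b"\x00"


def is_well_formed_utf8(data):
    """Return True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Two-byte form of U+0000: rejected by the strict decoder.
overlong_nul = is_well_formed_utf8(b"\xC0\x80")

# Six-byte form of U+10000 built from two encoded surrogates: rejected.
six_byte_form = is_well_formed_utf8(b"\xED\xA0\x80\xED\xB0\x80")

# Four-byte form of U+10000: accepted.
four_byte_form = is_well_formed_utf8(b"\xF0\x90\x80\x80")

assert (overlong_nul, six_byte_form, four_byte_form) == (False, False, True)
```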

