Re: Nicest UTF

From: Philippe Verdy (
Date: Fri Dec 03 2004 - 16:13:59 CST

    RE: Nicest UTF
    From: Lars Kristan
    > I agree. But not for reasons you mentioned. There is one other important
    > advantage: UTF-8 is stored in a way that permits storing invalid
    > sequences. I will need to elaborate that, of course.

    Not true for UTF-8. UTF-8 can only store sequences of valid code points,
    those in the ranges U+0000 to U+D7FF and U+E000 to U+10FFFF (that is, every
    code point except the surrogate code points).
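
    The range restriction above can be sketched in Python. This is only an
    illustration, not part of any standard API: the helper names
    is_scalar_value, strict_utf8_encode and decodes_as_utf8 are made up for
    this example, and the sketch relies on Python 3's built-in "utf-8" codec,
    which rejects surrogates when encoding and decoding.

```python
def is_scalar_value(cp: int) -> bool:
    """True if cp lies in U+0000..U+D7FF or U+E000..U+10FFFF."""
    return 0x0000 <= cp <= 0xD7FF or 0xE000 <= cp <= 0x10FFFF


def strict_utf8_encode(cp: int) -> bytes:
    """Encode one code point as standard UTF-8, refusing surrogates."""
    if not is_scalar_value(cp):
        raise ValueError(f"U+{cp:04X} cannot be represented in UTF-8")
    return chr(cp).encode("utf-8")


def decodes_as_utf8(data: bytes) -> bool:
    """True if the strict built-in codec accepts data as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    print(strict_utf8_encode(0x10400).hex())      # a non-BMP character
    print(decodes_as_utf8(bytes.fromhex("eda080")))  # 3-byte form of U+D800
```

    The second call feeds the decoder the 3-byte pattern that a surrogate
    code point would produce if UTF-8 allowed it; a strict decoder treats
    those bytes as ill-formed.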

    But it's true that there are non-standard extensions of UTF-8 (such as
    Sun's one for Java) that allow escaping some byte values normally generated
    by standard UTF-8 (notably the single byte 0x00 representing U+0000), or
    that allow representing isolated or incorrectly paired surrogate code
    points which may be present in a normally invalid Unicode string, or that
    allow representing non-BMP characters with 6 bytes, where each group of
    3 bytes represents one surrogate code unit (not a code point!).
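
    As a rough sketch of what such an extension does, here is a small encoder
    in Python following the three behaviours just described. The function
    names are hypothetical, and this is not Sun's implementation; it assumes
    U+0000 is escaped as the two bytes C0 80 and that every surrogate,
    paired or not, is written as a 3-byte group.

```python
def encode_3byte(unit: int) -> bytes:
    """Write a 16-bit value in the 3-byte UTF-8 bit layout."""
    return bytes([
        0xE0 | (unit >> 12),
        0x80 | ((unit >> 6) & 0x3F),
        0x80 | (unit & 0x3F),
    ])


def modified_utf8_encode(text: str) -> bytes:
    """Sketch of a Java-style extended UTF-8 encoder (assumed behaviour)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp == 0:
            # U+0000 is escaped so the output never contains a 0x00 byte.
            out += b"\xc0\x80"
        elif cp >= 0x10000:
            # Non-BMP: split into the UTF-16 surrogate code units,
            # then write each one as its own 3-byte group (6 bytes total).
            v = cp - 0x10000
            out += encode_3byte(0xD800 | (v >> 10))
            out += encode_3byte(0xDC00 | (v & 0x3FF))
        elif 0xD800 <= cp <= 0xDFFF:
            # Isolated surrogate code point: passed through as 3 bytes,
            # which standard UTF-8 forbids.
            out += encode_3byte(cp)
        else:
            out += ch.encode("utf-8")
    return bytes(out)


if __name__ == "__main__":
    print(modified_utf8_encode("\x00").hex())        # c080
    print(modified_utf8_encode("\U00010400").hex())  # eda081edb080
```

    None of these outputs is well-formed UTF-8: a strict decoder rejects the
    overlong C0 80 form and every 3-byte surrogate group.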

    Only the CESU-8 variant of UTF-8 is documented and standardized. In CESU-8,
    a non-BMP character is represented by encoding, as two groups of 3 bytes,
    the two surrogate code units that UTF-16 would use for the same character.
    CESU-8 is less efficient than UTF-8, but even so it does not allow
    representing invalid Unicode strings containing surrogate *code points*,
    which are not characters (I did not say *code units*), even if they are
    apparently correctly "paired". The concept of paired surrogates exists only
    within the UTF-16 encoding scheme, which represents strings not as streams
    of characters coded with code points, but as streams of 16-bit code units.
    If you need extensions like this, it is because you need to represent data
    which is not valid Unicode text. Such an extended scheme is not a UTF, but
    a serialization format for this type of data (even if this type can
    represent all instances of valid Unicode text).
