From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Fri Dec 03 2004 - 16:13:59 CST
RE: Nicest UTF
From: Lars Kristan
> I agree. But not for reasons you mentioned. There is one other important
> advantage:
> UTF-8 is stored in a way that permits storing invalid sequences. I will
> need to elaborate that, of course.
Not true for UTF-8. Standard UTF-8 can only store sequences of valid code
points, in the ranges U+0000 to U+D7FF and U+E000 to U+10FFFF (that is,
excluding the surrogate code points).
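To make the range restriction concrete, here is a minimal sketch in Python, assuming its built-in strict "utf-8" codec as the conformance check. The helper names is_valid_utf8 and can_encode_utf8 are made up for this illustration.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode under Python's strict UTF-8 codec."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def can_encode_utf8(text: str) -> bool:
    """Return True if every code point in text is encodable as strict UTF-8."""
    try:
        text.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


# The boundaries of the two valid ranges are accepted.
print(can_encode_utf8("\ud7ff\ue000\U0010ffff"))
# A surrogate code point is rejected, in both directions.
print(can_encode_utf8("\ud800"))
print(is_valid_utf8(b"\xed\xa0\x80"))
```

The byte sequence ED A0 80 is what a naive 3-byte encoding of U+D800 would produce; a strict decoder treats it as ill-formed rather than as a surrogate.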
But it is true that there are non-standard extensions of UTF-8 (such as
Sun's one for Java) that escape some byte values normally generated by
standard UTF-8 (notably the single byte 0x00 representing U+0000, which Java
writes as the two bytes 0xC0 0x80), or that allow representing isolated or
incorrectly paired surrogate code points, which may be present in a normally
invalid Unicode string, or that represent non-BMP characters with 6 bytes,
where each group of 3 bytes represents one surrogate code unit (not a code
point!).
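As a rough illustration of two of these extensions, the sketch below escapes U+0000 the way Java's modified UTF-8 does, and uses Python's "surrogatepass" error handler to serialize isolated surrogates. The function names are hypothetical, and neither output is valid UTF-8.

```python
def encode_nul_escaped(text: str) -> bytes:
    """UTF-8 encode, but write U+0000 as the overlong pair C0 80.

    In standard UTF-8 the byte 0x00 only ever encodes U+0000, so a plain
    byte replacement is enough here.
    """
    return text.encode("utf-8").replace(b"\x00", b"\xc0\x80")


def encode_lenient(text: str) -> bytes:
    """Serialize a string that may contain isolated surrogate code points."""
    return text.encode("utf-8", "surrogatepass")


escaped = encode_nul_escaped("a\x00b")
print(escaped.hex())    # no 0x00 byte remains in the output

lone = encode_lenient("\ud800")
print(lone.hex())       # 3 bytes that a strict UTF-8 decoder rejects

# Round-tripping requires the same lenient handler on the decoding side.
print(lone.decode("utf-8", "surrogatepass") == "\ud800")
```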
Only the CESU-8 variant of UTF-8 is documented and standardized (there,
non-BMP characters are represented by encoding, as two groups of 3 bytes, the
two surrogate code units that UTF-16 would use to represent the same
character). CESU-8 is less efficient than UTF-8, but even so it does not
allow representing invalid Unicode strings containing surrogate
*code points* which are not characters (I did not say *code units*), even if
they are apparently correctly "paired" (the concept of paired surrogates
only exists within the UTF-16 encoding form, which represents strings not as
streams of characters coded with code points, but as streams of 16-bit code
units).
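The CESU-8 layout described above can be sketched as follows. This is an illustrative encoder only, written from the description in this message; the name cesu8_encode is made up, and the input is assumed to be valid Unicode text (a surrogate code point in the input raises an error instead of being encoded).

```python
def cesu8_encode(text: str) -> bytes:
    """Encode valid Unicode text as CESU-8 (illustrative sketch)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp >= 0x10000:
            # Split the code point into the two UTF-16 surrogate code units.
            v = cp - 0x10000
            high = 0xD800 | (v >> 10)
            low = 0xDC00 | (v & 0x3FF)
            # Write each code unit with the 3-byte UTF-8 bit layout.
            for unit in (high, low):
                out.append(0xE0 | (unit >> 12))
                out.append(0x80 | ((unit >> 6) & 0x3F))
                out.append(0x80 | (unit & 0x3F))
        else:
            # BMP characters are encoded exactly as in UTF-8. The strict
            # codec raises UnicodeEncodeError for a surrogate code point.
            out += ch.encode("utf-8")
    return bytes(out)


ch = "\U00010400"
print(cesu8_encode(ch).hex())     # 6 bytes: two 3-byte groups
print(ch.encode("utf-8").hex())   # 4 bytes in standard UTF-8
```

The 6-byte versus 4-byte difference for the same character is the efficiency loss mentioned above.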
If you need extensions like this, it is because you need to represent data
which is not valid Unicode text. Such an extended scheme is not a UTF, but a
serialization format for this type of data (even if this type can also
represent all instances of valid Unicode text).