Re: Nicest UTF

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Fri Dec 03 2004 - 16:13:59 CST

Next message: Marcin 'Qrczak' Kowalczyk: "Re: Nicest UTF"

Previous message: Peter Constable: "RE: OpenType vs TrueType (was current version of unicode-font)"
In reply to: Lars Kristan: "RE: Nicest UTF"
Next in thread: Doug Ewell: "Re: Nicest UTF"

From: Lars Kristan
> I agree. But not for reasons you mentioned. There is one other important
> advantage: UTF-8 is stored in a way that permits storing invalid sequences.
> I will need to elaborate that, of course.

Not true for UTF-8. Standard UTF-8 can only store sequences of valid code
points, in the ranges U+0000 to U+D7FF and U+E000 to U+10FFFF (that is,
excluding the surrogate code points).
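
As a rough illustration of that range restriction, here is a minimal sketch
using Python 3's built-in strict "utf-8" codec. The helper names
(is_scalar_value, try_encode, try_decode) are made up for this example; only
the codec behaviour is Python's own.

```python
def is_scalar_value(cp: int) -> bool:
    """True for U+0000..U+D7FF and U+E000..U+10FFFF (surrogates excluded)."""
    return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)


def try_encode(cp: int):
    """Encode one code point with the strict codec; None if it is refused."""
    try:
        return chr(cp).encode("utf-8")
    except UnicodeEncodeError:
        return None


def try_decode(data: bytes):
    """Decode bytes with the strict codec; None if the bytes are refused."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return None
```

With these helpers, try_encode(0xD800) returns None, and so does
try_decode(b"\xed\xa0\x80"), the 3-byte pattern that a surrogate would
produce if the encoder did not check for it.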

But it is true that there are non-standard extensions of UTF-8 (such as Sun's
one for Java) that allow escaping some byte values normally generated by
standard UTF-8 (notably the single byte 0x00 representing U+0000), or that
allow representing isolated or incorrectly paired surrogate code points,
which may be present in a normally invalid Unicode string, or that represent
non-BMP characters with 6 bytes, where each group of 3 bytes represents one
surrogate code unit (not a code point!).
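
To make the Java-style extension more concrete, here is a minimal sketch, in
Python, of an encoder that works on 16-bit code units the way Java's
"modified UTF-8" is generally described: U+0000 becomes the two bytes C0 80,
and every surrogate code unit, paired or not, gets its own 3 bytes. The
function names are invented for this example, and this is not Sun's code.

```python
def to_utf16_units(text: str) -> list:
    """Split a Python string into UTF-16 code units (hypothetical helper)."""
    units = []
    for ch in text:
        cp = ord(ch)
        if cp >= 0x10000:
            v = cp - 0x10000
            units.append(0xD800 | (v >> 10))    # high surrogate unit
            units.append(0xDC00 | (v & 0x3FF))  # low surrogate unit
        else:
            units.append(cp)
    return units


def modified_utf8_encode(units) -> bytes:
    """Encode 16-bit code units one at a time, with no surrogate checks."""
    out = bytearray()
    for u in units:
        if 0x0001 <= u <= 0x007F:
            out.append(u)                       # plain ASCII, 1 byte
        elif u <= 0x07FF:
            # 2 bytes; this branch also catches u == 0, giving C0 80
            out += bytes([0xC0 | (u >> 6), 0x80 | (u & 0x3F)])
        else:
            # 3 bytes; surrogate units 0xD800..0xDFFF land here unchecked
            out += bytes([0xE0 | (u >> 12),
                          0x80 | ((u >> 6) & 0x3F),
                          0x80 | (u & 0x3F)])
    return bytes(out)
```

For instance, modified_utf8_encode([0x0000]) gives b"\xc0\x80", and an
isolated unit such as [0xD800] gives b"\xed\xa0\x80". A strict UTF-8 decoder
rejects both byte sequences.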

Only the CESU-8 variant of UTF-8 is documented and standardized (in it,
non-BMP characters are represented by encoding, as two groups of 3 bytes, the
two surrogate code units that UTF-16 would use to represent the same
character). CESU-8 is less efficient than UTF-8, and even so it does not
allow representing invalid Unicode strings containing surrogate
*code points*, which are not characters (I did not say *code units*), even if
they are apparently correctly "paired" (the concept of paired surrogates
exists only within the UTF-16 encoding scheme, which represents strings not
as streams of characters coded with code points, but as streams of 16-bit
code units).
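
The CESU-8 layout described above can be sketched as follows. This is only an
illustration in Python with invented names (cesu8_encode and its helper), not
a reference implementation. It takes valid Unicode text as input and refuses
surrogate code points, which is the point being made: the surrogates appear
only as code units derived from a non-BMP character.

```python
def _three_bytes(unit: int) -> bytes:
    """Encode one 16-bit value in the 3-byte UTF-8 bit layout."""
    return bytes([0xE0 | (unit >> 12),
                  0x80 | ((unit >> 6) & 0x3F),
                  0x80 | (unit & 0x3F)])


def cesu8_encode(text: str) -> bytes:
    """Encode valid Unicode text as CESU-8 (sketch)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if 0xD800 <= cp <= 0xDFFF:
            # a surrogate code point is not a character: not representable
            raise ValueError("surrogate code point U+%04X" % cp)
        if cp < 0x10000:
            out += ch.encode("utf-8")           # BMP: same bytes as UTF-8
        else:
            v = cp - 0x10000
            out += _three_bytes(0xD800 | (v >> 10))    # high code unit
            out += _three_bytes(0xDC00 | (v & 0x3FF))  # low code unit
    return bytes(out)
```

With this sketch, U+10400 takes the 6 bytes ED A0 81 ED B0 80, against the
4 bytes F0 90 90 80 in UTF-8, which is where the loss of efficiency comes
from.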

If you need extensions like this, it is because you need to represent data
that is not valid Unicode text. Such an extended scheme is not a UTF, but a
serialization format for this type of data (even if this type can represent
all instances of valid Unicode text).


This archive was generated by hypermail 2.1.5 : Fri Dec 03 2004 - 16:17:51 CST