From: Marcin 'Qrczak' Kowalczyk (qrczak@knm.org.pl)
Date: Tue Dec 14 2004 - 14:03:14 CST
"Arcane Jill" <arcanejill@ramonsky.com> writes:
> OBSERVATION - Roundtripping is possible in the direction NOT-UTF-8 ->
> NOT-UTF-16 -> NOT-UTF-8
But it's not possible in the direction NOT-UTF-16 -> NOT-UTF-8 ->
NOT-UTF-16, unless you define valid sequences of NOT-UTF-16 in an
awkward way that excludes exactly those subsequences of
non-characters which would form a valid UTF-8 fragment.
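
The asymmetry shows up in any scheme of this kind. A minimal Python
sketch, using the language's "surrogateescape" error handler as a
stand-in for the NOT-UTF-8 / NOT-UTF-16 mapping (an assumption for
illustration only: it is not the scheme proposed in this thread, and
the helper names to_string and to_bytes are hypothetical):

```python
# Stand-in mapping (assumption): "surrogateescape" turns each undecodable
# byte 0x80..0xFF into the lone surrogate U+DC80..U+DCFF when decoding,
# and turns those code units back into the original bytes when encoding.

def to_string(data: bytes) -> str:
    """Bytes -> string; invalid UTF-8 bytes become escape code units."""
    return data.decode("utf-8", errors="surrogateescape")


def to_bytes(text: str) -> bytes:
    """String -> bytes; escape code units become raw bytes again."""
    return text.encode("utf-8", errors="surrogateescape")


# Direction 1: bytes -> string -> bytes round-trips, even for invalid UTF-8.
raw = b"abc\xff\xfe"
assert to_bytes(to_string(raw)) == raw

# Direction 2: string -> bytes -> string does not always round-trip.
# These two escape code units stand for the bytes C3 A9, which together
# happen to be the valid UTF-8 encoding of U+00E9.
escaped = "\udcc3\udca9"
back = to_string(to_bytes(escaped))
assert back == "\u00e9"
assert back != escaped
```

The second case is the one that forces the awkward definition: the
two-unit string would have to be declared invalid for the mapping to
be reversible in both directions.
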
Unicode has the following property. Consider sequences of valid
Unicode characters: from the range U+0000..U+10FFFF, excluding
non-characters (i.e. U+nFFFE and U+nFFFF for n from 0 to 0x10 and
U+FDD0..U+FDEF) and surrogates. Any such sequence can be encoded
in any UTF-n, and nothing else is expected from UTF-n.
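
The set described above can be written down as a predicate on code
points. A rough Python sketch (the function names are hypothetical,
chosen for illustration), followed by a check that a few such
characters survive a trip through UTF-8, UTF-16 and UTF-32:

```python
def is_surrogate(cp: int) -> bool:
    return 0xD800 <= cp <= 0xDFFF


def is_noncharacter(cp: int) -> bool:
    # U+FDD0..U+FDEF, plus U+nFFFE and U+nFFFF in each plane n.
    # The plane test is only meaningful for cp <= 0x10FFFF.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def is_valid_character(cp: int) -> bool:
    """True for code points in U+0000..U+10FFFF that are neither
    surrogates nor non-characters."""
    return (
        0 <= cp <= 0x10FFFF
        and not is_surrogate(cp)
        and not is_noncharacter(cp)
    )


# Any sequence of such characters can be encoded in any UTF-n and decoded back.
sample = "".join(chr(cp) for cp in (0x41, 0xE9, 0xFFFD, 0x10000, 0x10FFFD))
assert all(is_valid_character(ord(ch)) for ch in sample)
for codec in ("utf-8", "utf-16-le", "utf-32-le"):
    assert sample.encode(codec).decode(codec) == sample
```
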
With the exception of the set of non-characters being irregular and
IMHO too large (why exclude U+FDD0..U+FDEF?!), and a weird top
limit caused by UTF-16, this gives a precise and unambiguous set of
values for which encoders and decoders are supposed to work. Well,
except for the non-obvious treatment of a BOM (at which level should
it be stripped? does this include UTF-8?).
A variant of UTF-8 which includes all byte sequences yields a much
less regular set of abstract string values, especially if we consider
that 11101111 10111111 10111110 binary (U+FFFE encoded as UTF-8) is
not valid UTF-8, as much as 0xFFFE is not valid UTF-16 (it's a
reversed BOM; it must be invalid in order for a BOM to fulfill its
role).
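
For reference, a small Python check of which code point those three
bytes denote and why it reads as a reversed BOM. Note that Python's
built-in decoder accepts the sequence; that leniency is a choice of
that implementation and not the stricter reading argued for here.

```python
# The three bytes from the text: EF BF BE.
data = bytes([0b11101111, 0b10111111, 0b10111110])

# Python's UTF-8 decoder is lenient about non-characters and yields U+FFFE.
decoded = data.decode("utf-8")
assert decoded == "\ufffe"

# The BOM is U+FEFF. In big-endian UTF-16 it is the bytes FE FF, while
# U+FFFE is FF FE: the same two bytes in the opposite order.
bom_be = "\ufeff".encode("utf-16-be")
swapped_be = "\ufffe".encode("utf-16-be")
assert bom_be == b"\xfe\xff"
assert swapped_be == b"\xff\xfe"

# So a reader that sees FF FE at the start of a UTF-16 stream can conclude
# "little-endian BOM" only if U+FFFE itself can never appear there.
assert swapped_be == "\ufeff".encode("utf-16-le")
```
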
Question: should a new programming language which uses Unicode for
string representation allow non-characters in strings? Argument for
allowing them: otherwise they are completely useless, except
U+FFFE for BOM detection. Argument for disallowing them: they make
UTF-n inappropriate for serialization of arbitrary strings, and thus
non-standard extensions of UTF-n must be used for serialization.
-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak@knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/