Re: Roundtripping in Unicode

From: Marcin 'Qrczak' Kowalczyk (qrczak@knm.org.pl)
Date: Tue Dec 14 2004 - 14:03:14 CST

Next message: Kenneth Whistler: "RE: Roundtripping in Unicode"

Previous message: Peter Kirk: "Re: Roundtripping in Unicode"
In reply to: Arcane Jill: "RE: Roundtripping in Unicode"
Next in thread: Lars Kristan: "RE: Roundtripping in Unicode"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ] [ attachment ]
Mail actions: [ respond to this message ] [ mail a new topic ]

"Arcane Jill" <arcanejill@ramonsky.com> writes:

> OBSERVATION - Roundtripping is possible in the direction NOT-UTF-8 ->
> NOT-UTF-16 -> NOT-UTF-8

But it's not possible in the direction NOT-UTF-16 -> NOT-UTF-8 ->
NOT-UTF-16, unless you define valid sequences of NOT-UTF-16 in an
awkward way which would happen to exclude those subsequences of
non-characters which would form a valid UTF-8 fragment.

Unicode has the following property. Consider sequences of valid
Unicode characters: from the range U+0000..U+10FFFF, excluding
non-characters (i.e. U+nFFFE and U+nFFFF for n from 0 to 0x10 and
U+FDD0..U+FDEF) and surrogates. Any such sequence can be encoded
in any UTF-n, and nothing else is expected from UTF-n.

With the exception of the set of non-characters being irregular and
IMHO too large (why to exclude U+FDD0..U+FDEF?!), and a weird top
limit caused by UTF-16, this gives a precise and unambiguous set of
values for which encoders and decoders are supposed to work. Well,
except non-obvious treatment of a BOM (at which level it should be
stripped? does this include UTF-8?).

A variant of UTF-8 which includes all byte sequences yields a much
less regular set of abstract string values. Especially if we consider
that 11101111 10111111 10111110 binary is not valid UTF-8, as much as
0xFFFE is not valid UTF-16 (it's a reversed BOM; it must be invalid in
order for a BOM to fulfill its role).

Question: should a new programming language which uses Unicode for
string representation allow non-characters in strings? Argument for
allowing them: otherwise they are completely useless at all, except
U+FFFE for BOM detection. Argument for disallowing them: they make
UTF-n inappropriate for serialization of arbitrary strings, and thus
non-standard extensions of UTF-n must be used for serialization.

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak@knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

Next message: Kenneth Whistler: "RE: Roundtripping in Unicode"
Previous message: Peter Kirk: "Re: Roundtripping in Unicode"
In reply to: Arcane Jill: "RE: Roundtripping in Unicode"
Next in thread: Lars Kristan: "RE: Roundtripping in Unicode"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ] [ attachment ]
Mail actions: [ respond to this message ] [ mail a new topic ]

This archive was generated by hypermail 2.1.5 : Tue Dec 14 2004 - 14:05:01 CST