Re: Roundtripping in Unicode

From: Marcin 'Qrczak' Kowalczyk
Date: Tue Dec 14 2004 - 14:03:14 CST

    "Arcane Jill" writes:

    > OBSERVATION - Roundtripping is possible in the direction NOT-UTF-8 ->
    > NOT-UTF-16 -> NOT-UTF-8

    But it's not possible in the direction NOT-UTF-16 -> NOT-UTF-8 ->
    NOT-UTF-16, unless you define valid sequences of NOT-UTF-16 in an
    awkward way which would happen to exclude those subsequences of
    non-characters which would form a valid UTF-8 fragment.
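
    The asymmetry can be sketched with an analogous scheme. Python's
    "surrogateescape" error handler (PEP 383) is not the proposal
    discussed in this thread: it escapes undecodable bytes as lone
    surrogates U+DC80..U+DCFF rather than as non-characters. It has the
    same shape, though, so it is used here as a stand-in. The helper
    names bytes_to_str and str_to_bytes are made up for the example.

```python
# Stand-in for the NOT-UTF-8 <-> NOT-UTF-16 scheme (assumption: Python's
# "surrogateescape" handler, which escapes bad bytes as U+DC80..U+DCFF).

def bytes_to_str(data: bytes) -> str:
    """Decode arbitrary bytes; invalid UTF-8 bytes become escape code points."""
    return data.decode("utf-8", errors="surrogateescape")

def str_to_bytes(text: str) -> bytes:
    """Encode a string; escape code points turn back into the raw bytes."""
    return text.encode("utf-8", errors="surrogateescape")

# Direction 1: bytes -> string -> bytes round-trips, even for invalid input.
raw = b"abc\xff\xc3"  # 0xFF and a truncated two-byte sequence
assert str_to_bytes(bytes_to_str(raw)) == raw

# Direction 2: string -> bytes -> string can fail. Two escape code points
# that stand for the bytes C3 A9 come out as a valid UTF-8 fragment.
escaped = "\udcc3\udca9"
as_bytes = str_to_bytes(escaped)
roundtrip_ok = bytes_to_str(as_bytes) == escaped
```

    Here as_bytes is the valid UTF-8 encoding of U+00E9, so decoding it
    gives one ordinary character instead of the two escape code points,
    and roundtrip_ok is False. Excluding exactly such sequences from
    the set of valid strings is the awkward definition mentioned above.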

    Unicode has the following property. Consider sequences of valid
    Unicode characters: from the range U+0000..U+10FFFF, excluding
    non-characters (i.e. U+nFFFE and U+nFFFF for n from 0 to 0x10 and
    U+FDD0..U+FDEF) and surrogates. Any such sequence can be encoded
    in any UTF-n, and nothing else is expected from UTF-n.
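
    The set described above can be written down as a predicate. This is
    a minimal sketch; the function names are made up for illustration,
    and the sample string is an arbitrary choice of valid characters.

```python
def is_surrogate(cp: int) -> bool:
    return 0xD800 <= cp <= 0xDFFF

def is_noncharacter(cp: int) -> bool:
    # U+FDD0..U+FDEF, plus U+nFFFE and U+nFFFF for n from 0 to 0x10.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

def is_valid_character(cp: int) -> bool:
    return (0 <= cp <= 0x10FFFF
            and not is_surrogate(cp)
            and not is_noncharacter(cp))

# Size of the set: 0x110000 code points minus 2048 surrogates
# minus 66 non-characters.
valid_count = sum(1 for cp in range(0x110000) if is_valid_character(cp))

# Any sequence drawn from the set survives a trip through every UTF-n.
sample = "".join(chr(cp) for cp in (0x41, 0x105, 0x20AC, 0xFFFD, 0x10000, 0x10FFFD))
assert all(is_valid_character(ord(ch)) for ch in sample)
for codec in ("utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be"):
    assert sample.encode(codec).decode(codec) == sample
```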

    The set of non-characters is irregular and IMHO too large (why
    exclude U+FDD0..U+FDEF?!), and the upper limit of U+10FFFF is a
    weird artifact of UTF-16. Apart from that, this gives a precise and
    unambiguous set of values for which encoders and decoders are
    supposed to work. The one remaining wrinkle is the non-obvious
    treatment of a BOM (at which level should it be stripped? does
    this apply to UTF-8 too?).

    A variant of UTF-8 which admits all byte sequences yields a much
    less regular set of abstract string values. Especially if we consider
    that 11101111 10111111 10111110 binary (EF BF BE, which would encode
    U+FFFE) is not valid UTF-8, just as 0xFFFE is not valid UTF-16 (it's
    a byte-swapped BOM; it must be invalid in order for a BOM to fulfill
    its role).
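
    The two byte-level facts in this paragraph can be checked directly.
    Note the assumption: Python's built-in UTF-8 codec is lenient about
    non-characters and decodes the sequence to U+FFFE, so it does not
    enforce the stricter reading argued for here.

```python
# The three bytes from the text, written in binary as above.
data = bytes([0b11101111, 0b10111111, 0b10111110])
assert data == b"\xef\xbf\xbe"

# Python's codec accepts non-characters, so this decodes to U+FFFE.
decoded = data.decode("utf-8")

# A BOM is U+FEFF. Serialized big-endian it is the bytes FE FF; a reader
# assuming the wrong byte order sees the 16-bit value 0xFFFE instead.
bom_be = "\ufeff".encode("utf-16-be")
misread = int.from_bytes(bom_be, "little")
```

    Because 0xFFFE can never be a legitimate character, a decoder that
    sees it at the start of a stream can conclude that the byte order
    is the opposite of what it assumed.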

    Question: should a new programming language which uses Unicode for
    string representation allow non-characters in strings? Argument for
    allowing them: otherwise they are completely useless, except
    U+FFFE for BOM detection. Argument for disallowing them: if strings
    may contain them, UTF-n is unsuitable for serializing arbitrary
    strings, and non-standard extensions of UTF-n must be used instead.
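
    The serialization argument can be made concrete with a strict
    encoder. This is only a sketch: strict_encode is a hypothetical
    helper, not an existing API, and it models an encoder that follows
    the strict reading of UTF-n described earlier.

```python
def is_noncharacter(cp: int) -> bool:
    # U+FDD0..U+FDEF, plus U+nFFFE and U+nFFFF for n from 0 to 0x10.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

def strict_encode(text: str, encoding: str = "utf-8") -> bytes:
    """Hypothetical strict serializer: refuses strings with non-characters."""
    for ch in text:
        if is_noncharacter(ord(ch)):
            raise ValueError(f"non-character U+{ord(ch):04X} cannot be serialized")
    return text.encode(encoding)

# A string the language allows, using U+FFFF as an in-memory separator.
record = "key\uffffvalue"
try:
    strict_encode(record)
    rejected = False
except ValueError:
    rejected = True

# Strings without non-characters serialize normally.
plain = strict_encode("key=value")
```

    If the language permits strings like record, it cannot use such a
    strict encoder to serialize them and needs a lenient one instead.
    Python's built-in codecs happen to be lenient in this respect.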

       __("<         Marcin Kowalczyk
