Re: Roundtripping in Unicode

From: Kenneth Whistler (kenw@sybase.com)
Date: Tue Dec 14 2004 - 15:27:57 CST

    Marcin Kowalczyk noted:

    > Unicode has the following property. Consider sequences of valid
    > Unicode characters: from the range U+0000..U+10FFFF, excluding
    > non-characters (i.e. U+nFFFE and U+nFFFF for n from 0 to 0x10 and
    > U+FDD0..U+FDEF) and surrogates. Any such sequence can be encoded
    > in any UTF-n, and nothing else is expected from UTF-n.

    Actually not quite correct. See Section 3.9 of the standard.

    The character encoding forms (UTF-8, UTF-16, UTF-32) are defined
    on the range of Unicode scalar values: U+0000..U+D7FF,
    U+E000..U+10FFFF.

    Each of the UTF's can represent all of those scalar values, and
    text in any one of them can be converted accurately to either of
    the others for every one of those values. That *includes* all the
    code points used for noncharacters.

    U+FFFF is a noncharacter. It is not assigned to an encoded
    abstract character. However, it has a well-formed representation
    in each of the UTF-8, UTF-16, and UTF-32 encoding forms,
    namely:

    UTF-8: <EF BF BF>
    UTF-16: <FFFF>
    UTF-32: <0000FFFF>
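
    For concreteness, here is a small Python 3 sketch of that point.
    It is only my illustration, using the standard library codecs as a
    stand-in for the encoding forms (the "-be" codecs serialize the
    code units big-endian, with no BOM):

        # The noncharacter U+FFFF is a valid scalar value, so all three
        # encoding forms represent it -- with exactly the sequences above.
        assert "\uFFFF".encode("utf-8") == b"\xEF\xBF\xBF"
        assert "\uFFFF".encode("utf-16-be") == b"\xFF\xFF"
        assert "\uFFFF".encode("utf-32-be") == b"\x00\x00\xFF\xFF"

        # A surrogate code point is *not* a scalar value, so the encoding
        # forms are not defined for it and the encoder rejects it.
        try:
            "\uD800".encode("utf-8")
        except UnicodeEncodeError:
            pass  # expected: surrogates are excluded from all the UTF's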

    > With the exception of the set of non-characters being irregular and
    > IMHO too large (why exclude U+FDD0..U+FDEF?!), and a weird top
    > limit caused by UTF-16, this gives a precise and unambiguous set of
    > values for which encoders and decoders are supposed to work.

    Well, since conformant encoders and decoders must work for all
    the noncharacter code points as well, and since U+10FFFF, however
    odd numerologically, is itself precise and unambiguous, I don't
    think you even need these qualifications.

    > Well,
    > except the non-obvious treatment of a BOM (at which level should
    > it be stripped? does this include UTF-8?).

    The handling of BOM is relevant to the character encoding *schemes*,
    where the issues are serialization into byte streams and interpretation
    of those byte streams. Whether you include U+FEFF in text or not
    depends on your interpretation of the encoding scheme for a Unicode
    byte stream.

    At the level of the character encoding forms (the UTF's), the
    BOM, U+FEFF, is handled just like any other scalar value, and its
    representation is completely unambiguous:

    UTF-8: <EF BB BF>
    UTF-16: <FEFF>
    UTF-32: <0000FEFF>
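
    A sketch of that distinction, again just an illustration with
    Python 3's codecs: the "utf-16" codec behaves like the BOM-aware
    encoding scheme, while "utf-16-be" serializes the bare encoding
    form big-endian, with no BOM logic at all.

        import codecs

        # Encoding *scheme*: the "utf-16" codec prepends a BOM when
        # encoding and consumes it (to pick the byte order) when decoding.
        data = "A".encode("utf-16")
        assert data[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)
        assert data.decode("utf-16") == "A"   # the BOM is stripped

        # Encoding *form*: "utf-16-be" knows nothing about BOMs; U+FEFF
        # is encoded and decoded like any other scalar value.
        assert "\uFEFF".encode("utf-16-be") == b"\xFE\xFF"
        assert b"\xFE\xFF\x00\x41".decode("utf-16-be") == "\uFEFF" + "A"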

    >
    > A variant of UTF-8 which includes all byte sequences yields a much
    > less regular set of abstract string values. Especially if we consider
    > that 11101111 10111111 10111110 binary is not valid UTF-8, as much as
    > 0xFFFE is not valid UTF-16 (it's a reversed BOM; it must be invalid in
    > order for a BOM to fulfill its role).

    This is incorrect. <EF BF BE> *is* valid UTF-8, just as <FFFE> is
    valid UTF-16. In both cases these are valid representations of
    a noncharacter, which should not be used in public interchange,
    but that is a separate issue from the fact that the code unit
    sequences themselves are "well-formed" by definition of the
    Unicode encoding forms.
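
    To make that concrete with the same Python 3 stand-in as above
    (my illustration, not the normative definition):

        # <EF BF BE> is well-formed UTF-8 and <FFFE> is well-formed
        # UTF-16; both denote the noncharacter U+FFFE.
        assert b"\xEF\xBF\xBE".decode("utf-8") == "\uFFFE"
        assert b"\xFF\xFE".decode("utf-16-be") == "\uFFFE"

        # Contrast with a sequence that really is ill-formed UTF-8,
        # because it would denote a surrogate code point (U+D800):
        try:
            b"\xED\xA0\x80".decode("utf-8")
        except UnicodeDecodeError:
            pass  # rejected, as required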

    >
    > Question: should a new programming language which uses Unicode for
    > string representation allow non-characters in strings?

    Yes.

    > Argument for
    > allowing them: otherwise they are completely useless, except
    > U+FFFE for BOM detection. Argument for disallowing them: they make
    > UTF-n inappropriate for serialization of arbitrary strings, and thus
    > non-standard extensions of UTF-n must be used for serialization.

    Incorrect. See above. No extensions of any of the encoding forms
    are needed to handle noncharacters correctly.
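
    One last sketch, under the same Python 3 assumption as above: a
    string containing noncharacters round-trips through the unextended
    encoding forms, byte for byte.

        # Noncharacters included, any sequence of scalar values survives
        # a round trip through plain UTF-8, UTF-16, or UTF-32.
        s = "text\uFDD0more\uFFFEtext\U0010FFFF"
        for codec in ("utf-8", "utf-16-be", "utf-32-be"):
            assert s.encode(codec).decode(codec) == s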

    --Ken


