Re: Roundtripping in Unicode

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sat Dec 11 2004 - 18:02:30 CST


    From: "Doug Ewell" <dewell@adelphia.net>
    > Lars Kristan wrote:
    >> I am sure one of the standardizers will find a Unicodally
    >> correct way of putting it.
    >
    > I can't even understand that paragraph, let alone paraphrase it.

    My understanding of his question, and my response to his problem, is that you
    MUST NOT use VALID Unicode code points to represent INVALID byte sequences
    found in text that is alleged to be in a UTF encoding.

    The only way is to use INVALID code points, outside the Unicode code space,
    and then design an encoding scheme that contains and extends the Unicode
    UTFs, making sure that no interaction is possible between such encoded
    binary data and encoded plain text. The conversion between the encoding
    scheme of the byte stream and the in-memory encoding form (code units or
    code points) must therefore be fully bijective. This is hard to design if
    you also have to support multiple UTF encoding schemes: the invalid byte
    sequences of these schemes are not the same, so they must be represented
    with distinct invalid code points or code units for each external UTF!
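
    To make the idea above concrete, here is a minimal sketch in Python,
    handling UTF-8 only. Everything in it is my own assumption for
    illustration: the names ESCAPE_BASE, decode_lossless and encode_lossless
    are hypothetical, and mapping each stray byte b to 0x110000 + b is just
    one possible convention. Code points are kept as plain integers because a
    Python str cannot hold values above U+10FFFF.

```python
# Hypothetical convention: an invalid byte b is stored as ESCAPE_BASE + b,
# a value that lies outside the Unicode code space (which ends at 0x10FFFF).
ESCAPE_BASE = 0x110000


def decode_lossless(data):
    """Turn bytes alleged to be UTF-8 into a list of integer code points.

    Valid sequences become ordinary code points; every byte that cannot be
    decoded becomes one out-of-range value.
    """
    out = []
    pos = 0
    while pos < len(data):
        try:
            out.extend(ord(ch) for ch in data[pos:].decode("utf-8"))
            break
        except UnicodeDecodeError as err:
            bad = pos + err.start  # err.start is relative to data[pos:]
            # Everything before the offending byte is valid UTF-8.
            out.extend(ord(ch) for ch in data[pos:bad].decode("utf-8"))
            # Escape exactly one byte, then resume right after it.
            out.append(ESCAPE_BASE + data[bad])
            pos = bad + 1
    return out


def encode_lossless(code_points):
    """Inverse of decode_lossless for any list that decode_lossless produced."""
    out = bytearray()
    for cp in code_points:
        if cp >= ESCAPE_BASE:
            out.append(cp - ESCAPE_BASE)  # restore the original raw byte
        else:
            out += chr(cp).encode("utf-8")
    return bytes(out)


raw = b"abc\xff\xc3\xa9\xc0\x80"
assert encode_lossless(decode_lossless(raw)) == raw
```

    Note that this sketch only guarantees the bytes -> code points -> bytes
    direction. An arbitrary list of escaped values that happens to spell a
    valid UTF-8 sequence would collapse into a normal code point when decoded
    again, so a fully bijective design needs further rules, and a second UTF
    (UTF-16, say) would need its own distinct escape range.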

    I won't support the idea of reserving some valid code points in the Unicode
    space to store something that is already considered invalid character
    data, notably because the Unicode standard is evolving: such a private
    encoding form might work now but could become incompatible with a later
    version of the Unicode standard, or with a later standardized Unicode
    encoding scheme, meaning that interoperability would be lost...

    The only thing for which you have a guarantee that Unicode will not assign a
    mandatory behavior is the code point space after U+10FFFF. I'm not sure
    about the permanent invalidity of some code unit spaces in the UTF-8 and
    UTF-16 encoding forms. I'm also not sure that there will be enough free
    space in later standard encoding forms or schemes (see for example SCSU or
    BOCU-1), or in other private encoding forms already in use, like the
    "modified UTF-8" extended encoding scheme defined by Sun in Java.
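
    As a small illustration of the two points above, the following Python
    sketch checks (a) that values past U+10FFFF are rejected as code points,
    and (b) that the byte pair C0 80, which strict UTF-8 treats as invalid, is
    exactly what Java's "modified UTF-8" uses for U+0000, so that "free"
    sequence is already taken by someone. The helper names are hypothetical.

```python
def is_strict_utf8(data):
    """Return True if data decodes as standard (strict) UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def in_code_space(value):
    """Return True if value lies in the code space 0..0x10FFFF."""
    try:
        chr(value)
        return True
    except ValueError:
        return False


# C0 80 is the modified UTF-8 form of U+0000; strict UTF-8 rejects it.
assert not is_strict_utf8(b"\xc0\x80")
assert is_strict_utf8(b"\x00")

# U+10FFFF is the last code point; anything beyond is outside the space.
assert in_code_space(0x10FFFF)
assert not in_code_space(0x110000)
```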


