From: Philippe Verdy (email@example.com)
Date: Sat Dec 11 2004 - 18:02:30 CST
From: "Doug Ewell" <firstname.lastname@example.org>
> Lars Kristan wrote:
>> I am sure one of the standardizers will find a Unicodally
>> correct way of putting it.
> I can't even understand that paragraph, let alone paraphrase it.
My understanding of his question, and my response to his problem, is that you
MUST NOT use VALID Unicode codepoints to represent INVALID byte sequences
found in text that is alleged to be in a UTF encoding.
The only way is to use INVALID codepoints, outside the Unicode code space, and
then design an encoding scheme that contains and extends the Unicode UTFs,
making sure that there is no possible interaction between such encoded
binary data and encoded plain text. The conversion between the encoding
scheme of the byte stream and the encoding form, with code units or
codepoints in memory, must therefore be fully bijective. This is hard to design if you
have to also support multiple UTF encoding schemes, because the invalid byte
sequences of these UTF schemes are not the same, and must then be
represented with distinct invalid codepoints or code units for each external
I won't support the idea of reserving some valid codepoint in the Unicode
space to store something that is already considered invalid character data,
notably because the Unicode standard is evolving: such a private encoding
form, which would work now, could become incompatible with a later version
of the Unicode standard, or with a later standardized Unicode encoding
scheme, meaning that interoperability would be lost...
The only area for which you have a guarantee that Unicode will not assign a
mandatory behavior is the codepoint space above U+10FFFF. (I'm not sure about
the permanent invalidity of some code unit ranges in the UTF-8 and UTF-16
encoding forms. I'm also not sure that there will be enough free space in
later standard encoding forms or schemes, see for example SCSU or BOCU-1, or
in other private encoding forms already in use, like the "modified UTF-8"
extended encoding scheme defined by Sun for Java.)
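
To make the idea concrete, here is a minimal sketch in Python of such a private in-memory form, assuming UTF-8 as the only external scheme. The names ESCAPE_BASE, decode_lossless and encode_lossless are hypothetical, and so is the choice of mapping each invalid byte to 0x110000 plus the byte value; none of this comes from any standard. Values are kept as plain integers because a Python str cannot hold anything above U+10FFFF.

```python
# Hypothetical private convention: integers at or above ESCAPE_BASE are not
# Unicode codepoints at all; they carry one raw byte that was invalid UTF-8.
ESCAPE_BASE = 0x110000  # first value past the Unicode code space


def decode_lossless(data: bytes) -> list:
    """Decode UTF-8 into integers; each invalid byte becomes ESCAPE_BASE + byte."""
    out = []
    pos = 0
    while pos < len(data):
        try:
            # Strict decoding: succeeds only if the rest of the input is valid.
            out.extend(ord(ch) for ch in data[pos:].decode("utf-8"))
            break
        except UnicodeDecodeError as err:
            # err.start is relative to the slice that was being decoded.
            bad = pos + err.start
            out.extend(ord(ch) for ch in data[pos:bad].decode("utf-8"))
            out.append(ESCAPE_BASE + data[bad])  # escape exactly one byte
            pos = bad + 1
    return out


def encode_lossless(values) -> bytes:
    """Inverse of decode_lossless for any list that decode_lossless produced."""
    out = bytearray()
    for v in values:
        if v >= ESCAPE_BASE:
            out.append(v - ESCAPE_BASE)  # restore the original raw byte
        else:
            out.extend(chr(v).encode("utf-8"))
    return bytes(out)


raw = b"caf\xc3\xa9 \xff\xfe ok"
assert encode_lossless(decode_lossless(raw)) == raw
```

Note that this sketch only guarantees the bytes -> integers -> bytes round trip. An arbitrary list of integers is not necessarily preserved in the other direction: two escaped bytes such as 0xC3 and 0xA9 placed side by side would be written out as a valid UTF-8 sequence and read back as U+00E9. That is exactly the kind of interaction between binary data and plain text that makes a fully bijective design hard.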