From: Lars Kristan (firstname.lastname@example.org)
Date: Mon Dec 13 2004 - 09:35:09 CST
Philippe VERDY wrote:
> If a source sequence is invalid, and you want to preserve it,
> then this sequence must remain invalid if you change its encoding.
> So there's no need for Unicode to assign valid code points
> for invalid source data.
Using invalid UTF-16 sequences to represent invalid UTF-8 sequences is a
known approach (UTF-8B, if I remember correctly). But this is then not
UTF-16 data so you don't gain much. The data is at risk of being rejected or
filtered out at any time. And that misses the whole point.
Specifically, unpaired surrogates that are used in the UTF-8B conversion
have additional risks, but that is not the issue now.
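(As a present-day illustration of the scheme being described: Python later standardized exactly this unpaired-surrogate trick as the `surrogateescape` error handler, PEP 383. Each undecodable byte 0x80-0xFF is carried through as a lone low surrogate U+DC80-U+DCFF, and encoding reverses the mapping, so the original bytes round-trip exactly. A minimal sketch:)

```python
# Sketch of the UTF-8B idea using Python's "surrogateescape" error
# handler (PEP 383): undecodable bytes become lone surrogates
# U+DC80-U+DCFF instead of being rejected or replaced.

raw = b"valid \xc3\xa9 then invalid \xff\xfe"  # 0xFF, 0xFE are never legal in UTF-8

text = raw.decode("utf-8", errors="surrogateescape")
# The invalid bytes survive as unpaired surrogates U+DCFF, U+DCFE.
assert text.endswith("\udcff\udcfe")

# The round trip restores the original byte sequence bit-for-bit.
assert text.encode("utf-8", errors="surrogateescape") == raw
```

(Note the risk mentioned above: `text` contains unpaired surrogates, so it is not a valid Unicode string and may be rejected by strict processing downstream.)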
> Using PUA space or some unassigned space in Unicode to
> represent invalid sequences present in a source text will be
> a severe design error in all cases, because that conversion
> will not be bijective and could map invalid sequences to
> valid ones without further notice, changing the status of the
> original text which should be kept as incorrectly encoded,
> until explicitly corrected or until the source text is
> reparsed with another more appropriate encoding.
Again, I am not changing the UTF-8 definition. In places where I do decide
to interpret the 128 codepoints differently, it is my responsibility to
understand the risks. If there is a risk, I can prevent it. If there is no
risk, then I don't need to do anything. Thanks for the warning, but may I be
allowed to decide whether it applies to me or not? Or will you insist that
such codepoints should not be assigned to protect the innocent? Let's stop
producing knives. They're dangerous.
> (In fact I also think that mapping invalid sequences to
> U+FFFD is also an error, because U+FFFD is valid, and the
> presence of the encoding error in the source is lost, and
> will not throw exceptions in further processings of the
> remapped text, unless the application constantly checks for
> the presence of U+FFFD in the text stream, and all modules in
> the application explicitly forbids U+FFFD within its interface...)
Generally, no, most definitely not. Your concern is ONLY valid in security
related processing. In data processing, you must preserve the data. U+FFFD
is a valid codepoint. A certain application may treat it as special, just as
another might treat '/' as special. But you are almost suggesting that
U+FFFD is invalid and should be signalled all over. When you realize that
U+FFFD is just a codepoint, then you will also understand that codepoints
for invalid sequences must also be codepoints. Valid codepoints.
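(To make the data-preservation point concrete: mapping to U+FFFD is many-to-one, so distinct broken inputs collapse into identical output and the original bytes are unrecoverable. A quick illustration in Python, whose `errors="replace"` handler does exactly this substitution:)

```python
# errors="replace" maps every invalid byte to the valid codepoint
# U+FFFD, so two different broken inputs become indistinguishable
# and the original bytes cannot be recovered.

bad1 = b"\xff"
bad2 = b"\xfe"

out1 = bad1.decode("utf-8", errors="replace")
out2 = bad2.decode("utf-8", errors="replace")

# Both decode to the same single U+FFFD; the distinction is lost.
assert out1 == out2 == "\ufffd"
```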
I think my ideas are often misunderstood because I speak mainly of using
these codepoints for preserving invalid sequences, leading to the conclusion
that I want to corrupt UTF-8. But that is not so. For one, this mechanism is
not intended to replace either decoding or encoding of UTF-8. It is to be
used on interfaces that cannot guarantee pure UTF-8 data. And UTF-8 is just
an example; one can use the replacement codepoints for preserving bytes in
other encodings, for example a 0xA5 in Latin 3.
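(The Latin 3 case works the same way: 0xA5 is an unassigned byte in ISO 8859-3, so a strict decode rejects it, while a byte-preserving handler keeps it recoverable. A sketch, again using Python's `surrogateescape` as a stand-in for the proposed replacement codepoints:)

```python
# 0xA5 is unassigned in ISO 8859-3 (Latin 3).  A strict decode rejects
# it; a byte-preserving error handler carries it through so the
# original byte is not lost.

data = b"abc\xa5def"

try:
    data.decode("iso8859-3")  # strict: 0xA5 has no mapping
    raise AssertionError("expected a UnicodeDecodeError")
except UnicodeDecodeError:
    pass

text = data.decode("iso8859-3", errors="surrogateescape")
assert "\udca5" in text                                   # byte preserved
assert text.encode("iso8859-3", errors="surrogateescape") == data  # round trip
```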
This archive was generated by hypermail 2.1.5 : Mon Dec 13 2004 - 09:39:58 CST