RE: RE: Roundtripping in Unicode

From: Lars Kristan
Date: Mon Dec 13 2004 - 09:35:09 CST


    Philippe VERDY wrote:
    > If a source sequence is invalid, and you want to preserve it,
    > then this sequence must remain invalid if you change its encoding.
    > So there's no need for Unicode to assign valid code points
    > for invalid source data.
    Using invalid UTF-16 sequences to represent invalid UTF-8 sequences is a
    known approach (UTF-8B, if I remember correctly). But the result is then
    no longer UTF-16 data, so you don't gain much. The data is at risk of being
    rejected or filtered out at any time. And that misses the whole point.

    Specifically, unpaired surrogates that are used in the UTF-8B conversion
    have additional risks, but that is not the issue now.
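
    To make the trade-off concrete, here is a minimal sketch in Python 3. It
    assumes that Python's built-in "surrogateescape" error handler, which maps
    each undecodable byte 0xNN to the lone surrogate U+DCNN, is close enough
    to UTF-8B to stand in for it. The function names are made up for this
    illustration.

```python
def decode_preserving(raw: bytes) -> str:
    # Each byte that is not part of valid UTF-8 becomes a lone surrogate
    # U+DC80..U+DCFF instead of raising an error or being replaced.
    return raw.decode("utf-8", errors="surrogateescape")


def encode_preserving(text: str) -> bytes:
    # The reverse step: lone surrogates U+DC80..U+DCFF turn back into
    # the original bytes 0x80..0xFF.
    return text.encode("utf-8", errors="surrogateescape")


def is_strictly_encodable(text: str) -> bool:
    # A strict encoder refuses lone surrogates, so escaped data can be
    # rejected by any component that insists on well-formed text.
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True


raw = b"abc\xff\xfedef"            # 0xFF and 0xFE never occur in UTF-8
text = decode_preserving(raw)

print(encode_preserving(text) == raw)   # the bytes survive the round trip
print(is_strictly_encodable(text))      # but strict consumers reject the text
```

    The round trip works, but only as long as every component in between
    tolerates the lone surrogates.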

    > Using PUA space or some unassigned space in Unicode to
    > represent invalid sequences present in a source text will be
    > a severe design error in all cases, because that conversion
    > will not be bijective and could map invalid sequences to
    > valid ones without further notice, changing the status of the
    > original text which should be kept as incorrectly encoded,
    > until explicitly corrected or until the source text is
    > reparsed with another more appropriate encoding.
    Again, I am not changing the UTF-8 definition. In places where I do decide
    to interpret the 128 codepoints differently, it is my responsibility to
    understand the risks. If there is a risk, I can prevent it. If there is no
    risk, then I don't need to do anything. Thanks for the warning, but may I be
    allowed to decide whether it applies to me or not? Or will you insist that
    such codepoints should not be assigned to protect the innocent? Let's stop
    producing knives. They're dangerous.

    > (In fact I also think that mapping invalid sequences to
    > U+FFFD is also an error, because U+FFFD is valid, and the
    > presence of the encoding error in the source is lost, and
    > will not throw exceptions in further processings of the
    > remapped text, unless the application constantly checks for
    > the presence of U+FFFD in the text stream, and all modules in
    > the application explicitly forbid U+FFFD within their interfaces...)
    Generally, no, most definitely not. Your concern is valid ONLY in
    security-related processing. In data processing, you must preserve the data. U+FFFD
    is a valid codepoint. A certain application may treat it as special, just as
    another might treat '/' as special. But you are almost suggesting that
    U+FFFD is invalid and should be signalled all over. When you realize that
    U+FFFD is just a codepoint, then you will also understand that codepoints
    for invalid sequences must also be codepoints. Valid codepoints.
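
    A small Python 3 sketch of what replacing with U+FFFD costs in data
    processing. The byte strings are hypothetical file names chosen for the
    illustration; "replace" is Python's standard error handler that
    substitutes U+FFFD for undecodable input.

```python
def decode_lossy(raw: bytes) -> str:
    # Undecodable bytes are replaced by U+FFFD, a perfectly valid code point.
    return raw.decode("utf-8", errors="replace")


# Two different byte strings, neither of which is valid UTF-8.
name_a = b"file\xe9.txt"
name_b = b"file\xe8.txt"

text_a = decode_lossy(name_a)
text_b = decode_lossy(name_b)

print(text_a == text_b)                   # the two names are now identical
print(text_a.encode("utf-8") == name_a)   # and the original bytes are gone
```

    The result is valid text that no later stage will reject, but two
    distinct inputs have collapsed into one and neither can be recovered.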

    I think my ideas are often misunderstood because I speak mainly of using
    these codepoints for preserving the invalid sequences, which leads to the
    conclusion that I want to corrupt UTF-8. But that is not so. For one, this
    mechanism is not intended to replace either decoding or encoding UTF-8. It
    is to be used on interfaces that cannot guarantee pure UTF-8 data. And
    UTF-8 is just an example: one can use the replacement codepoints for
    preserving bytes in other encodings, for example a 0xA5 in Latin 3.
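
    A rough Python 3 sketch of such an interface-level conversion. Everything
    specific in it is an assumption made for illustration: the 128 codepoints
    do not exist, so a private-use range starting at U+EE80 stands in for
    them, and the handler name "escape128" and the helpers to_text and
    to_bytes are invented here.

```python
import codecs

# Hypothetical stand-in for the 128 proposed code points: byte 0xNN is
# represented as U+EENN (private-use area), for NN in 0x80..0xFF.
ESCAPE_BASE = 0xEE80


def _escape_bytes(exc):
    # Decode-time error handler: turn each offending byte into its
    # stand-in code point and resume decoding after the bad span.
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    if any(b < 0x80 for b in bad):
        raise exc  # only bytes 0x80..0xFF have a stand-in
    return "".join(chr(ESCAPE_BASE + b - 0x80) for b in bad), exc.end


codecs.register_error("escape128", _escape_bytes)


def to_text(raw: bytes, encoding: str = "utf-8") -> str:
    # Used only at the interface that cannot guarantee clean data.
    return raw.decode(encoding, errors="escape128")


def to_bytes(text: str, encoding: str = "utf-8") -> bytes:
    # Reverse conversion: stand-in code points become raw bytes again,
    # everything else is encoded normally.
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if ESCAPE_BASE <= cp < ESCAPE_BASE + 0x80:
            out.append(cp - ESCAPE_BASE + 0x80)
        else:
            out += ch.encode(encoding)
    return bytes(out)


raw = b"caf\xe9 \xc3\xa9"           # a stray 0xE9, then a valid UTF-8 "é"
text = to_text(raw)
print(to_bytes(text) == raw)        # the original bytes are preserved
print(len(text.encode("utf-8")))    # and the text is valid for strict encoders

latin3 = to_text(b"\xa5", "iso8859_3")   # 0xA5 is unassigned in Latin 3
print(to_bytes(latin3, "iso8859_3") == b"\xa5")
```

    Note the limitation of the sketch: input that already contains real
    characters in the stand-in range would come back as raw bytes, so the
    party using the conversion has to decide whether that risk applies.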

