RE: Roundtripping in Unicode

From: Lars Kristan
Date: Sat Dec 11 2004 - 13:05:38 CST

    Marcin 'Qrczak' Kowalczyk wrote:
    > Lars Kristan writes:
    > > All assigned codepoints do roundtrip even in my concept.
    > > But unassigned codepoints are not valid data.
    > Please make up your mind: either they are valid and programs are
    > required to accept them, or they are invalid and programs are required
    > to reject them.
    I don't know what they should be called. The fact is that there shouldn't be
    any, and that current software should treat them as valid. So, they are not
    valid, but they cannot (and must not) be validated. As stupid as it sounds.
    I am sure one of the standardizers will find a Unicodally correct way of
    putting it.

    > > Furthermore, I was proposing this concept to be used, but not
    > > unconditionally. So, you can, possibly even should, keep using
    > > whatever you are using.
    > So you prefer to make programs misbehave in unpredictable ways
    > (when they pass the data from a component which uses relaxed rules
    > to a component which uses strict rules) rather than have a clear and
    > unambiguous notion of a valid UTF-8?
    I am not particularly thrilled about it. In fact it should be discussed.
    Constructively. Simply assuming everything will break is not helpful. But if
    you want an answer, yes, I would go for it. Actually, there are fewer
    concerns involved than people think. Security is definitely an issue. But
    again, one shouldn't assume it breaks just like that. Let me risk a bold
    statement: security is typically implicitly centralized. And if comparison
    is always done in the same UTF, it won't break. The simple fact that two
    different UTF-16 strings compare equal in UTF-8 (after relaxed conversion)
    does not introduce a security issue. Today, two invalid UTF-8 strings
    compare the same in UTF-16, after a valid conversion (using a single
    replacement char, U+FFFD) and they compare different in their original form,
    if you use strcmp. But you probably don't. Either you do everything in
    UTF-8, or everything in UTF-16. Not always, but typically. If comparisons
    are not always done in the same UTF, then you need to validate. And not
    validate while converting, but validate on its own. And now many designers
    will remember that they didn't. So, all UTF-8 programs (of that kind) will
    need to be fixed. Well, might as well adopt my broken conversion and fix all
    UTF-16 programs. Again, of that kind, not all in general, so there are few.
    And even those would not be all affected. It would depend on which
    conversion is used where. Things could be worked out, even if we started
    changing all the conversions. Even more so if a new conversion is added and
    only used when specifically requested.
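
    A minimal Python sketch of the comparison point above. It assumes that
    Python's built-in errors="replace" handler is a fair stand-in for a
    conversion that substitutes U+FFFD for invalid input; the byte strings
    and the helper name is_valid_utf8 are made up for illustration.

```python
# Two different invalid UTF-8 byte strings (hypothetical sample data).
a = b"file\x80"
b = b"file\x81"

# In their original form they differ, which is what strcmp would report.
bytes_equal = (a == b)

# A conversion that substitutes U+FFFD for each invalid byte maps both
# inputs to the same string, so they compare equal after conversion.
ua = a.decode("utf-8", errors="replace")
ub = b.decode("utf-8", errors="replace")
decoded_equal = (ua == ub)


def is_valid_utf8(data: bytes) -> bool:
    """Validate on its own, as a step separate from any conversion."""
    try:
        data.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False
```

    A program that compares in more than one UTF would call something like
    is_valid_utf8 before comparing, rather than relying on the conversion
    to signal bad input.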

    There is cost and there are risks. Nothing should be done hastily. But let's
    go back and ask ourselves what the benefits are. And evaluate the whole.

    > > Perhaps I can convert mine, but I cannot convert all filenames on
    > > a user's system.
    > Then you can't access his files.
    Yes, this is where it all started. I cannot afford not to access the files.
    I am not writing a notepad.

    > With your proposal you couldn't as well, because you don't make them
    > valid unconditionally. Some programs would access them and some would
    > break, and it's not clear what should be fixed: programs or filenames.
    It is important to have a way to write programs that can. And, there is
    definitely nothing to be fixed about the filenames. They are there and
    nobody will bother to change them. It is the programs that need to be fixed.
    And if Unicode needs to be fixed to allow that, then that is what is
    supposed to happen. Eventually.
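
    One way such a program could look, sketched in Python. Note that this
    uses Python's "surrogateescape" error handler (PEP 383), which is a
    different mechanism from the one proposed in this thread: it maps each
    undecodable byte to a lone surrogate instead of to dedicated codepoints.
    The filename bytes are invented for the example.

```python
# A filename that is not valid UTF-8 (hypothetical; 0xE9 is Latin-1 "é").
raw_name = b"caf\xe9.txt"

# Relaxed conversion: each undecodable byte 0xNN becomes U+DCNN.
name = raw_name.decode("utf-8", errors="surrogateescape")

# Converting back restores the original bytes exactly, so the file can
# still be reached by its real name.
restored = name.encode("utf-8", errors="surrogateescape")

# The cost discussed earlier in the message: two different strings can
# convert to the same bytes under the relaxed conversion.
s1 = "\u00e9"        # U+00E9, an ordinary valid character
s2 = "\udcc3\udca9"  # two escaped bytes, 0xC3 and 0xA9
same_bytes = (
    s1.encode("utf-8") == s2.encode("utf-8", errors="surrogateescape")
)
```

    The roundtrip keeps every existing filename reachable without renaming
    anything, at the price of the ambiguity shown in the last comparison.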

