Re: Roundtripping in Unicode

From: Marcin 'Qrczak' Kowalczyk
Date: Mon Dec 13 2004 - 09:49:51 CST

    Lars Kristan writes:

    > And once we understand that things are manageable and not as
    > frightening as they seem at first, then we can stop using this as an
    > argument against introducing 128 codepoints. People who will find
    > them useful should and will bother with the consequences. Others
    > don't need to and can roundtrip them as today.

    A person who is against them can't ignore a motion to introduce them,
    because if they are introduced, other people / programs will start
    feeding our programs arbitrary byte sequences labeled as UTF-8
    expecting them to accept the data.

    > So, interpreting the 128 codepoints as 'recreate the original byte
    > sequence' is an option.

    Which guarantees that different programs will have different views of
    the validity and meaning of the same data labeled with the same
    encoding. Long live standardization.
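
To illustrate the disagreement described above, here is a minimal Python sketch. It uses Python's `surrogateescape` error handler, which maps each undecodable byte 0x80..0xFF to one of 128 code points (U+DC80..U+DCFF). This is only an analogous mechanism chosen for the example, not the proposal under discussion, and the sample bytes are made up.

```python
# Bytes labeled "UTF-8" that are really Latin-1 ("café"); 0xE9 alone is not valid UTF-8.
raw = b"caf\xe9"

# Program A: strict UTF-8 decoder, rejects the data outright.
try:
    raw.decode("utf-8")
    strict_accepts = True
except UnicodeDecodeError:
    strict_accepts = False

# Program B: lenient decoder, smuggles the bad byte through as U+DCE9.
lenient_text = raw.decode("utf-8", errors="surrogateescape")

# Program B can recreate the original byte sequence...
roundtrip = lenient_text.encode("utf-8", errors="surrogateescape")

# ...but a strict encoder refuses the very same string.
try:
    lenient_text.encode("utf-8")
    strict_reencodes = True
except UnicodeEncodeError:
    strict_reencodes = False
```

The two programs disagree on whether the same labeled data is valid at all, which is the situation the paragraph above warns about.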

    > Even I will do the same where I just want to represent Unicode in
    > UTF-8. I will only use this conversion in certain places.

    So it's not just different programs, but even the same program in
    different places. Great...

    > The fact that my conversion actually produces UTF-8 from most
    > Unicode code points does not mean it produces UTF-8.

    Increasing the number of encodings means more opportunities for
    mislabeling and for using the wrong libraries to process data (since
    it works "in most cases" and thus the error is not detected
    immediately), and a harder life for programs that aim to support all
    data.

    Think further than the immediate moment where many people are
    performing a transition from something to UTF-8. Look what happened
    with the interpretation of HTML in web browsers.

    If the standard had from the beginning stood firmly against
    "guessing" what malformed HTML was supposed to mean, then people
    would have learned how to produce correct HTML and the interpretation
    would be unambiguous. But browsers tried to accept arbitrary content
    and interpret whatever parts of HTML they found there, guessing how
    errors should be resolved, being "friendly" to careless webmasters.
    The effect is that too often webmasters published a page after
    checking that it worked in their browser, when in fact it had basic
    syntax errors. Other browsers interpreted the errors differently, and
    the page was inaccessible or looked bad.

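    When designing XML, they learned from this mistake: a conforming XML
    processor must treat a well-formedness error as fatal instead of
    guessing what the author meant.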
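
A small Python sketch of the contrast, assuming the standard-library parsers as stand-ins for a browser and an XML processor: the lenient HTML parser accepts mismatched tags without complaint, while the XML parser refuses the same input. The sample markup is invented for the example.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# <b> is never closed before </p>: a basic syntax error.
malformed = "<p><b>bold</p>"

# HTML-style leniency: the parser consumes the input and raises nothing.
HTMLParser().feed(malformed)
html_accepted = True

# XML-style strictness: the same input is a fatal well-formedness error.
try:
    ET.fromstring(malformed)
    xml_accepted = True
except ET.ParseError:
    xml_accepted = False
```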

    That's why people here reject balkanization of UTF-8 by introducing
    variations with subtle differences, like Java-modified UTF-8.
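
As a concrete case of such a subtle difference, the sketch below encodes text the way Java's modified UTF-8 does: U+0000 becomes the two bytes C0 80, and characters above U+FFFF are written as two 3-byte surrogate encodings. The function name `modified_utf8_encode` is hypothetical and exists only for this example; a strict UTF-8 decoder rejects both outputs.

```python
def modified_utf8_encode(text: str) -> bytes:
    """Hypothetical helper: encode text in the style of Java's modified UTF-8."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp == 0:
            # U+0000 is written as an overlong two-byte form.
            out += b"\xc0\x80"
        elif cp >= 0x10000:
            # Supplementary characters: encode each UTF-16 surrogate as 3 bytes.
            v = cp - 0x10000
            for unit in (0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF)):
                out += bytes([
                    0xE0 | (unit >> 12),
                    0x80 | ((unit >> 6) & 0x3F),
                    0x80 | (unit & 0x3F),
                ])
        else:
            # Everything else matches standard UTF-8.
            out += ch.encode("utf-8", errors="surrogatepass")
    return bytes(out)


def strict_utf8_accepts(data: bytes) -> bool:
    """True if a strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


nul_bytes = modified_utf8_encode("a\x00")
emoji_bytes = modified_utf8_encode("\U0001F600")
plain_bytes = modified_utf8_encode("zażółć")
```

For ordinary text the two encodings agree byte for byte, which is exactly why the mismatch can go unnoticed for a long time.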

    > Inaccessible filenames are something we shouldn't accept. All your
    > discussion of non-empty empty directories is just approaching the problem
    > from the wrong end. One should fix the root cause, not consequences.

    The root cause is that users and programs use different encodings in
    different places, and thus Unix filenames can't be unambiguously and
    context-freely interpreted as character sequences.
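
A minimal Python sketch of that ambiguity, using made-up filename bytes: the same byte sequence yields different character sequences depending on which encoding the reader assumes, and a name written under one encoding may not be decodable at all under another.

```python
# A filename as the kernel stores it: just bytes, no encoding label.
name_bytes = b"\xc3\xa9"

# The same bytes read under two different assumptions.
as_utf8 = name_bytes.decode("utf-8")      # one character: é
as_latin1 = name_bytes.decode("latin-1")  # two characters: Ã©

# A name created by a Latin-1 user is not valid UTF-8 at all.
legacy_name = "é".encode("latin-1")
try:
    legacy_name.decode("utf-8")
    legacy_is_utf8 = True
except UnicodeDecodeError:
    legacy_is_utf8 = False
```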

    Unfortunately it's hard to fix.

       __("<         Marcin Kowalczyk
