Re: Roundtripping in Unicode

From: Marcin 'Qrczak' Kowalczyk
Date: Sat Dec 11 2004 - 10:44:27 CST


    Lars Kristan writes:

    >> It's essential that any UTF-n can be translated to any other without
    >> loss of data, because this lets us use an implementation of a given
    >> functionality that represents data in any form, not necessarily the
    >> form we have at hand, as far as correctness is concerned. Avoiding
    >> conversion should matter only for efficiency, not for correctness.
    > When I am talking about roundtrip, I speak of arbitrary data, not
    > just valid data.

    You want to declare all byte sequences as valid. Valid data would then
    no longer be preserved on round trip, because the different UTFs would
    be able to encode different sets of code point sequences.
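
    To make the round-trip property concrete, here is a minimal sketch.
    It assumes Python's built-in strict codecs as a stand-in for any
    conforming converter; the helper name transcode and the sample text
    are made up for illustration.

```python
def transcode(data: bytes, src: str, dst: str) -> bytes:
    """Decode with a strict codec, then re-encode in another UTF."""
    return data.decode(src).encode(dst)


# BMP text plus one supplementary-plane character (U+1D11E).
text = "Za\u017c\u00f3\u0142\u0107 g\u0119\u015bl\u0105 \U0001D11E"
utf8 = text.encode("utf-8")

# Walk UTF-8 -> UTF-16 -> UTF-32 -> UTF-8.
utf16 = transcode(utf8, "utf-8", "utf-16-le")
utf32 = transcode(utf16, "utf-16-le", "utf-32-le")
back = transcode(utf32, "utf-32-le", "utf-8")

# For valid data every hop is lossless, so we end where we started.
roundtrip_ok = back == utf8
```

    Since every valid string survives the walk, a program may pick
    whichever UTF is convenient internally without affecting correctness.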

    > Roundtrip for valid data is of course essential and needs to be
    > preserved.

    Your proposal does not do this.

    >> Unpaired surrogates are not valid UTF-16, and there are no surrogates
    >> in UTF-8 at all, so there is no point in trying to preserve UTF-16
    >> which is not really UTF-16.
    > Actually, there is a point. It is just that you fail to understand it.
    > But then, you needn't worry about it, since it is outside of your area
    > of interest.

    I would worry if my programs would no longer accept what Unicode
    considers valid UTF-n. And I would worry if rules defined by Unicode
    would make U+xxxx encodable as UTF-n, U+yyyy encodable too, but the
    sequence U+xxxx U+yyyy not encodable (because UTF-n would no longer
    be usable as a format for serialization of arbitrary strings of valid
    code points).
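
    The concatenation problem can be shown with a small sketch. It
    assumes Python's "surrogatepass" error handler as a stand-in for a
    lenient UTF-16 that admits unpaired surrogates; this is only an
    illustration, not the scheme proposed in this thread, and the helper
    name lenient_roundtrip is made up.

```python
HIGH = "\ud800"  # lone high surrogate
LOW = "\udc00"   # lone low surrogate


def lenient_roundtrip(s: str) -> str:
    """Encode to UTF-16 and back while letting lone surrogates through."""
    raw = s.encode("utf-16-le", "surrogatepass")
    return raw.decode("utf-16-le", "surrogatepass")


# Each code point on its own survives the lenient round trip.
each_survives = (lenient_roundtrip(HIGH) == HIGH
                 and lenient_roundtrip(LOW) == LOW)

# The two-code-point sequence does not: its UTF-16 form is byte-for-byte
# a valid surrogate pair, so it decodes as the single code point U+10000.
pair = HIGH + LOW
merged = lenient_roundtrip(pair)
sequence_survives = merged == pair
```

    Here each_survives is true while sequence_survives is false, which
    is the "U+xxxx and U+yyyy encodable, but not U+xxxx U+yyyy" situation.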

    I would also worry if an API, file format or network protocol intended
    for use by various programs required a non-standard variant of UTF-n,
    because I couldn't use standard UTF-n encoding and decoding functions
    to interoperate with it.

    I indeed don't worry about how you abuse UTF-n, as long as it's not
    an official Unicode standard and it's not widely used in practice.

    > If UTC takes 128 unassigned codepoints and declares them to be a new
    > set of surrogates, you needn't worry either (your valid data will
    > still convert to any UTF).

    No, because it would remove the responsibility not to generate such
    data and add the responsibility to accept it, and thus some programs
    which are not currently broken would be broken under the changed rules.

    > Unless you have a strict validator which already validates unpaired
    > surrogates. But you don't. I am pretty sure about it.

    I use the system-supplied iconv(), which does not accept anything that
    could be described as unpaired surrogates.
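
    A sketch of what such strict behaviour looks like. This uses Python's
    strict codecs rather than iconv() itself, so it shows the kind of
    check meant here and not the behaviour of any particular iconv
    implementation; the helper name is_strict_utf is made up.

```python
def is_strict_utf(data: bytes, encoding: str) -> bool:
    """Return True only if data decodes under the strict codec."""
    try:
        data.decode(encoding)  # errors="strict" is the default
    except UnicodeDecodeError:
        return False
    return True


# U+D800 followed by 'A' in UTF-16LE: an unpaired high surrogate.
lone_high_utf16 = b"\x00\xd8\x41\x00"

# The three bytes U+D800 would occupy if UTF-8 allowed surrogates.
surrogate_in_utf8 = b"\xed\xa0\x80"

# U+1D11E as a proper surrogate pair (D834 DD1E) in UTF-16LE.
valid_pair_utf16 = b"\x34\xd8\x1e\xdd"

rejected_utf16 = not is_strict_utf(lone_high_utf16, "utf-16-le")
rejected_utf8 = not is_strict_utf(surrogate_in_utf8, "utf-8")
accepted_pair = is_strict_utf(valid_pair_utf16, "utf-16-le")
```

    Both unpaired forms are rejected, while the properly paired input is
    accepted.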

    > If a user encounters corrupt data and cannot process it with your
    > program, she ("she" is 'politically correct', but in this case can
    > be seen as sexism) will blame it on the program, not the data.

    I don't care.

    > This has been discussed mails back. UNIX filenames are already 'submitted'.
    > Once you set your locale to UTF-8, you have labelled them all as UTF-8.
    > Suggestions?

    Convert them to be valid UTF-8 (as long as locales used in the system
    use UTF-8 as the encoding, that is, otherwise keep them in the locale's

       __("<         Marcin Kowalczyk

    This archive was generated by hypermail 2.1.5 : Sat Dec 11 2004 - 10:50:16 CST