RE: Roundtripping in Unicode

From: Lars Kristan
Date: Mon Dec 13 2004 - 04:31:38 CST


    Marcin 'Qrczak' Kowalczyk wrote:
    > You are trying to stick with processing byte sequences, carefully
    > preserving the storage format instead of preserving the meaning in
    > terms of Unicode characters. This leads to less robust software
    > which is not certain about the encoding of texts it processes and
    > thus can't apply algorithms like case mapping without risking doing
    > a meaningless damage to the text.
    I am not proposing that this approach is better or that it should be used
    generally. What I am saying is that this approach is, unfortunately, needed
    in order to make the transition easier. The fact is that data currently
    exists that cannot be converted easily. Over-robust software, in my
    opinion, can be impractical and might not be accepted with open arms. We
    should acknowledge the fact that some products will choose a different
    path. You can say these applications will be less robust, but we should
    really give the users a choice and let them decide what they want.

    > Conversion should signal an error by default. Replacing errors by
    > U+FFFD should be done only when the data is processed purely for
    > showing it to the user, without any further processing, i.e. when it's
    > better to show the text partially even if we know that it's corrupted.
    I think showing it to the user is not the only case where you need to use
    U+FFFD. A text viewer could do the replacement when reading the file and do
    all further processing in Unicode. But an editor cannot. Keeping the text
    in its original binary form is far from practical and opens numerous
    possibilities for bugs. But, as I have said before, you can do it with
    UTF-8: you simply keep the invalid sequences as they are, and handle them
    differently only when you actually process or display them. But you cannot
    do this in UTF-16, since you cannot preserve all the data.
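    The asymmetry can be sketched in Python (the example data is hypothetical):
    invalid bytes ride along unchanged in a UTF-8 byte string, but a strict
    conversion, which you would need on the way to UTF-16, rejects them,
    leaving nothing to preserve.

```python
# A byte string that is mostly UTF-8 but contains one invalid byte, 0xFF
# (hypothetical example data).
data = b"caf\xc3\xa9-\xff-file"

# As long as the text stays in UTF-8 byte form, the invalid byte simply
# rides along; nothing forces us to interpret it.
assert b"\xff" in data

# A strict conversion to Unicode code points (the step you would need to
# reach UTF-16) rejects the byte, so the sequence cannot be preserved.
try:
    data.decode("utf-8")  # strict by default
    preserved = True
except UnicodeDecodeError:
    preserved = False

assert preserved is False
```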

    As for signalling - in some cases signalling is impossible. Listing the
    files in a directory should not signal anything. It MUST return all files,
    and it should return them in a way that allows the list to be used to
    access each of the files.

    > > Either you do everything in UTF-8, or everything in UTF-16. Not
    > > always, but typically. If comparisons are not always done in the
    > > same UTF, then you need to validate. And not validate while
    > > converting, but validate on its own. And now many designers will
    > > remember that they didn't. So, all UTF-8 programs (of that kind)
    > > will need to be fixed. Well, might as well adopt my broken
    > > conversion and fix all UTF-16 programs. Again, of that kind, not all
    > > in general, so there are few. And even those would not be all
    > > affected. It would depend on which conversion is used where. Things
    > > could be worked out. Even if we would start changing all the
    > > conversions. Even more so if a new conversion is added and only used
    > > when specifically requested.
    > I don't understand anything of this.
    Let's start with UTF-8 usernames. This is a likely scenario, since I think
    UTF-8 will typically be used in network communication. If you store the
    usernames in UTF-16, the conversion will signal an error, so you will not
    have any users with invalid UTF-8 sequences, nor will any invalid sequence
    be able to match any user. If you later start comparing usernames somewhere
    else, in UTF-8, then you must not only strcmp them but also validate each
    string. This is simply a fact and I am not complaining about it.
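    The validation step mentioned here can be as simple as a trial decode; a
    minimal Python sketch (the helper name and example data are hypothetical):

```python
def is_valid_utf8(b: bytes) -> bool:
    """Return True if b is a well-formed UTF-8 sequence."""
    try:
        b.decode("utf-8")  # strict decode; raises on any ill-formed sequence
        return True
    except UnicodeDecodeError:
        return False

# Byte-wise comparison works for any bytes, valid or not, so comparing
# alone tells you nothing about well-formedness; validation is separate.
a = b"\xc3\xa9"   # valid UTF-8 for U+00E9
b_ = b"\xff"      # not valid UTF-8, but compares fine as bytes
assert a == a and b_ == b_
assert is_valid_utf8(a)
assert not is_valid_utf8(b_)
```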

    In the opposite case, if you would have UTF-8 storage and UTF-16
    communication, and any comparisons would be done in UTF-16, you again need
    to validate the UTF-16 strings.
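    Validating UTF-16 amounts to checking for unpaired surrogates. A minimal
    sketch in Python, whose strings may contain lone surrogate code points,
    using a trial encode (the helper name is hypothetical):

```python
def is_valid_utf16_text(s: str) -> bool:
    """Reject strings containing unpaired surrogate code points."""
    try:
        s.encode("utf-16")  # Python's UTF-16 encoder rejects lone surrogates
        return True
    except UnicodeEncodeError:
        return False

assert is_valid_utf16_text("abc\u00e9")
assert not is_valid_utf16_text("abc\ud800")  # lone high surrogate
```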

    Now I am supposing that there are such applications already out there. And
    that some of them do not validate (or validate only in conversion, but not
    when comparing or otherwise processing native strings).

    They should be analyzed and fixed. At the time I wrote the above paragraph,
    I thought UTF-16 programs don't need to validate, but that is not true, so
    all these applications need to be fixed, if they are not already
    validating.

    Now, suppose my 'broken' conversion is standardized. As an option, not for
    UTF-16 to UTF-8 conversion. If you don't start using it, the existing rules
    continue to apply.

    The interesting thing is that if you do start using my conversion, you can
    actually get rid of the need to validate UTF-8 strings in the first
    scenario. That of course means you will allow users with invalid UTF-8
    sequences, but if one determines that this is acceptable (or even desired),
    then it makes things easier. But the choice is yours.
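    A conversion along these lines was later adopted by Python as PEP 383's
    surrogateescape error handler, which maps each invalid byte 0xNN to the
    lone surrogate U+DCNN. Using it as a stand-in, a sketch of the property
    claimed above: the roundtripping conversion keeps distinct invalid
    sequences distinct, whereas U+FFFD replacement collapses them (the example
    byte strings are hypothetical).

```python
# Two different invalid UTF-8 inputs (hypothetical example data).
b1 = b"name-\x80"
b2 = b"name-\x81"

# Replacing errors with U+FFFD is lossy: both inputs collapse to the
# same string, so comparisons in the converted form can falsely match.
assert b1.decode("utf-8", "replace") == b2.decode("utf-8", "replace")

# A roundtripping conversion (surrogateescape: invalid byte 0xNN becomes
# the unpaired surrogate U+DCNN) keeps the two names distinct...
s1 = b1.decode("utf-8", "surrogateescape")
s2 = b2.decode("utf-8", "surrogateescape")
assert s1 != s2

# ...and restores the original bytes exactly.
assert s1.encode("utf-8", "surrogateescape") == b1
assert s2.encode("utf-8", "surrogateescape") == b2
```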

    For the second scenario, things do indeed become a bit more complicated,
    but they can be solved. And there is still a number of choices you can make
    about the level of validation. And, again, one of them is that you keep
    using the existing conversion and the existing validation.

    > > I cannot afford not to access the files.
    > Then you have two choices:
    > - Don't use Unicode.
    As soon as a Windows system enters the picture, it is practically
    impossible not to use Unicode. The same holds when a UNIX user uses a UTF-8
    locale.

    > - Pretend that filenames are encoded in ISO-8859-1, and represent them
    > as a sequence of code points U+0001..U+00FF. They will not
    > be displayed
    > correctly but the information will be preserved.
    Been there, done that. Works in one way, but not the other. And becomes
    increasingly less useful as more and more data is in UTF-8.


    This archive was generated by hypermail 2.1.5 : Mon Dec 13 2004 - 04:35:40 CST