RE: Roundtripping in Unicode

From: Lars Kristan
Date: Mon Dec 13 2004 - 04:31:38 CST


    Marcin 'Qrczak' Kowalczyk wrote:
    > You are trying to stick with processing byte sequences, carefully
    > preserving the storage format instead of preserving the meaning in
    > terms of Unicode characters. This leads to less robust software
    > which is not certain about the encoding of texts it processes and
    > thus can't apply algorithms like case mapping without risking doing
    > a meaningless damage to the text.
    I am not proposing that this approach is better or that it should be used
    generally. What I am saying is that this approach is, unfortunately, needed
    in order to make the transition easier. The fact is that data currently
    exists that cannot be converted easily. Over-robust software, in my
    opinion, can be impractical and might not be accepted with open arms. We
    should acknowledge the fact that some products will choose a different
    path. You can say these applications will be less robust, but we should
    really give the users a choice and let them decide what they want.

    > Conversion should signal an error by default. Replacing errors by
    > U+FFFD should be done only when the data is processed purely for
    > showing it to the user, without any further processing, i.e. when it's
    > better to show the text partially even if we know that it's corrupted.
    I think showing it to the user is not the only case where you need to use
    U+FFFD. A text viewer could do the replacement when reading the file and do
    all further processing in Unicode. But an editor cannot. Keeping the text
    in its original binary form is far from practical and opens numerous
    possibilities for bugs. But, as I have said before, you can do it with
    UTF-8: you simply keep the invalid sequences as they are, and handle them
    differently only when you actually process or display them. But you cannot
    do this in UTF-16, since you cannot preserve all the data.
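    The asymmetry can be sketched in Python (the example data is hypothetical):
    invalid bytes ride along unchanged in a UTF-8 byte string, but a strict
    conversion, which you would need on the way to UTF-16, rejects them,
    leaving nothing to preserve.

```python
# A byte string that is mostly UTF-8 but contains one invalid byte, 0xFF
# (hypothetical example data).
data = b"caf\xc3\xa9-\xff-file"

# As long as the text stays in UTF-8 byte form, the invalid byte simply
# rides along; nothing forces us to interpret it.
assert b"\xff" in data

# A strict conversion to Unicode code points (the step you would need to
# reach UTF-16) rejects the byte, so the sequence cannot be preserved.
try:
    data.decode("utf-8")  # strict by default
    preserved = True
except UnicodeDecodeError:
    preserved = False

assert preserved is False
```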

    As for signalling - in some cases signalling is impossible. Listing the
    files in a directory should not signal anything. It MUST return all files,
    and it should return them in a way that allows the list to be used to
    access each of the files.

    > > Either you do everything in UTF-8, or everything in UTF-16. Not
    > > always, but typically. If comparisons are not always done in the
    > > same UTF, then you need to validate. And not validate while
    > > converting, but validate on its own. And now many designers will
    > > remember that they didn't. So, all UTF-8 programs (of that kind)
    > > will need to be fixed. Well, might as well adopt my broken
    > > conversion and fix all UTF-16 programs. Again, of that kind, not all
    > > in general, so there are few. And even those would not be all
    > > affected. It would depend on which conversion is used where. Things
    > > could be worked out. Even if we would start changing all the
    > > conversions. Even more so if a new conversion is added and only used
    > > when specifically requested.
    > I don't understand anything of this.
    Let's start with UTF-8 usernames. This is a likely scenario, since I think
    UTF-8 will typically be used in network communication. If you store the
    usernames in UTF-16, the conversion will signal an error, so you will not
    have any users with invalid UTF-8 sequences, nor will any invalid sequence
    be able to match any user. If you later start comparing usernames somewhere
    else, in UTF-8, then you must not only strcmp them but also validate each
    string. This is simply a fact and I am not complaining about it.
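    The validation step mentioned here can be as simple as a trial decode; a
    minimal Python sketch (the helper name and example data are hypothetical):

```python
def is_valid_utf8(b: bytes) -> bool:
    """Return True if b is a well-formed UTF-8 sequence."""
    try:
        b.decode("utf-8")  # strict decode; raises on any ill-formed sequence
        return True
    except UnicodeDecodeError:
        return False

# Byte-wise comparison works for any bytes, valid or not, so comparing
# alone tells you nothing about well-formedness; validation is separate.
a = b"\xc3\xa9"   # valid UTF-8 for U+00E9
b_ = b"\xff"      # not valid UTF-8, but compares fine as bytes
assert a == a and b_ == b_
assert is_valid_utf8(a)
assert not is_valid_utf8(b_)
```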

    In the opposite case, if you would have UTF-8 storage and UTF-16
    communication, and any comparisons would be done in UTF-16, you again need
    to validate the UTF-16 strings.
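    Validating UTF-16 amounts to checking for unpaired surrogates. A minimal
    sketch in Python, whose strings may contain lone surrogate code points,
    using a trial encode (the helper name is hypothetical):

```python
def is_valid_utf16_text(s: str) -> bool:
    """Reject strings containing unpaired surrogate code points."""
    try:
        s.encode("utf-16")  # Python's UTF-16 encoder rejects lone surrogates
        return True
    except UnicodeEncodeError:
        return False

assert is_valid_utf16_text("abc\u00e9")
assert not is_valid_utf16_text("abc\ud800")  # lone high surrogate
```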

    Now I am supposing that there are such applications already out there. And
    that some of them do not validate (or validate only in conversion, but not
    when comparing or otherwise processing native strings).

    They should be analyzed and fixed. At the time I wrote the above paragraph,
    I thought UTF-16 programs don't need to validate, but that is not true, so
    all these applications need to be fixed, if they are not already
    validating.

    Now, suppose my 'broken' conversion is standardized. As an option, not for
    UTF-16 to UTF-8 conversion. If you don't start using it, the existing rules
    continue to apply.

    The interesting thing is that if you do start using my conversion, you can
    actually get rid of the need to validate UTF-8 strings in the first
    scenario. That of course means you will allow users with invalid UTF-8
    sequences, but if one determines that this is acceptable (or even desired),
    then it makes things easier. But the choice is yours.
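    A conversion along these lines was later adopted by Python as PEP 383's
    surrogateescape error handler, which maps each invalid byte 0xNN to the
    lone surrogate U+DCNN. Using it as a stand-in, a sketch of the property
    claimed above: the roundtripping conversion keeps distinct invalid
    sequences distinct, whereas U+FFFD replacement collapses them (the example
    byte strings are hypothetical).

```python
# Two different invalid UTF-8 inputs (hypothetical example data).
b1 = b"name-\x80"
b2 = b"name-\x81"

# Replacing errors with U+FFFD is lossy: both inputs collapse to the
# same string, so comparisons in the converted form can falsely match.
assert b1.decode("utf-8", "replace") == b2.decode("utf-8", "replace")

# A roundtripping conversion (surrogateescape: invalid byte 0xNN becomes
# the unpaired surrogate U+DCNN) keeps the two names distinct...
s1 = b1.decode("utf-8", "surrogateescape")
s2 = b2.decode("utf-8", "surrogateescape")
assert s1 != s2

# ...and restores the original bytes exactly.
assert s1.encode("utf-8", "surrogateescape") == b1
assert s2.encode("utf-8", "surrogateescape") == b2
```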

    For the second scenario, things do indeed become a bit more complicated,
    but they can be solved. And there is still a number of choices you can make
    about the level of validation. And, again, one of them is that you keep
    using the existing conversion and the existing validation.

    > > I cannot afford not to access the files.
    > Then you have two choices:
    > - Don't use Unicode.
    As soon as a Windows system enters the picture, it is practically
    impossible not to use Unicode. The same holds when a UNIX user uses a UTF-8
    locale.

    > - Pretend that filenames are encoded in ISO-8859-1, and represent them
    > as a sequence of code points U+0001..U+00FF. They will not
    > be displayed
    > correctly but the information will be preserved.
    Been there, done that. Works in one way, but not the other. And becomes
    increasingly less useful as more and more data is in UTF-8.


    This archive was generated by hypermail 2.1.5 : Mon Dec 13 2004 - 04:35:40 CST