Re: Roundtripping in Unicode

From: Marcin 'Qrczak' Kowalczyk (qrczak@knm.org.pl)
Date: Sat Dec 11 2004 - 10:44:27 CST

Next message: Peter R. Mueller-Roemer: "infinite combinations, was Re: Nicest UTF"

Previous message: Carl W. Brown: "RE: Software support costs (was: Nicest UTF"
In reply to: Lars Kristan: "RE: Roundtripping in Unicode"
Next in thread: Lars Kristan: "RE: Roundtripping in Unicode"

Lars Kristan <lars.kristan@hermes.si> writes:

>> It's essential that any UTF-n can be translated to any other without
>> loss of data. Because it allows to use an implementation of the given
>> functionality which represents data in any form, not necessarily the
>> form we have at hand, as long as correctness is concerned. Avoiding
>> conversion should matter only for efficiency, not for correctness.
>
> When I am talking about roundtrip, I speak of arbitrary data, not
> just valid data.

You want to declare all byte sequences valid. But then valid data is
no longer preserved on a round trip, because the different UTFs become
able to encode different sequences of code points.
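
For comparison, this is the property that holds today for valid data.
The sketch below is only an illustration, assuming Python's built-in
strict codecs as a stand-in for any conforming converter; the helper
name transcode and the sample text are made up for the example.

```python
def transcode(data: bytes, src: str, dst: str) -> bytes:
    """Decode strictly from one UTF and encode strictly into another."""
    return data.decode(src).encode(dst)


# Sample text (an assumption for this sketch): BMP letters plus one
# supplementary character, U+1D11E, which needs a surrogate pair in UTF-16.
text = "Za\u017c\u00f3\u0142\u0107 \U0001D11E"

utf8 = text.encode("utf-8")
utf16 = transcode(utf8, "utf-8", "utf-16-le")
utf32 = transcode(utf16, "utf-16-le", "utf-32-le")
back = transcode(utf32, "utf-32-le", "utf-8")

# Valid data survives the trip through every UTF byte for byte.
print(back == utf8)
```

With strict codecs, no valid input can take a path through the three
forms that changes it, which is what makes the choice of internal
representation a matter of efficiency only.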

> Roundtrip for valid data is of course essential and needs to be
> preserved.

Your proposal does not do this.

>> Unpaired surrogates are not valid UTF-16, and there are no surrogates
>> in UTF-8 at all, so there is no point in trying to preserve UTF-16
>> which is not really UTF-16.
>
> Actually, there is a point. It is just that you fail to understand it.
> But then, you needn't worry about it, since it is outside of your area
> of interest.

I would worry if my programs no longer accepted what Unicode considers
valid UTF-n. And I would worry if rules defined by Unicode made U+xxxx
encodable as UTF-n and U+yyyy encodable too, but the sequence
U+xxxx U+yyyy not encodable (because UTF-n would then no longer be
usable as a format for serializing arbitrary strings of valid code
points).
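
A hedged illustration of that failure mode, assuming Python's
non-standard "surrogatepass" error handler as a stand-in for a relaxed
UTF-n that admits lone surrogates (this is not the scheme under
discussion, only an existing analogue). The helper name roundtrip and
the "<" ">" padding are choices made for this sketch.

```python
HIGH, LOW = "\ud800", "\udc00"   # one high and one low surrogate code point


def roundtrip(s: str, codec: str) -> str:
    """Encode and decode s with a relaxed codec that lets surrogates through."""
    wrapped = "<" + s + ">"      # padding so a surrogate never ends the data
    data = wrapped.encode(codec, "surrogatepass")
    return data.decode(codec, "surrogatepass")[1:-1]


# Each code point alone survives the relaxed UTF-16 ...
each_ok_utf16 = (roundtrip(HIGH, "utf-16-le") == HIGH
                 and roundtrip(LOW, "utf-16-le") == LOW)
# ... but the two-element sequence comes back as the single code point U+10000.
seq_ok_utf16 = roundtrip(HIGH + LOW, "utf-16-le") == HIGH + LOW
# The relaxed UTF-8 keeps the same sequence as two code points.
seq_ok_utf8 = roundtrip(HIGH + LOW, "utf-8") == HIGH + LOW

print(each_ok_utf16, seq_ok_utf16, seq_ok_utf8)
```

Under these relaxed rules the two forms disagree about which
sequences of code points they can carry, so converting between them
is no longer lossless.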

I would also worry if an API, file format or network protocol intended
for use by various programs required a non-standard variant of UTF-n,
because I couldn't use standard UTF-n encoding and decoding functions
to interoperate with it.

Indeed, I don't worry about how you abuse UTF-n, as long as it is not
an official Unicode standard and is not widely used in practice.

> If UTC takes 128 unassigned codepoints and declares them to be a new
> set of surrogates, you needn't worry either (your valid data will
> still convert to any UTF).

No, because it would remove the responsibility not to generate such
data and add the responsibility to accept it, and thus some programs
which are not currently broken would be broken under the changed rules.

> Unless you have a strict validator which already validates unpaired
> surrogates. But you don't. I am pretty sure about it.

I use the system-supplied iconv(), which does not accept anything that
could be described as an unpaired surrogate.
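
The same kind of strictness can be sketched without iconv(). The code
below assumes Python's built-in strict decoders as a stand-in (it does
not call iconv, and says nothing about any particular iconv
implementation); the function name is_strict_utf and the byte strings
are made up for the example.

```python
def is_strict_utf(data: bytes, codec: str) -> bool:
    """Return True only if data is well-formed in the given UTF."""
    try:
        data.decode(codec)       # strict error handling is the default
    except UnicodeDecodeError:
        return False
    return True


# 'A', an unpaired high surrogate D800, 'B' as UTF-16LE code units.
lone_high_utf16 = b"\x41\x00\x00\xd8\x42\x00"
# The same surrogate written as a three-byte UTF-8-style sequence.
surrogate_in_utf8 = b"A\xed\xa0\x80B"
# A properly paired D800 DC00, which is U+10000.
valid_pair_utf16 = b"\x00\xd8\x00\xdc"

print(is_strict_utf(lone_high_utf16, "utf-16-le"))    # rejected
print(is_strict_utf(surrogate_in_utf8, "utf-8"))      # rejected
print(is_strict_utf(valid_pair_utf16, "utf-16-le"))   # accepted
```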

> If a user encounters corrupt data and cannot process it with your
> program, she ("she" is 'politically correct', but in this case can
> be seen as sexism) will blame it on the program, not the data.

I don't care.

> This has been discussed mails back. UNIX filenames are already 'submitted'.
> Once you set your locale to UTF-8, you have labelled them all as UTF-8.
> Suggestions?

Convert them to valid UTF-8 (that is, as long as the locales used on
the system use UTF-8 as the encoding; otherwise keep them in the
locale's encoding).
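
One possible shape of such a conversion, as a rough sketch only. It
assumes that names which are not valid UTF-8 were written in a single
known legacy encoding; ISO-8859-2 is picked here purely as an example,
and the function name to_valid_utf8 is hypothetical. It works on the
name bytes only and renames nothing on disk.

```python
def to_valid_utf8(name: bytes, legacy_encoding: str = "iso-8859-2") -> bytes:
    """Return name unchanged if it is valid UTF-8, else transcode it.

    legacy_encoding is an assumption: the encoding that old names on
    this system are believed to use.
    """
    try:
        name.decode("utf-8")     # strict check, result discarded
        return name
    except UnicodeDecodeError:
        return name.decode(legacy_encoding).encode("utf-8")


old_name = b"\xbf\xf3\xb3w"                          # ISO-8859-2 bytes
new_name = to_valid_utf8(old_name)
already_utf8 = "\u017c\u00f3\u0142w".encode("utf-8")

print(new_name == already_utf8)                      # transcoded to the same name
print(to_valid_utf8(already_utf8) == already_utf8)   # valid names are untouched
```

A legacy name whose bytes happen to form valid UTF-8 would be left
alone by this check, so a real migration would need more than this
heuristic.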

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak@knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/


This archive was generated by hypermail 2.1.5 : Sat Dec 11 2004 - 10:50:16 CST