Re: Roundtripping in Unicode

From: Marcin 'Qrczak' Kowalczyk (qrczak@knm.org.pl)
Date: Sat Dec 11 2004 - 08:08:32 CST

Next message: Lars Kristan: "RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)"

Previous message: Lars Kristan: "Roundtripping in Unicode (was RE: Invalid UTF-8 sequences)"
In reply to: Lars Kristan: "Roundtripping in Unicode (was RE: Invalid UTF-8 sequences)"
Next in thread: Arcane Jill: "Re: Roundtripping in Unicode"
Maybe reply: Arcane Jill: "Re: Roundtripping in Unicode"
Maybe reply: Kenneth Whistler: "Re: Roundtripping in Unicode"

Lars Kristan <lars.kristan@hermes.si> writes:

> The other name for this is roundtripping. Currently, Unicode allows
> a roundtrip UTF-16=>UTF-8=>UTF-16. For any data. But there are
> several reasons why a UTF-8=>UTF-16(32)=>UTF-8 roundtrip is more
> valuable, even if it means that the other roundtrip is no longer
> guaranteed:

It's essential that any UTF-n can be translated to any other without
loss of data, because this lets us use an implementation of a given
functionality that represents data in any form, not necessarily the
form we have at hand, without affecting correctness. Avoiding
conversion should matter only for efficiency, not for correctness.
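
To make the interchangeability concrete, here is a minimal sketch in
Python. The helper name transcode is my own, and it relies only on
the strict built-in codecs; it is an illustration, not a reference
implementation.

```python
def transcode(data: bytes, src: str, dst: str) -> bytes:
    """Convert between two UTF-n encoding forms via a strict decode/encode."""
    return data.decode(src).encode(dst)

# BMP text plus one supplementary character (U+1D11E).
text = "za\u017c\u00f3\u0142\u0107 \U0001D11E"
utf8 = text.encode("utf-8")

# Go around the loop UTF-8 -> UTF-16 -> UTF-32 -> UTF-8.
utf16 = transcode(utf8, "utf-8", "utf-16-le")
utf32 = transcode(utf16, "utf-16-le", "utf-32-le")
back = transcode(utf32, "utf-32-le", "utf-8")

# Valid input survives every hop unchanged.
assert back == utf8
```

Any starting form gives the same result for valid input, which is
why the choice of internal form can be left to efficiency.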

> Let me go a bit further. A UTF-16=>UTF-8=>UTF-16 roundtrip is only
> required for valid codepoints other than the surrogates. But it also
> works for surrogates unless you explicitly and intentionally break it.

Unpaired surrogates are not valid UTF-16, and there are no surrogates
in UTF-8 at all, so there is no point in trying to preserve UTF-16
which is not really UTF-16.
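
As an illustration, assuming Python 3 and its strict built-in codecs
(the function name is_valid_utf16le is hypothetical), a lone
surrogate is rejected on both sides of the conversion:

```python
def is_valid_utf16le(data: bytes) -> bool:
    """True if data decodes as well-formed little-endian UTF-16."""
    try:
        data.decode("utf-16-le")
        return True
    except UnicodeDecodeError:
        return False

lone_high = b"\x00\xd8"          # the single 16-bit unit 0xD800
proper_pair = b"\x34\xd8\x1e\xdd"  # 0xD834 0xDD1E, i.e. U+1D11E

def can_encode_utf8(s: str) -> bool:
    """True if the string can be encoded as UTF-8 by the strict codec."""
    try:
        s.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False
```

Here is_valid_utf16le(lone_high) is False while the proper pair is
accepted, and can_encode_utf8("\ud800") is False because UTF-8 has
no representation for a surrogate code point.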

> I would opt for the latter (i.e. keep it working), according to my
> statement (in the thread "When to validate") that validation should
> be separated from other processing, where possible.

Surely it should be separated: validation is only necessary when data
is passed from the external world into our system. Internal operations
should not produce invalid data from valid data, so you don't have to
check at each point whether data is valid. You can assume that it is
always valid, as long as the combination of the programming language,
libraries and the program is not broken.
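
A sketch of that separation, in Python. The names read_external and
shout are invented for the example; the only assumption is that the
external data is supposed to be UTF-8.

```python
def read_external(raw: bytes) -> str:
    """Boundary: validate once. Malformed input raises UnicodeDecodeError,
    which is a subclass of ValueError."""
    return raw.decode("utf-8")  # the default error mode is 'strict'

def shout(s: str) -> str:
    """Internal operation: valid text in, valid text out, no re-checking."""
    return s.upper() + "!"

def rejects(raw: bytes) -> bool:
    """Helper for the example: True if the boundary refuses the input."""
    try:
        read_external(raw)
    except ValueError:
        return True
    return False
```

Only read_external ever looks at raw bytes; everything behind it
works on text that has already been checked.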

Some languages make it easier to ensure that strings are valid, to the
point that they guarantee it (they don't offer any way to construct
an invalid string). Unfortunately many languages don't: they say that
they represent strings in UTF-8 or UTF-16, but they are unsafe. They
do nothing to prevent constructing an array of code units which is not
valid UTF-8 or UTF-16 and passing it to functions which assume that
it is. Blame these languages, not the definitions of UTF-n.

> A UTF-32=>UTF-8=>UTF-32 roundtrip is similar, except that 16-8-16 works even
> with concatenation, while 32-8-32 can be broken with concatenation.

It always works as long as data was really UTF-32 in the first place.
A word with a value of 0x0000D800 is not UTF-32.
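
Spelled out as a check on 32-bit words (a sketch; the function names
are mine, the ranges are the usual ones: at most 0x10FFFF and outside
0xD800..0xDFFF):

```python
def is_utf32_code_unit(word: int) -> bool:
    """True if a 32-bit word is a Unicode scalar value."""
    in_range = 0 <= word <= 0x10FFFF
    is_surrogate = 0xD800 <= word <= 0xDFFF
    return in_range and not is_surrogate

def is_valid_utf32(words) -> bool:
    """True if every word in the sequence is a valid UTF-32 code unit."""
    return all(is_utf32_code_unit(w) for w in words)

a = [0x41, 0x1D11E]
b = [0x17C, 0x42]
```

Because the test is applied word by word, concatenating two valid
sequences (a + b) always gives a valid sequence; only input such as
[0x0000D800] fails, and that was never UTF-32.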

> All this is known and presents no problems, or - only problems that
> can be kept under control. So, by introducing another set of 128
> 'surrogates', we don't get a new type of a problem, just another
> instance of a well known one.

Nonsense. UTF-8, UTF-16 and UTF-32 are interchangeable, and you would
like to break this. No way.

> On the other hand, UTF-8=>UTF-16=>UTF-8 as well as UTF-8=>UTF-32=>UTF-8
> can be both achieved, with no exceptions. This is something no other
> roundtrip can offer at the moment.

But they do! An isolated byte with the highest bit set is not UTF-8,
so there is no point in converting it to UTF-16 and back.
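
For illustration, Python's strict UTF-8 decoder refuses such a byte
and reports where it is. The helper name first_utf8_error is made up
for this sketch.

```python
def first_utf8_error(data: bytes):
    """Return (offset, offending bytes) for the first malformed sequence,
    or None if data is well-formed UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as e:
        return e.start, data[e.start:e.end]
    return None

good = "abc\u0107".encode("utf-8")
bad = b"abc\x80def"  # 0x80 is a continuation byte with no lead byte
```

first_utf8_error(good) is None, while for bad the reported offset is
3, the position of the isolated byte.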

> On top of it, I repeatedly stressed that it is UTF-8 data that has the
> highest probability of any of the following:
> * contains portions that are not UTF-8
> * is not really UTF-8, but user has UTF-8 set as default encoding
> * is not really UTF-8, but was marked as such
> * a transmission error not only changes data but also creates invalid
> sequences

In these cases the data is broken and the damage should be signalled
as soon as possible, so that the submitter learns of it and can
correct it.

Alternatively you keep the original byte sequence, but don't pretend
that it's UTF-8. Delete the erroneous UTF-8 label instead of changing
the data.
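
One way this could look, as a sketch only: the Payload type and the
relabel function are hypothetical names, and "charset is None" is my
own convention for "unlabelled octets".

```python
from typing import NamedTuple, Optional

class Payload(NamedTuple):
    data: bytes             # always the original bytes, never modified
    charset: Optional[str]  # None means: opaque octets, no encoding claimed

def relabel(data: bytes, claimed: str = "utf-8") -> Payload:
    """Keep the bytes as they are; keep the label only if it is true."""
    try:
        data.decode(claimed)
    except UnicodeDecodeError:
        return Payload(data, None)  # drop the wrong label, not the data
    return Payload(data, claimed)
```

The bytes roundtrip trivially because nothing ever converts them;
only code that sees charset == "utf-8" may treat them as text.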

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak@knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/


This archive was generated by hypermail 2.1.5 : Sat Dec 11 2004 - 08:14:35 CST