Re: Roundtripping Solved

From: Peter Kirk (peterkirk@qaya.org)
Date: Thu Dec 16 2004 - 09:24:38 CST

Next message: Mike Ayers: "RE: Roundtripping Solved"

Previous message: Peter Kirk: "Re: Roundtripping Solved"
In reply to: Arcane Jill: "Re: Roundtripping Solved"
Next in thread: Philippe Verdy: "Re: Roundtripping Solved"

On 16/12/2004 13:12, Arcane Jill wrote:

> I don't think that that last sentence is true. f(), and its
> near-inverse, g(), do not claim to be UTFs, and are functions intended
> to be used only by one particular suite of applications. They are
> therefore nothing to do with Unicode or the UTC (... or even this list
> ! ). ...

But Lars is continuing to insist on 128 reserved characters in the BMP.
That is relevant to the UTC.

He now seems to want to take them from the Yi Extensions block, and
seems to be prepared to take the risk of being assassinated by the Yi,
although not by other nations. Well, I don't know much about the Yi, but
I did find "The Yi have long been known as fierce warriors." They are
not a dead people who can't fight back against being pushed out of the
BMP. And no doubt Michael Everson will also fight fiercely for the Yi
Extensions block. So, be careful, Lars!

> ...The fact that I defined f such that f(s) == utf8decode(s) for all
> valid UTF-8 streams s does not change the status of f() as a purely
> private-use function.
>
> These are the steps I see happening:
> (1) start with an arbitrary octet stream
> (2) "escape" it, using some function (which I have called f), to yield
> a valid UTF-8 stream.
> (3) allow normal Unicode functions to round-trip this UTF-8 string
> through UTF-16 (one of Lars' requirements)
> (4) finally, "unescape" the UTF-8 using f's inverse function (which I
> called g) to restore the original octet stream
>
> The escape and unescape functions don't need to be approved by anyone.
> I'm not suggesting they should be part of any standard - they are
> merely a mechanism to ensure that step (3) will hold true.
>
These mechanisms, and any escape mechanism, do not meet the requirement
which I codified as "for all valid UTF-8 strings s8, f(s8) =
UTF-16(s8)". If this is not in fact a requirement, your mechanism can be
made to work, and my logical proof against it fails. But perhaps this is
what Lars means by "They don't translate as UTF-8 would to UTF-16": his
reserved characters would be an exception to "for all valid UTF-8
strings s8, f(s8) = UTF-16(s8)". In principle this is a way ahead.

In what follows, I presume that this is still a requirement.

> Lars's current implementation of this scheme is that his "f" "escapes"
> the binary octet 1bbbbbbb to 11101110 1011101b 10bbbbbb (or
> equivalently, byte x becomes the character U+EE00 + x). He is unhappy
> with this because characters in the range U+EE80 to U+EEFF might be
> found in real text. So you and I have, between us, suggested three
> alternative escaping functions, in an attempt to find an escape
> sequence with a vanishingly small probability of being found in real
> text. I'm not quite sure why Lars isn't happy with these suggestions -
> maybe his goal has still not been clearly stated - but either way,
> since nobody is proposing an amendment to UTFs, it surely isn't the
> business of the UTC.
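
For readers following the thread, the four steps and the escape quoted
above can be sketched roughly as follows. This is only an illustration
under stated assumptions, not Lars's actual code: the names f, g and
via_utf16 are hypothetical, and Python's built-in "surrogateescape"
error handler is used purely as a convenient way to locate the
undecodable bytes before moving them to U+EE00 + x.

```python
def f(octets: bytes) -> str:
    """Escape: decode as UTF-8, mapping each undecodable byte x to U+EE00 + x."""
    # "surrogateescape" turns each bad byte x into the lone surrogate U+DC00 + x;
    # strict UTF-8 decoding never yields lone surrogates from valid input.
    text = octets.decode("utf-8", errors="surrogateescape")
    return "".join(
        chr(ord(c) - 0xDC00 + 0xEE00) if 0xDC80 <= ord(c) <= 0xDCFF else c
        for c in text
    )


def g(text: str) -> bytes:
    """Unescape: turn U+EE80..U+EEFF back into single bytes, encode the rest."""
    out = bytearray()
    for c in text:
        cp = ord(c)
        if 0xEE80 <= cp <= 0xEEFF:
            out.append(cp - 0xEE00)
        else:
            out += c.encode("utf-8")
    return bytes(out)


def via_utf16(text: str) -> str:
    """Step (3): round-trip the escaped string through UTF-16."""
    return text.encode("utf-16-le").decode("utf-16-le")


raw = b"abc\x80\xffdef"        # (1) arbitrary octet stream, not valid UTF-8
escaped = f(raw)               # (2) escape to a valid Unicode string
tripped = via_utf16(escaped)   # (3) survives a UTF-16 round trip
restored = g(tripped)          # (4) unescape to the original octets
```

For this input the round trip succeeds, and f agrees with ordinary
UTF-8 decoding on valid input. The input deliberately contains no
genuine characters in U+EE80..U+EEFF.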

The problem can be restated quite simply. Valid UTF-8 has a reversible
one-to-one mapping to valid Unicode character sequences, and to valid
UTF-16. If "f" maps an "invalid UTF-8" string to a valid Unicode
character sequence, then the valid UTF-8 encoding of that same sequence
is also mapped to it by "f". The mapping "f" is then no longer
one-to-one but many-to-one, which implies that there cannot be a
reverse mapping "g". Lars is rightly dissatisfied with any solution
which does not guarantee reversibility.
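
A concrete instance of this collision, using the U+EE00 + x escape
described in the quoted text. This is a sketch only: f here is a
hypothetical stand-in written for illustration, assuming that f must
agree with normal UTF-8 decoding on every valid UTF-8 string.

```python
def f(octets: bytes) -> str:
    """Decode as UTF-8; each undecodable byte x becomes U+EE00 + x."""
    # Python's "surrogateescape" handler marks each bad byte x as U+DC00 + x,
    # which is then shifted into the U+EE80..U+EEFF range.
    text = octets.decode("utf-8", errors="surrogateescape")
    return "".join(
        chr(ord(c) - 0xDC00 + 0xEE00) if 0xDC80 <= ord(c) <= 0xDCFF else c
        for c in text
    )


lone_byte = b"\x80"                      # invalid UTF-8: an isolated trail byte
real_char = "\uee80".encode("utf-8")     # valid UTF-8 for U+EE80: EE BA 80

# Two different octet streams, one image under f:
collide = (lone_byte != real_char) and (f(lone_byte) == f(real_char))
```

Here collide is True: both inputs yield the single character U+EE80,
so no function g can recover both originals from that output.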

I note that this argument applies equally to Lars' favoured solution of
128 special characters. If these are valid Unicode characters, they have
a valid UTF-8 representation. Both this representation and the isolated
bytes will be converted by "f" to the same Unicode characters. This
means that "f" is still not one-to-one and so is not reversible. That is,
unless Lars is actually proposing a change to the standard UTF-8 mapping
for these characters. And if he is, that is certainly a matter for the
UTC. Or of course if he is abandoning "for all valid UTF-8 strings s8,
f(s8) = UTF-16(s8)".
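
To illustrate that last alternative, here is a rough sketch of what
abandoning the requirement could look like. It is one hypothetical
variant, not anything proposed in the thread: f2 treats valid UTF-8
encodings of U+EE80..U+EEFF as if their bytes were invalid and escapes
them byte by byte. The names f2 and g, and the choice of U+EE00 + x as
the reserved range, are assumptions made for the example.

```python
RESERVED = range(0xEE80, 0xEF00)  # escape characters U+EE80..U+EEFF


def f2(octets: bytes) -> str:
    """Escape invalid bytes AND the bytes of genuine reserved characters."""
    text = octets.decode("utf-8", errors="surrogateescape")
    out = []
    for c in text:
        cp = ord(c)
        if 0xDC80 <= cp <= 0xDCFF:
            # undecodable byte x, marked by surrogateescape as U+DC00 + x
            out.append(chr(cp - 0xDC00 + 0xEE00))
        elif cp in RESERVED:
            # genuine reserved character: escape each of its UTF-8 bytes
            out.extend(chr(0xEE00 + b) for b in c.encode("utf-8"))
        else:
            out.append(c)
    return "".join(out)


def g(text: str) -> bytes:
    """Inverse of f2: reserved characters become bytes, the rest is encoded."""
    out = bytearray()
    for c in text:
        cp = ord(c)
        if cp in RESERVED:
            out.append(cp - 0xEE00)
        else:
            out += c.encode("utf-8")
    return bytes(out)


lone_byte = b"\x80"
real_char = "\uee80".encode("utf-8")          # EE BA 80

distinct = f2(lone_byte) != f2(real_char)     # no collision any more
reversible = g(f2(lone_byte)) == lone_byte and g(f2(real_char)) == real_char
requirement_kept = f2(real_char) == "\uee80"  # f(s8) = UTF-16(s8) for this s8?
```

In this sketch distinct and reversible are True while requirement_kept
is False: reversibility is bought precisely by giving up "f(s8) =
UTF-16(s8)" for strings containing the reserved characters.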

-- 
Peter Kirk
peter@qaya.org (personal)
peterkirk@qaya.org (work)
http://www.qaya.org/

