From: Peter Kirk (peterkirk@qaya.org)
Date: Wed Dec 15 2004 - 06:54:04 CST
On 15/12/2004 11:11, Arcane Jill wrote:
> I followed (and understood) Lars's explanation as to why the NOT-xxxx
> solution wouldn't work for him. Shame really - but here's another bash
> at a solution, again without breaking the Unicode model. If I have
> understood this correctly, these are Lars's requirements:
>
> 1) There exists a function, f(), which maps an arbitrary octet stream
> to a sequence of Unicode characters
> 2) A required property of f() is that, if any substring of its input
> is valid UTF-8, then f() must convert that substring to the sequence
> of Unicode characters which would have been obtained by UTF-8 itself.
> 3) There exists an inverse function, g(), such that g(a) == b if and
> only if f(b) == a.
Lars seems to have extended the requirement here such that a can be any
sequence of 16-bit words, just as b can be any sequence of octets, i.e.
he requires not only that g(f(b)) == b for all b, but also that f(g(a))
== a for all a. That may make things much harder! There is at least a
need to deal with unpaired surrogates.
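To make the unpaired-surrogate problem concrete, here is a quick illustration (Python 3, chosen because its str type tolerates lone surrogates while its UTF-8 codec rejects them): if a contains an unpaired surrogate, no octet sequence b can satisfy f(b) == a under plain UTF-8, so f(g(a)) == a for arbitrary 16-bit sequences needs extra machinery.

```python
# UTF-8 by definition cannot encode a lone surrogate code point,
# so an unpaired surrogate can never come out of a UTF-8 decoder.
lone_surrogate = '\ud800'
try:
    lone_surrogate.encode('utf-8')
except UnicodeEncodeError:
    print('unpaired surrogate is not representable in UTF-8')
```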
>
> As Unicoders have pointed out, these goals appear to be mutually
> contradictory, unless we assume the following corollary, which I
> shall call "requirement 4".
>
> 4) A second required property of f() is that, if any octet of its
> input is not part of a valid UTF-8 substring, then f() must convert
> that octet to a Unicode character string /which cannot possibly appear
> in Unicode plain text/.
>
> It is for reasons of requirement (4) that Lars proposes the
> introduction of 128 BMP codepoints. His intention is that they be
> marked as "reserved - do not use", so that requirement 4 is met.
> Naturally, this proposal has met with a lot of resistance, and almost
> certainly would never get approved by the UTC. Therefore, I propose an
> alternative solution, as follows:
>
> ...
>
> Now everything will work. Unicode is not broken. All UTFs are
> interchangeable as before; Lars's "escape aware" applications can use
> the functions f() and g() instead of UTF-8 transformations; all other
> Unicode applications will retain Lars's data uncorrupted, and he can
> "unescape" it (that is, apply function g()) at the appropriate time to
> recover the original data.
>
> That do?
> Jill
>
Jill, again your solution is ingenious. But would it not work just as
well for Lars's purposes to use, instead of your string of random
characters, just ONE reserved code point followed by U+00xx? Instead of
asking the UTC to allocate a specific code point for this (which it
probably will not do), he can use either U+FFFE or U+FFFF, which "are
intended for process-internal uses, but are not permitted for
interchange." Let's call whichever noncharacter is chosen INVALID.
Of course a problem arises if the original filename consists of a string
which is the UTF-8 representation of INVALID. Does this in fact count as
valid UTF-8? (If it does, an alternative might be to use an unpaired
surrogate for INVALID, because the UTF-8 representation of a surrogate
is invalid UTF-8.) Even if it does, it does not represent valid Unicode,
and so the conversion routine can convert the UTF-8 for INVALID as if it
were three invalid bytes.
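For concreteness, here is a minimal sketch of this scheme in Python 3. It is an illustration under my stated assumptions, not Lars's or Jill's exact code: INVALID is taken to be U+FFFF, each unconvertible byte is escaped as INVALID followed by U+00xx, and the UTF-8 encoding of INVALID itself is deliberately treated as three invalid bytes so that g() cannot confuse it with an escape.

```python
INVALID = '\uffff'  # assumed choice of noncharacter; U+FFFE would also do

def f(octets: bytes) -> str:
    """Map an arbitrary octet stream to a Unicode string, decoding
    valid UTF-8 normally and escaping everything else."""
    out = []
    i = 0
    while i < len(octets):
        # Try to decode the longest valid UTF-8 chunk starting at i.
        for length in (4, 3, 2, 1):
            chunk = octets[i:i + length]
            try:
                ch = chunk.decode('utf-8')
            except UnicodeDecodeError:
                continue
            # Treat the UTF-8 encoding of INVALID itself as invalid
            # bytes, so g() can tell it apart from an escape sequence.
            if INVALID in ch:
                continue
            out.append(ch)
            i += length
            break
        else:
            # No valid chunk here: escape one raw byte as INVALID + U+00xx.
            out.append(INVALID + chr(octets[i]))
            i += 1
    return ''.join(out)

def g(text: str) -> bytes:
    """Inverse of f(): unescape INVALID + U+00xx pairs, re-encode the rest."""
    out = bytearray()
    i = 0
    while i < len(text):
        if text[i] == INVALID and i + 1 < len(text):
            out.append(ord(text[i + 1]))  # recover the raw escaped byte
            i += 2
        else:
            out.extend(text[i].encode('utf-8'))
            i += 1
    return bytes(out)
```

On valid UTF-8 input, f() behaves exactly like a UTF-8 decoder (requirement 2), and g(f(b)) == b holds even when b happens to contain the three bytes 0xEF 0xBF 0xBF, i.e. the UTF-8 form of INVALID.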
-- 
Peter Kirk
peter@qaya.org (personal)
peterkirk@qaya.org (work)
http://www.qaya.org/
This archive was generated by hypermail 2.1.5 : Wed Dec 15 2004 - 10:45:57 CST