From: Peter Kirk (peterkirk@qaya.org)
Date: Wed Dec 15 2004 - 11:35:43 CST
On 15/12/2004 14:36, Arcane Jill wrote:
> Yes, but only if you can have some reasonable assurance that the byte
> sequence emitted by UTF(c,x) (where c is the single reserved codepoint
> you suggest, and x is U+00xx, the value to be escaped expressed as a
> character) will not occur in plain text. This is theoretically
> checkable - the total number of legal Unix locales is large, but
> finite. I don't know how many there are, but, in principle at least,
> one could examine each of them in turn and determine the probability
> of any given byte sequence occurring in each locale's encoding.
You don't need this kind of assurance. Suppose my chosen INVALID
character would normally become <0xpp, 0xqq, 0xrr> according to the
UTF-8 algorithm, and 0xyy is an octet which cannot be interpreted as
part of UTF-8.
My proposed conversion from the NOT-UTF-8 of the filename to NOT-Unicode
would be that 0xyy is mapped to <INVALID, U+00yy> - which can be
represented in NOT-UTF-16 and in NOT-UTF-32 (actually maybe in UTF-16
and UTF-32 if these forms are defined as able to represent the
noncharacter INVALID). And this conversion is reversible, as long as no
one attempts to pass noncharacters through it for any other reason.
Then suppose the NOT-UTF-8 filename includes the octet sequence <0xpp,
0xqq, 0xrr>. A regular UTF-8 conversion would convert this sequence to
INVALID, and 0xyy perhaps to REPLACEMENT CHARACTER. But my alternative
NOT-UTF-8 conversion would (as well as converting 0xyy to <INVALID,
U+00yy>) recognise that the sequence <0xpp, 0xqq, 0xrr> does not
represent a valid Unicode character (but rather a noncharacter), and so
convert it to <INVALID, U+00pp, INVALID, U+00qq, INVALID, U+00rr>. This
conversion is reversible.
I think that meets the requirement that g(f(b)) == b for all b. It also
requires a little extra complexity in my NOT-UTF-8 conversion to reject
conversion of noncharacters.
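
As a rough illustration, the two conversions could be sketched in Python as below. This is only a sketch under stated assumptions: INVALID is taken to be U+FFFF (the post does not say which noncharacter is reserved), and the names f, g, seq_len and is_noncharacter are hypothetical. Here f converts NOT-UTF-8 octets to NOT-Unicode and g converts back.

```python
INVALID = "\uffff"  # assumption: the reserved noncharacter is not named in the post


def is_noncharacter(cp: int) -> bool:
    """True for U+FDD0..U+FDEF and for U+xxFFFE / U+xxFFFF in every plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def seq_len(lead: int) -> int:
    """Length of the UTF-8 sequence a lead octet announces, or 0 if it cannot start one."""
    if lead < 0x80:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0


def f(b: bytes) -> str:
    """Octets -> NOT-Unicode: escape every octet that is not part of an acceptable sequence."""
    out = []
    i = 0
    while i < len(b):
        n = seq_len(b[i])
        ch = None
        if n:
            try:
                ch = b[i:i + n].decode("utf-8")  # strict: rejects overlongs, surrogates, truncation
            except UnicodeDecodeError:
                ch = None
        if ch is not None and not is_noncharacter(ord(ch)):
            out.append(ch)
            i += n
        else:
            # 0xyy becomes <INVALID, U+00yy>; only this one octet is consumed.
            out.append(INVALID + chr(b[i]))
            i += 1
    return "".join(out)


def g(s: str) -> bytes:
    """NOT-Unicode -> octets: <INVALID, U+00yy> becomes the single octet 0xyy."""
    out = bytearray()
    i = 0
    while i < len(s):
        if s[i] == INVALID and i + 1 < len(s) and ord(s[i + 1]) <= 0xFF:
            out.append(ord(s[i + 1]))
            i += 2
        else:
            out += s[i].encode("utf-8")
            i += 1
    return bytes(out)


# A Latin-1 e-acute octet followed by the UTF-8 form of U+FFFF survives the round trip.
name = b"caf\xe9 \xef\xbf\xbf"
assert g(f(name)) == name
```

With these assumptions <0xpp, 0xqq, 0xrr> is <0xEF, 0xBF, 0xBF>, and f turns it into six code points, three INVALID escapes each followed by U+00EF, U+00BF and U+00BF.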
This is not reversible in the other direction: f(g(a)) != a for some
a. For example <INVALID, U+0020> becomes 0x20 in NOT-UTF-8, which of
course is converted back to simply U+0020; or else it becomes <0xpp,
0xqq, 0xrr, 0x20> which is converted back to <INVALID, U+00pp, INVALID,
U+00qq, INVALID, U+00rr, U+0020>. But Lars confirmed that this is not a
requirement.
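
The <INVALID, U+0020> example can be checked with a compact Python sketch. The assumptions are loud ones: INVALID is taken to be U+FFFF, the names f, g and the flag unescape_any are hypothetical, and Python's built-in "surrogateescape" error handler is used purely as a shortcut for locating undecodable octets; it is not part of the proposal. The flag selects between the two behaviours described above for an INVALID that precedes an ASCII character.

```python
INVALID = "\uffff"  # assumption: the reserved noncharacter is not named in the post


def f(b: bytes) -> str:
    """Octets -> NOT-Unicode, escaping bad octets and the octets of INVALID itself."""
    out = []
    for ch in b.decode("utf-8", errors="surrogateescape"):
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:
            # Python marked an undecodable octet 0xyy as U+DCyy; re-express it as <INVALID, U+00yy>.
            out.append(INVALID + chr(cp - 0xDC00))
        elif ch == INVALID:
            # The octets really spelled INVALID, so escape each of them separately.
            out.extend(INVALID + chr(x) for x in ch.encode("utf-8"))
        else:
            out.append(ch)
    return "".join(out)


def g(s: str, unescape_any: bool) -> bytes:
    """NOT-Unicode -> octets. unescape_any decides how <INVALID, ASCII> is treated."""
    lowest = 0x00 if unescape_any else 0x80
    out = bytearray()
    i = 0
    while i < len(s):
        if s[i] == INVALID and i + 1 < len(s) and lowest <= ord(s[i + 1]) <= 0xFF:
            out.append(ord(s[i + 1]))  # <INVALID, U+00yy> -> octet 0xyy
            i += 2
        else:
            out += s[i].encode("utf-8")  # anything else, including a stray INVALID, as UTF-8
            i += 1
    return bytes(out)


a = INVALID + " "  # <INVALID, U+0020>

# First behaviour: the pair collapses to the octet 0x20, which comes back as plain U+0020.
assert g(a, unescape_any=True) == b" "
assert f(g(a, unescape_any=True)) != a

# Second behaviour: INVALID is written out as its own UTF-8 octets, which come back escaped.
assert g(a, unescape_any=False) == b"\xef\xbf\xbf "
assert f(g(a, unescape_any=False)) != a
```

Either way the result differs from the original <INVALID, U+0020>, while g(f(b)) == b still holds for octet strings b.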
-- Peter Kirk peter@qaya.org (personal) peterkirk@qaya.org (work) http://www.qaya.org/