Re: Roundtripping Solved

From: Peter Kirk
Date: Wed Dec 15 2004 - 11:35:43 CST

    On 15/12/2004 14:36, Arcane Jill wrote:

    > Yes, but only if you can have some reasonable assurance that the byte
    > sequence emitted by UTF(c,x) (where c is the single reserved codepoint
    > you suggest, and x is U+00xx, the value to be escaped expressed as a
    > character) will not occur in plain text. This is theoretically
    > checkable - the total number of legal Unix locales is large, but
    > finite. I don't know how many there are, but, in principle at least,
    > one could examine each of them in turn and determine the probability
    > of any given byte sequence occurring in each locale's encoding.

    You don't need this kind of assurance. Suppose my chosen INVALID
    character would normally become <0xpp, 0xqq, 0xrr> according to the
    UTF-8 algorithm, and 0xyy is an octet which cannot be interpreted as
    part of UTF-8.

    My proposed conversion from the NOT-UTF-8 of the filename to NOT-Unicode
    would be that 0xyy is mapped to <INVALID, U+00yy> - which can be
    represented in NOT-UTF-16 and in NOT-UTF-32 (actually maybe in UTF-16
    and UTF-32 if these forms are defined as able to represent the
    noncharacter INVALID). And this conversion is reversible, as long as no
    one attempts to pass noncharacters through it for any other reason.

    Then suppose the NOT-UTF-8 filename includes the octet sequence <0xpp,
    0xqq, 0xrr>. A regular UTF-8 conversion would convert this sequence to
    INVALID, and 0xyy perhaps to REPLACEMENT CHARACTER. But my alternative
    NOT-UTF-8 conversion would (as well as converting 0xyy to <INVALID,
    U+00yy>) recognise that the sequence <0xpp, 0xqq, 0xrr> does not
    represent a valid Unicode character (but rather a noncharacter), and so
    convert it to <INVALID, U+00pp, INVALID, U+00qq, INVALID, U+00rr>. This
    conversion is reversible.

    I think that meets the requirement that g(f(b)) == b for all octet
    sequences b, where f converts NOT-UTF-8 to NOT-Unicode and g converts
    back. It also requires a little extra complexity in my NOT-UTF-8
    conversion, which must refuse to decode noncharacters directly.
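
    As a rough illustration of the scheme described above, here is a
    minimal Python sketch. It makes several assumptions that the proposal
    itself does not fix: INVALID is taken to be U+FDD0 (so <0xpp, 0xqq,
    0xrr> is <0xEF, 0xB7, 0x90>), every Unicode noncharacter is rejected
    rather than only INVALID, and the helper names are hypothetical.

```python
INVALID = "\uFDD0"  # assumption: any single reserved noncharacter would do


def is_noncharacter(ch: str) -> bool:
    """True for U+FDD0..U+FDEF and for U+nFFFE / U+nFFFF in any plane."""
    cp = ord(ch)
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def f(octets: bytes) -> str:
    """NOT-UTF-8 -> NOT-Unicode: escape each undecodable octet 0xyy as <INVALID, U+00yy>."""
    out = []
    i = 0
    while i < len(octets):
        ch = None
        # Try to read one well-formed UTF-8 sequence of 1 to 4 octets at position i.
        for n in (1, 2, 3, 4):
            chunk = octets[i:i + n]
            if len(chunk) < n:
                break
            try:
                ch = chunk.decode("utf-8")  # strict: rejects overlongs and surrogates
                break
            except UnicodeDecodeError:
                continue
        if ch is None:
            chunk = octets[i:i + 1]  # undecodable octet: escape just this one
        if ch is None or is_noncharacter(ch):
            # Noncharacters are refused too, so their octets are escaped one by one.
            out.extend(INVALID + chr(o) for o in chunk)
        else:
            out.append(ch)
        i += len(chunk)
    return "".join(out)


def g(text: str) -> bytes:
    """NOT-Unicode -> NOT-UTF-8: turn <INVALID, U+00yy> back into the octet 0xyy."""
    out = bytearray()
    i = 0
    while i < len(text):
        if text[i] == INVALID and i + 1 < len(text) and ord(text[i + 1]) <= 0xFF:
            out.append(ord(text[i + 1]))
            i += 2
        else:
            out += text[i].encode("utf-8")
            i += 1
    return bytes(out)


# A Latin-1 octet and the UTF-8 form of INVALID both survive the round trip.
raw = b"caf\xe9 \xef\xb7\x90 ok"
assert g(f(raw)) == raw
```

    Because f never emits a bare noncharacter, every INVALID in its output
    starts an escape pair, which is what lets g undo it unambiguously.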

    This is not reversible in the other direction, for f(g(a)) != a for
    some a. For example <INVALID, U+0020> becomes 0x20 in NOT-UTF-8, which
    of course is converted back to simply U+0020; or else it becomes <0xpp,
    0xqq, 0xrr, 0x20>, which is converted back to <INVALID, U+00pp, INVALID,
    U+00qq, INVALID, U+00rr, U+0020>. But Lars confirmed that this is not a

    Peter Kirk