Re: Roundtripping Solved

From: Arcane Jill (arcanejill@ramonsky.com)
Date: Wed Dec 15 2004 - 08:36:40 CST

    Yes, but only if you can have some reasonable assurance that the byte
    sequence emitted by UTF(c,x) (where c is the single reserved codepoint you
    suggest, and x is U+00xx, the value to be escaped expressed as a character)
    will not occur in plain text. This is theoretically checkable - the total
    number of legal Unix locales is large, but finite. I don't know how many
    there are, but, in principle at least, one could examine each of them in
    turn and determine the probability of any given byte sequence occurring in
    each locale's encoding.

    Another good choice for c would be U+001A, preserving the original meaning
    of the old ASCII SUB character. My understanding is that, back in the days
    of teletypes, SUB originally caused the following character to be displayed
    in red ink instead of black ink, until smarter printers came along, after
    which time SUB caused the following character to be selected from an
    alternative character set. Of course, all that changed when the 8th bit
    started to be used. Now the C0 control codepoints (apart from TAB, CR, LF
    and FF) are nothing but an ancient historical legacy which (in my opinion)
    could be re-used for something else. (That won't happen, of course, because
    of stability guarantees).

    But it's the "knowing" part that is the problem. Can you really "know" that
    any given byte sequence won't appear in plain text? That's the only
    reason I thought of pushing the probability of incorrect identification down
    to an astronomically low level.
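
    To make the concern concrete, here is a minimal sketch in Python of the
    single-codepoint scheme under discussion. It is only an illustration under
    stated assumptions: c is taken to be U+001A, the text is assumed to be
    UTF-8, and the names ESC, decode_with_escapes and encode_with_escapes are
    hypothetical, not anything Lars or Peter specified.

```python
# Assumption: the reserved escape code point c is U+001A (SUB).
ESC = "\u001a"


def decode_with_escapes(data: bytes) -> str:
    """Decode UTF-8, turning each undecodable byte 0xNN into ESC + U+00NN."""
    out = []
    i = 0
    while i < len(data):
        # Try to decode a valid UTF-8 sequence of 1 to 4 bytes starting at i.
        for n in (1, 2, 3, 4):
            try:
                out.append(data[i:i + n].decode("utf-8"))
                i += n
                break
            except UnicodeDecodeError:
                continue
        else:
            # No valid sequence starts here: escape this single byte.
            out.append(ESC + chr(data[i]))
            i += 1
    return "".join(out)


def encode_with_escapes(text: str) -> bytes:
    """Encode to UTF-8, turning ESC + U+0080..U+00FF back into one raw byte."""
    out = bytearray()
    i = 0
    while i < len(text):
        is_escape = (
            text[i] == ESC
            and i + 1 < len(text)
            and 0x80 <= ord(text[i + 1]) <= 0xFF
        )
        if is_escape:
            out.append(ord(text[i + 1]))
            i += 2
        else:
            out += text[i].encode("utf-8")
            i += 1
    return bytes(out)


# A stray Latin-1 byte (0xE9) next to valid UTF-8 survives the round trip.
mixed = b"caf\xe9 \xc3\xa9"
roundtrip_ok = encode_with_escapes(decode_with_escapes(mixed)) == mixed

# Plain text that already contains c followed by U+00E9 does not: it is
# perfectly valid UTF-8 on input, but comes back as the single byte 0xE9.
collision = b"\x1a\xc3\xa9"
collision_result = encode_with_escapes(decode_with_escapes(collision))
```

    The first input round-trips, but the second is misidentified as an
    escape, which is exactly the case one would need to "know" never occurs
    in real text.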

    Jill

    -----Original Message-----
    From: Peter Kirk [mailto:peterkirk@qaya.org]
    Sent: 15 December 2004 12:54
    To: Arcane Jill
    Cc: Unicode
    Subject: Re: Roundtripping Solved

    But would it not work just as
    well for Lars' purposes to use, instead of your string of random
    characters, just ONE reserved code point followed by U+0xx?


