From: Arcane Jill (arcanejill@ramonsky.com)
Date: Wed Dec 15 2004 - 08:36:40 CST
Yes, but only if you can have some reasonable assurance that the byte
sequence emitted by UTF(c,x) (where c is the single reserved codepoint you
suggest, and x is U+00xx, the value to be escaped expressed as a character)
will not occur in plain text. This is theoretically checkable - the total
number of legal Unix locales is large, but finite. I don't know how many
there are, but, in principle at least, one could examine each of them in
turn and determine the probability of any given byte sequence occurring in
each locale's encoding.
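To make the concern concrete, here is a minimal Python sketch of the single-codepoint escape and of the collision it risks. It is only an illustration under stated assumptions, not anything proposed in this thread: c is taken to be U+001A, the input is assumed to be meant as UTF-8, and the names `to_text`, `to_bytes` and the handler name `escape-c` are hypothetical.

```python
import codecs

ESCAPE = "\u001a"  # assumption: SUB stands in for the reserved codepoint c


def _escape_invalid(err):
    """Error handler: turn each undecodable byte 0xNN into c + U+00NN."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    return "".join(ESCAPE + chr(b) for b in bad), err.end


codecs.register_error("escape-c", _escape_invalid)  # hypothetical handler name


def to_text(raw: bytes) -> str:
    """Decode as UTF-8, escaping any bytes that are not valid UTF-8."""
    return raw.decode("utf-8", errors="escape-c")


def to_bytes(text: str) -> bytes:
    """Reverse to_text: c followed by U+00NN becomes the single byte 0xNN."""
    out = bytearray()
    i = 0
    while i < len(text):
        if text[i] == ESCAPE and i + 1 < len(text) and ord(text[i + 1]) <= 0xFF:
            out.append(ord(text[i + 1]))
            i += 2
        else:
            out += text[i].encode("utf-8")
            i += 1
    return bytes(out)


# A Latin-1 byte string that is not valid UTF-8 survives the round trip.
latin1_name = b"caf\xe9"
assert to_bytes(to_text(latin1_name)) == latin1_name

# Plain text that already contains UTF(c,x) as valid UTF-8 does not.
collision = (ESCAPE + "\u00e9").encode("utf-8")
assert to_bytes(to_text(collision)) != collision
```

The second assertion is the whole difficulty: the bytes of UTF(c,x) are themselves valid UTF-8, so if they ever occur in real plain text the reverse conversion silently turns them into a different byte string.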
Another good choice for c would be U+001A, preserving the original meaning
of the old ASCII SUB character. My understanding is that, back in the days
of teletypes, SUB originally caused the following character to be displayed
in red ink instead of black ink, until smarter printers came along, after
which time SUB caused the following character to be selected from an
alternative character set. Of course, all that changed when the 8th bit
started to be used. Now the C0 control codepoints (apart from TAB, CR, LF
and FF) are nothing but an ancient historical legacy which (in my opinion)
could be re-used for something else. (That won't happen, of course, because
of stability guarantees.)
But it's the "knowing" part that's the problem. Can you really "know" that
any given byte sequence won't appear in plain text? That's the only
reason I thought of pushing the probability of incorrect identification down
to an astronomically low level.
Jill
-----Original Message-----
From: Peter Kirk [mailto:peterkirk@qaya.org]
Sent: 15 December 2004 12:54
To: Arcane Jill
Cc: Unicode
Subject: Re: Roundtripping Solved
But would it not work just as
well for Lars' purposes to use, instead of your string of random
characters, just ONE reserved code point followed by U+0xx?