From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Fri Dec 17 2004 - 12:19:39 CST
Is it roundtripping or transfer-encoding (was: RE: Roundtripping Solved)
Lars Kristan wrote:
> I wrote:
> > But note that any occurrence of U+EE80 to U+EEFF in the original
> > NON-UTF-8 "text" is escaped, even though these are valid Unicode.
> > However, choosing U+EE80 to U+EEFF is not a problem because these PUAs
> > are very unlikely to be present in valid source texts, in the absence
> > of a prior PUA agreement.
> And would be no problem at all if new codepoints would be assigned for
> this purpose.
No, it won't happen, because Unicode and ISO/IEC 10646 already state that
they encode abstract characters. What you are asking for is that Unicode
allocate a new block of 128 codepoints for non-characters.
There are already enough non-characters in Unicode, reserved specifically
for internal use and not for interchange.
Unfortunately, all the remaining codepoints are unassigned, meaning that a
conforming application receiving them must handle them as if they were
characters. These codepoints are already valid, and the stability of the
UTFs requires that they be convertible between all UTFs (encoding forms or
encoding schemes), with a unique mapping in all directions for all valid
code points.
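
The unique-mapping property can be checked with a small sketch. This is only
an illustration using Python's built-in codecs; the sample code points and
the helper name converts_losslessly are my own choices, and U+0378 is assumed
to be unassigned in the Unicode version at hand.

```python
# Sample code points: an assigned letter, a BMP PUA, a plane-16 PUA, and
# U+0378 (assumed unassigned here; check against your Unicode version).
SAMPLES = ["A", "\uee80", "\U0010fffd", "\u0378"]
UTFS = ["utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be"]


def converts_losslessly(ch: str) -> bool:
    """Return True if ch survives conversion between every pair of UTFs."""
    for src in UTFS:
        encoded = ch.encode(src)
        for dst in UTFS:
            # Decode from one UTF, re-encode in another, decode again.
            if encoded.decode(src).encode(dst).decode(dst) != ch:
                return False
    return True


if __name__ == "__main__":
    for ch in SAMPLES:
        print(f"U+{ord(ch):04X}", converts_losslessly(ch))
```

The codecs do not care whether a code point is assigned, private-use, or
unassigned: each one has exactly one representation in each UTF.
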
This ultimately means that you want these codepoints to be recognized as
characters sometimes, but not when you perform the conversion with a
transfer-encoding-syntax. A transfer-encoding-syntax must also not modify
the codepoints represented by an encoding scheme (or charset), and UTFs
also have the property of having a single representation for each of these
codepoints. (Note that SCSU is not a UTF, because it allows multiple
representations of the same codepoints; it is just an encoding scheme,
although it preserves the uniqueness of the codepoints it represents.)
I really don't think that Unicode needs to allocate codepoints for
non-characters, because it would also defeat your requirement that all
conforming applications should accept non-characters (and you already stated
that you didn't want this to happen). So you are left with using only
codepoints already assigned to characters.
That is where transfer-encoding-syntaxes work perfectly: they map any
characters or non-characters to a portable string of assigned characters.
They are not required to change the semantics of the transported characters,
but they can transform a *character* present in the source string (of
characters and non-characters) into a *sequence of characters* (yes, this is
called "escaping").
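
A minimal sketch of such an escaping TES follows. It is not the exact scheme
discussed in this thread, only one possible instance: the mapping of byte b
(0x80..0xFF) to U+EE00 + b, and the names escape_bytes and unescape_text, are
assumptions made for illustration. Bytes that are not valid UTF-8 are mapped
to U+EE80..U+EEFF, and a valid character that already lies in that range is
escaped byte by byte, so one *character* becomes a *sequence of characters*.

```python
ESC_FIRST, ESC_LAST = 0xEE80, 0xEEFF
ESC_BASE = 0xEE00  # assumption: byte b in 0x80..0xFF <-> code point ESC_BASE + b


def escape_bytes(data: bytes) -> str:
    """Map arbitrary bytes to a valid Unicode string, reversibly."""
    out = []
    i = 0
    while i < len(data):
        # Find the shortest valid UTF-8 sequence starting at position i.
        for n in (1, 2, 3, 4):
            chunk = data[i:i + n]
            try:
                ch = chunk.decode("utf-8")
            except UnicodeDecodeError:
                continue
            break
        else:
            # No valid sequence starts here: escape this single byte.
            ch, chunk = None, data[i:i + 1]
        if ch is None or ESC_FIRST <= ord(ch) <= ESC_LAST:
            # Invalid byte, or a valid character that collides with the
            # escape range: emit one escape code point per byte.
            out.extend(chr(ESC_BASE + b) for b in chunk)
        else:
            out.append(ch)
        i += len(chunk)
    return "".join(out)


def unescape_text(text: str) -> bytes:
    """Inverse of escape_bytes: restore the original byte string."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if ESC_FIRST <= cp <= ESC_LAST:
            out.append(cp - ESC_BASE)
        else:
            out.extend(ch.encode("utf-8"))
    return bytes(out)


if __name__ == "__main__":
    raw = b"abc\xff\xc3\xa9" + "\uee80".encode("utf-8")
    assert unescape_text(escape_bytes(raw)) == raw
```

In this sketch the stray byte 0xFF becomes U+EEFF, while the valid source
character U+EE80 becomes the three characters U+EEEE U+EEBA U+EE80. Both
sides must agree on the convention beforehand for the round trip to work.
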
If you want to strictly limit the cases where valid characters get escaped,
the best option you have in Unicode is to use the PUAs that are least likely
to occur in original strings (of characters and non-characters), in the
absence of an explicit agreement.
Note that a Transfer-Encoding-Syntax, to be usable, requires an explicit
mutual agreement to allow the conversion in either direction. The existence
of such a mutual agreement is exactly what PUAs were created for, so I don't
see why you should not use them, given that all conforming Unicode
applications must treat PUAs as valid characters and not as non-characters
(these applications may have restrictions on which valid characters they
accept, but then don't expect them to handle all possible internationalized
plain texts).
Anyway, it does not matter if the PUAs you choose for your TES come into
conflict with PUAs used in a renderer or font: the latter are *other*
interfaces, with their own private agreement about their usage. A renderer
which does not explicitly know the status of a source PUA must not
interpret it as if it obeyed the same agreement as the one between the
renderer and a font. Private agreements are not implicitly transferable and
are not agreed automatically across distinct interfaces (this requires a
negotiation protocol, and some check in the software to see what needs to be
done with conflicting PUAs obeying distinct agreements).
[ The PUAs present in font tables are only there to allow renderers to
access font tables, for things like the internal conversion of source strings
of code points to strings of more complex glyphs (such as ligatures or
contextual form variants). No such PUA will leave the working domain of the
renderer, so a renderer should treat all PUAs present in a source string as
if they were unknown/unassigned but valid characters, with no glyph (the
renderer should then display them with an alternate form such as a default
square replacement glyph, or a highlighted box showing the hex code of the
PUA, or it may even "silently" ignore these PUAs in the rendered graphic,
signaling elsewhere to the user that not all characters could be rendered
graphically -- a conforming signal can be an alert dialog, a text in a
status bar, a log message on the console, an audible beep, a flashing
titlebar, a status indicator returned from its API, a warning message drawn
in the margins of the rendered document,...). ]
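
As a rough sketch of the fallback described in the bracketed note, a text-only
"renderer" could replace each source PUA with a box showing its hex code and
return a status indicator from its API. The function name render_fallback and
the bracketed "[U+XXXX]" form are hypothetical; only unicodedata.category and
its "Co" (private use) value come from the standard library.

```python
import unicodedata


def render_fallback(text: str) -> tuple[str, bool]:
    """Return (displayable_text, had_unrenderable_characters)."""
    out = []
    unrenderable = False
    for ch in text:
        if unicodedata.category(ch) == "Co":
            # Source PUA with no agreed meaning: show its hex code instead
            # of guessing a glyph from some unrelated font-internal usage.
            out.append(f"[U+{ord(ch):04X}]")
            unrenderable = True
        else:
            out.append(ch)
    return "".join(out), unrenderable


if __name__ == "__main__":
    shown, warn = render_fallback("a\uee80b")
    print(shown)
    if warn:
        print("warning: some characters could not be rendered")
```
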
If security is a concern, then choosing PUAs is also the best option, because
the most critical systems will be prepared to handle PUAs, but not valid
non-PUA characters, which they will let pass through by default (notably in
the absence of an explicit agreement or specification for acceptable input
strings). By contrast, a process concerned with security may choose to filter
out or substitute, by default, all possibly conflicting input PUAs.
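
A security-conscious input filter along those lines might look like the
following sketch. The policy names ("substitute", "strip", "reject") and the
function names are illustrative assumptions; the three ranges are the
private-use areas of the BMP and of planes 15 and 16.

```python
PUA_RANGES = (
    (0xE000, 0xF8FF),      # BMP private use area
    (0xF0000, 0xFFFFD),    # plane 15 private use
    (0x100000, 0x10FFFD),  # plane 16 private use
)


def is_pua(ch: str) -> bool:
    """Return True if ch is a private-use code point."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in PUA_RANGES)


def sanitize(text: str, policy: str = "substitute") -> str:
    """Apply a default policy to input PUAs that have no known agreement."""
    if policy == "reject":
        if any(is_pua(ch) for ch in text):
            raise ValueError("input contains private-use characters")
        return text
    if policy == "strip":
        return "".join(ch for ch in text if not is_pua(ch))
    if policy == "substitute":
        # U+FFFD REPLACEMENT CHARACTER stands in for each filtered PUA.
        return "".join("\ufffd" if is_pua(ch) else ch for ch in text)
    raise ValueError(f"unknown policy: {policy!r}")


if __name__ == "__main__":
    print(sanitize("a\uee80b"))           # substitutes the PUA
    print(sanitize("a\uee80b", "strip"))  # drops the PUA
```
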
There are tons of existing TES used every day in many applications, and none
of them required the allocation of distinct codepoints for the encoded
strings they generate. Why do you want new characters for this mapping? It is
not necessary, as demonstrated by all the other existing TES...
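
Two everyday examples, shown here with Python's standard library purely as an
illustration: percent-encoding and Base64 both produce strings made only of
already-assigned ASCII characters, and percent-encoding escapes its own escape
character ("%" becomes "%25"), just as a PUA-based TES would escape source
PUAs. The helper names are my own.

```python
import base64
from urllib.parse import quote, unquote


def percent_roundtrip(text: str) -> tuple[str, str]:
    """Percent-encode text, then decode it again."""
    escaped = quote(text)  # '%' itself is escaped as %25
    return escaped, unquote(escaped)


def base64_roundtrip(data: bytes) -> tuple[bytes, bytes]:
    """Base64-encode arbitrary bytes, then decode them again."""
    encoded = base64.b64encode(data)
    return encoded, base64.b64decode(encoded)


if __name__ == "__main__":
    print(percent_roundtrip("100% \u00e9"))
    print(base64_roundtrip(b"\xff\xfe"))
```
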
This archive was generated by hypermail 2.1.5 : Fri Dec 17 2004 - 12:36:08 CST