Re: Is it roundtripping or transfer-encoding (was: RE: Roundtripping Solved)

From: Philippe Verdy (
Date: Fri Dec 17 2004 - 12:19:39 CST

    Lars Kristan wrote:
    > I wrote:
    > > But note that any occurrence of U+EE80 to U+EEFF in the original
    > > NON-UTF-8 "text" is escaped, even though they are valid Unicode.
    > > However, choosing U+EE80 to U+EEFF is not a problem because these PUAs
    > > are very unlikely to be present in valid source texts, in the absence
    > > of a prior PUA-agreement.
    > And would be no problem at all if new codepoints would be assigned for
    > this purpose.

    No, it won't happen, because Unicode and ISO/IEC 10646 already state that
    they encode abstract characters. What you want is for Unicode to allocate
    a new block of 128 codepoints for non-characters.
    Unicode already has enough non-characters, set aside specifically for
    internal use and not for interchange.
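
    For reference, the existing non-characters can be listed with a short
    Python sketch. The name `is_noncharacter` is only illustrative; the ranges
    are the ones the Unicode Standard reserves (U+FDD0..U+FDEF, plus the last
    two code points of each of the 17 planes).

```python
def is_noncharacter(cp: int) -> bool:
    """Return True if cp is one of Unicode's reserved non-characters."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    # U+FDD0..U+FDEF, plus U+xxFFFE and U+xxFFFF at the end of every plane.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


NONCHARACTERS = [cp for cp in range(0x110000) if is_noncharacter(cp)]
print(len(NONCHARACTERS))  # 66
```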

    Unfortunately, all the remaining codepoints are unassigned, meaning that a
    conforming application receiving them must handle them as if they were
    characters. These codepoints are already valid, and the stability of the
    UTFs requires that they be convertible between all UTFs (encoding forms or
    encoding schemes), with a unique mapping in every direction for every
    valid code point.
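
    The convertibility requirement can be checked with a small Python sketch
    that relies only on the built-in codecs. The helper name is made up for
    illustration, and U+0378 is picked on the assumption that it is still
    unassigned; the round trip itself does not depend on that. Surrogates are
    left out on purpose, since they are not valid scalar values.

```python
def roundtrips_through_all_utfs(cp: int) -> bool:
    """Check that one code point survives UTF-8 -> UTF-16 -> UTF-32 -> UTF-8."""
    original = chr(cp)
    utf8 = original.encode("utf-8")
    utf16 = utf8.decode("utf-8").encode("utf-16-be")
    utf32 = utf16.decode("utf-16-be").encode("utf-32-be")
    back = utf32.decode("utf-32-be").encode("utf-8")
    return back == utf8 and back.decode("utf-8") == original


# Assigned, (assumed) unassigned, private use, and a non-character:
# every one of them converts the same way, with no loss.
SAMPLES = (0x0041, 0x0378, 0xEE80, 0x10FFFF)
print(all(roundtrips_through_all_utfs(cp) for cp in SAMPLES))  # True
```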

    In the end, this means that you want these codepoints to be recognized as
    characters sometimes, but not when you perform the conversion with a
    transfer-encoding-syntax. A transfer-encoding-syntax must also not modify
    the codepoints represented by an encoding scheme (or charset), and UTFs
    also have the property of giving each of these codepoints a single
    representation. (Note that SCSU is not a UTF, because it allows multiple
    representations of the same codepoints; it is just an encoding scheme,
    although it preserves the uniqueness of the codepoints it represents.)
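
    The single-representation property is easy to observe with UTF-8: a
    strict decoder refuses any longer spelling of a codepoint. This is only a
    sketch using Python's built-in strict decoder; the helper name is
    illustrative.

```python
def decodes_as_utf8(data: bytes) -> bool:
    """Return True if data is well-formed UTF-8 for a strict decoder."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# U+002F '/' has exactly one UTF-8 form, the single byte 0x2F.
assert "/".encode("utf-8") == b"\x2f"
# "Overlong" two-byte and three-byte spellings of U+002F are ill-formed.
assert not decodes_as_utf8(b"\xc0\xaf")
assert not decodes_as_utf8(b"\xe0\x80\xaf")
```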

    I really don't think that Unicode needs to allocate codepoints for
    non-characters, because it would also defeat your requirement that all
    conforming applications should accept non-characters (and you already stated
    that you didn't want this to happen). So you are left with using only
    codepoints already assigned to characters.

    That is exactly the job transfer-encoding-syntaxes do well: they map any
    characters or non-characters to a portable string of assigned characters.
    They are not required to change the semantics of the transported characters,
    but they can transform a *character* present in the source string (of
    characters and non-characters) into a *sequence of characters* (yes, this
    is called "escaping").

    If you want to strictly limit the cases where valid characters get
    escaped, the best option you have in Unicode is to use PUAs, which are
    the codepoints least likely to occur in original strings (of characters
    and non-characters) in the absence of an explicit agreement.
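
    A minimal Python sketch of such a TES follows. It rests on assumptions
    that are not spelled out in this thread: each byte that is not part of
    well-formed UTF-8 is mapped to U+EE00 plus the byte value, and a genuine
    U+EE80..U+EEFF in the source is escaped by applying the same mapping to
    each of its three UTF-8 bytes. The function names are illustrative only.

```python
ESC_FIRST, ESC_LAST = 0xEE80, 0xEEFF  # PUA range reserved for escapes
ESC_BASE = 0xEE00                     # byte 0x80..0xFF -> U+EE80..U+EEFF


def _escape(raw: bytes) -> str:
    # Only ever called with bytes >= 0x80, so the result stays in the range.
    return "".join(chr(ESC_BASE + b) for b in raw)


def bytes_to_text(data: bytes) -> str:
    """Decode possibly invalid UTF-8, escaping what cannot pass unchanged."""
    out = []
    i = 0
    while i < len(data):
        for n in (1, 2, 3, 4):  # shortest well-formed sequence starting at i
            chunk = data[i:i + n]
            try:
                ch = chunk.decode("utf-8")
            except UnicodeDecodeError:
                continue
            # A valid character inside the escape range must itself be escaped.
            escaped = ESC_FIRST <= ord(ch) <= ESC_LAST
            out.append(_escape(chunk) if escaped else ch)
            i += len(chunk)
            break
        else:
            out.append(_escape(data[i:i + 1]))  # invalid byte
            i += 1
    return "".join(out)


def text_to_bytes(text: str) -> bytes:
    """Inverse of bytes_to_text: escape codepoints become raw bytes again."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if ESC_FIRST <= cp <= ESC_LAST:
            out.append(cp - ESC_BASE)
        else:
            out += ch.encode("utf-8")
    return bytes(out)


print(ascii(bytes_to_text(b"caf\xe9")))  # 'caf\ueee9'
```

    Under these assumptions any byte string survives the round trip, and the
    only valid characters that ever get escaped are the 128 PUAs chosen for
    the escapes themselves.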

    Note that a Transfer-Encoding-Syntax, to be usable, requires an explicit
    mutual agreement to allow the conversion in either direction. Supporting
    such mutual agreements is exactly what PUAs were created for, so I don't
    see why you should not use them, given that all conforming Unicode
    applications must treat PUAs as valid characters and not as non-characters
    (these applications may have restrictions on which valid characters they
    accept, but then don't expect them to handle all possible internationalized
    plain texts).

    Anyway, it does not matter if the PUAs you choose for your TES come into
    conflict with PUAs used in a renderer or font: the latter belong to *other*
    interfaces, with their own private agreement about their usage. A renderer
    which does not explicitly know the status of a source PUA must not
    interpret it as if it obeyed the same agreement as the one between the
    renderer and a font. Private agreements are not implicitly transferable and
    are not automatically agreed across distinct interfaces (this requires a
    negotiation protocol, and some check in the software to see what needs to
    be done with conflicting PUAs that obey distinct agreements).

    [ The PUAs present in font tables are only there to allow renderers to
    access font tables, for things like the internal conversion of source
    strings of code points to strings of more complex glyphs (such as ligatures
    or contextual form variants). No PUA will leave the working domain of the
    renderer, so a renderer should treat all PUAs present in a source string as
    if they were unknown/unassigned but valid characters, with no glyph. The
    renderer should then display them with an alternate form such as a default
    square replacement glyph, or a highlighted box showing the hex code of the
    PUA, or it may even "silently" ignore these PUAs in the rendered graphic
    and signal elsewhere to the user that not all characters could be rendered
    graphically -- a conforming signal can be an alert dialog, a text in a
    status bar, a log message on the console, an audible beep, a flashing
    titlebar, a status indicator returned from its API, a warning message drawn
    in the margins of the rendered document,... ]
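
    As a rough illustration of the "box showing the hex code" fallback, here
    is a Python sketch. The function name and the "[U+XXXX]" notation are
    assumptions made for the example; the count it returns stands in for the
    status indicator a real renderer API might report.

```python
import unicodedata


def visible_fallback(text: str) -> tuple[str, int]:
    """Replace each private-use character with a visible "[U+XXXX]" marker.

    Returns the displayable string and the number of characters that had
    no glyph, so the caller can signal the problem to the user.
    """
    shown = []
    unrendered = 0
    for ch in text:
        if unicodedata.category(ch) == "Co":  # Co = private use
            shown.append(f"[U+{ord(ch):04X}]")
            unrendered += 1
        else:
            shown.append(ch)
    return "".join(shown), unrendered


print(visible_fallback("a\uee80b"))  # ('a[U+EE80]b', 1)
```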

    If security is a concern, then choosing PUAs is also the best option:
    the most critical systems will be prepared to handle PUAs, since a process
    concerned with security may choose to filter out or substitute, by default,
    all possibly conflicting input PUAs. They will not be prepared in the same
    way for valid non-PUA characters, which they will let pass through by
    default (notably in the absence of an explicit agreement or specification
    for acceptable input strings).
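
    Such a default filter could look like the following Python sketch. The
    function name, the `agreed` parameter (the PUA codepoints covered by an
    explicit agreement) and the choice of U+FFFD as the substitute are all
    assumptions made for illustration.

```python
PUA_RANGES = (
    (0xE000, 0xF8FF),      # BMP private use area
    (0xF0000, 0xFFFFD),    # plane 15
    (0x100000, 0x10FFFD),  # plane 16
)


def is_private_use(cp: int) -> bool:
    return any(first <= cp <= last for first, last in PUA_RANGES)


def filter_private_use(text, agreed=frozenset(), replacement="\ufffd"):
    """Substitute every PUA that is not covered by an explicit agreement.

    Pass replacement="" to drop such characters instead of substituting.
    """
    return "".join(
        replacement if is_private_use(ord(ch)) and ord(ch) not in agreed else ch
        for ch in text
    )


print(ascii(filter_private_use("a\uee80b")))  # 'a\ufffdb'
```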

    There are tons of existing TESs in everyday use in many applications, and
    none of them required the allocation of distinct codepoints for the encoded
    strings they generate. Why do you want new characters for this mapping? It
    is not necessary, as all the other existing TESs demonstrate...
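
    Two everyday examples, assuming Base64 and URL percent-encoding count
    among the TESs meant here, shown with the Python standard library. Both
    produce nothing but ordinary ASCII characters, and percent-encoding even
    has to escape its own valid escape character '%'.

```python
import base64
from urllib.parse import quote_from_bytes, unquote_to_bytes

# One byte (0xE9) that is not valid UTF-8, plus a literal '%'.
raw = b"caf\xe9 100%"

b64 = base64.b64encode(raw).decode("ascii")
pct = quote_from_bytes(raw)

print(pct)  # caf%E9%20100%25

# Both outputs use only already-assigned characters, and both are reversible.
assert b64.isascii() and pct.isascii()
assert base64.b64decode(b64) == raw
assert unquote_to_bytes(pct) == raw
```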
