Is it roundtripping or transfer-encoding (was: RE: Roundtripping Solved)

From: Lars Kristan (lars.kristan@hermes.si)
Date: Fri Dec 17 2004 - 10:07:29 CST

    Philippe Verdy wrote:

    > What Lars wants has a name: it's a "transfer-encoding-syntax", to allow
    > transporting any code unit sequences into a more restricted environment.
    > This is not a new thing, but this is not specified by Unicode.

    Good. It is a known thing. Which also means we can use previous experience
    with transfer-encoding-syntaxes: for example, what the security
    implications are and how they can be dealt with.

    > But note that any occurrence of U+EE80 to U+EEFF in the original
    > NON-UTF-8 "text" is escaped, even though these are valid Unicode.
    > However, choosing U+EE80 to U+EEFF is not a problem because these PUAs
    > are very unlikely to be present in valid source texts, in the absence of
    > a prior PUA agreement.

    And it would be no problem at all if new codepoints were assigned for this
    purpose.

    > Remember that this is only a Transform-Encoding-Syntax, not a new
    > encoding. It does not require ANY new codepoint allocation by Unicode!

    But that does not mean there are no benefits in doing so. Escape characters
    are always a pain, like your example of """. OK, the next step is to assign
    a new codepoint for this purpose. SBCSs had little room; the need was not
    recognised early enough, and even if it had been, people would have used the
    escape character anyway, simply because they liked the way it displayed.
    With (fewer than) 255 glyphs to choose from, people were bound to use them
    all. But Unicode has A LOT of codepoints, so it makes sense to do something
    like that.

    At some point, someone thought of mapping the bytes in invalid sequences to
    codepoints. They didn't know what to call them (or perhaps called them
    replacement characters), and the UTC thought such codepoints shouldn't be
    assigned. But if we call it a "Transform-Encoding-Syntax" instead of a
    "conversion", then they should be called "escape characters" instead of
    "replacement characters". And for the first time in history, you have an
    escaping method with more than one escape character. Very efficient. Very
    compact. Very straightforward. And Unicode is the one encoding that has both
    enough codepoints to afford it and, at the same time, more need for it than
    any other encoding.
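
    The byte-to-escape-character mapping discussed above can be sketched
    roughly as follows. This is only an illustration under stated assumptions,
    not the scheme as specified by anyone in this thread: it assumes byte b
    (0x80..0xFF) is escaped as U+EE00 + b, which lands in the U+EE80..U+EEFF
    range Philippe mentions, and that a well-formed UTF-8 encoding of one of
    those PUA codepoints is itself escaped byte by byte. The names
    bytes_to_text and text_to_bytes are hypothetical.

```python
# Assumption: the escape for byte b (0x80..0xFF) is U+EE00 + b,
# giving the 128 codepoints U+EE80..U+EEFF.
ESC_BASE = 0xEE00


def _is_escape(ch: str) -> bool:
    return 0xEE80 <= ord(ch) <= 0xEEFF


def _expected_length(lead: int) -> int:
    """Length of the UTF-8 sequence a lead byte announces, or 0 if it cannot lead."""
    if lead < 0x80:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0


def bytes_to_text(data: bytes) -> str:
    """Decode bytes as UTF-8, turning every byte that is not part of a
    well-formed sequence into one escape character."""
    out = []
    i = 0
    while i < len(data):
        length = _expected_length(data[i])
        ch = None
        if length:
            try:
                # Strict decoding rejects overlong forms, surrogates and
                # truncated sequences.
                ch = data[i:i + length].decode("utf-8")
            except UnicodeDecodeError:
                ch = None
        if ch is not None and len(ch) == 1 and not _is_escape(ch):
            out.append(ch)
            i += length
        else:
            # Invalid byte, or the start of an encoded escape codepoint:
            # escape this single byte and resynchronise on the next one.
            out.append(chr(ESC_BASE + data[i]))
            i += 1
    return "".join(out)


def text_to_bytes(text: str) -> bytes:
    """Inverse of bytes_to_text: escape characters become their raw byte."""
    out = bytearray()
    for ch in text:
        if _is_escape(ch):
            out.append(ord(ch) - ESC_BASE)
        else:
            out += ch.encode("utf-8")
    return bytes(out)


raw = b"caf\xe9 \xc3\xa9 \xee\xba\x80"   # Latin-1 byte, valid UTF-8, encoded U+EE80
text = bytes_to_text(raw)
restored = text_to_bytes(text)           # equals raw
```

    In this sketch, bytes -> text -> bytes is lossless for any input, but
    text -> bytes -> text is not: the two escape characters U+EEC3 U+EEA9 turn
    into the bytes C3 A9, which decode back as the single character U+00E9.
    That asymmetry is one place where the security questions mentioned earlier
    would have to be looked at.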

    One can compare it with MBCSs and say the same thing could have been done
    there but wasn't. But actually there was less need for it. Many SBCSs have no
    unassigned codepoints, and MBCSs were too busy with their own problems to
    worry about cross-compatibility at this level. But Unicode has learned a lot
    from mistakes made there, and can be better in every aspect. Shouldn't it
    be?

    Anyway, if a very good Transform-Encoding-Syntax is devised, UTC could
    recognise the fact that everyone would benefit from it. If it means
    assigning 128 codepoints, then that is the price. And one can hardly say it
    has nothing to do with Unicode. It uses Unicode for transport. And Unicode
    can benefit from it itself.

    Lars


