RE: Is it roundtripping or transfer-encoding

From: Lars Kristan (lars.kristan@hermes.si)
Date: Tue Dec 21 2004 - 07:14:40 CST


    Philippe Verdy wrote:
    > No, it won't happen, because Unicode and ISO/IEC-10646
    > already states that
    > it encodes abstract characters.

    I see that as a technicality. What matters are the consequences of rules,
    not the rules themselves. The consequences of breaking a rule should be
    analyzed thoroughly and carefully, and if they are acceptable (manageable)
    and the usefulness is established, then the rules should be reinterpreted.
    So I think the UTC needs to interpret the rules, not follow them literally.
    The rest of us should be allowed to interpret the rules on our own and make
    suggestions. An attempt to break a rule should not constitute a
    show-stopper for a useful concept, especially not while the analysis of
    the consequences is still in progress.

    >
    > This finally mean that you want these codepoints recognized
    > as characters
    > sometimes,

    And that is exactly how they should be treated by UTFs. And they already
    are. There is no conflict there.

    > but not when you perform the conversion with a
    > transform-encoding-syntax. A transform-encoding-syntax must
    > also not modify
    > the codepoints represented by an encoding scheme (or
    > charset), and UTFs have
    > also the property of having a single representation of these
    > codepoints

    OK, so it introduces multiple representations of the same codepoints. Every
    escaping technique does that, and it is not a problem. All you need to do
    is define the normalization procedure and use it where it applies. In many
    cases its use is not even necessary. Specifically, a Unicode system does
    not need to (and should not) normalize the escape codepoints. The need for
    normalization only has to be determined for an application that uses the
    TES itself, and it applies only in a few cases.
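
    For concreteness, here is a minimal sketch of such a conversion, in
    Python. The 128 escape codepoints are an assumption of the sketch: no such
    characters are assigned, so the PUA range U+EE00..U+EE7F stands in for
    them below, one codepoint per invalid byte value 0x80..0xFF.

        ESCAPE_BASE = 0xEE00  # hypothetical stand-in for the 128 new characters

        def decode_mutf8(data: bytes) -> str:
            """Decode UTF-8, but map each byte of an invalid sequence to an
            escape codepoint instead of dropping it or substituting U+FFFD."""
            out = []
            i = 0
            while i < len(data):
                # Try the longest well-formed UTF-8 chunk at position i.
                for n in (4, 3, 2, 1):
                    try:
                        out.append(data[i:i + n].decode('utf-8'))
                        i += n
                        break
                    except UnicodeDecodeError:
                        continue
                else:
                    # Invalid byte (always >= 0x80): escape it, so the
                    # original data survives the conversion.
                    out.append(chr(ESCAPE_BASE + (data[i] - 0x80)))
                    i += 1
            return ''.join(out)

        def encode_mutf8(text: str) -> bytes:
            """The inverse conversion: escape codepoints become raw bytes."""
            out = bytearray()
            for ch in text:
                cp = ord(ch)
                if ESCAPE_BASE <= cp < ESCAPE_BASE + 0x80:
                    out.append(0x80 + (cp - ESCAPE_BASE))
                else:
                    out.extend(ch.encode('utf-8'))
            return bytes(out)

    One codepoint per invalid byte keeps the conversion stateless and makes
    every escape self-describing.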

    CESU-8 has similar problems. If it is misinterpreted as UTF-8, it
    self-normalizes when it trips through UTF-16. My data self-normalizes if
    it trips through non-UTF-8 (or shall we call it MUTF-8, Mostly-UTF-8, at
    the risk of being called a mutant :). CESU-8 is slightly simpler, because
    it self-normalizes completely, and it can also always be normalized back
    to a CESU-8 representation. My conversion only normalizes partially (it
    normalizes completely only after length/3 trips, in the worst case). Also,
    after a full normalization you can no longer tell how many times the data
    was escaped in its original form. In real life this is often acceptable,
    and far better than not being able to handle invalid sequences as
    gracefully as the MUTF-8 conversion does.
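
    To make the self-normalization concrete, here is what happens (with the
    hypothetical sketch above) when escaped text is re-encoded with a plain
    UTF-8 encoder instead of the inverse conversion:

        raw = b'abc\xff.txt'            # a filename with one invalid byte
        s = decode_mutf8(raw)           # the byte survives as an escape codepoint
        assert encode_mutf8(s) == raw   # the proper inverse roundtrips losslessly

        # But if the string trips through a plain UTF-8 encoder, the escape
        # codepoint is encoded as itself and the original byte is gone:
        normalized = s.encode('utf-8')         # b'abc\xee\xb9\xbf.txt'
        assert decode_mutf8(normalized) == s   # stable from here on: "normalized"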

    The above is a loose description of what happens. Not all cases are
    covered systematically, but they can be. You can define, for example, that
    escape sequences which normalize to new escape sequences or to invalid
    sequences in UTF-8 are valid (or expected), while those that normalize to
    other codepoints could be considered invalid, or ill-formed. But again,
    that only matters in a few specific cases. It matters if you were handling
    users this way, but not if you are mapping filenames. Even less if one
    wants to apply this technique to editing text files.
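
    That rule can be expressed directly in terms of the two conversions
    sketched earlier; escapes_are_benign is a hypothetical helper of this
    sketch, not part of any proposal:

        def escapes_are_benign(text: str) -> bool:
            """Acceptable per the rule above: the escapes restore bytes that
            are invalid (and so re-escape to the same escapes). Ill-formed:
            the restored bytes spell some *other* valid codepoint."""
            return decode_mutf8(encode_mutf8(text)) == text

        # The escapes for 0xC2 and 0xA9 together spell '©' when restored, so
        # a string containing them back to back is flagged as ill-formed:
        assert not escapes_are_benign(chr(0xEE42) + chr(0xEE29))
        assert escapes_are_benign(chr(0xEE42))   # a lone escape is benign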

    There are two options for using this technique:

    A - You can treat it as 'use it in rare cases'. UTF-8 then remains what it
    is, and existing Unicode applications already treat those codepoints
    exactly as they should.

    B - You can start using it wherever you convert to or (well, and) from
    UTF-8. Typically you need to do it in both directions, or else you risk
    over-escaping in one case and self-normalization in the other (see the
    sketch after this list). The latter can even be useful in some cases,
    specifically where graceful handling is desired but roundtripping is not
    required.
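
    A small demonstration of why option B calls for both directions, again
    using the hypothetical sketch above; mixing the new conversion with plain
    UTF-8 in either direction loses data:

        # New decode + plain encode: the escape codepoint self-normalizes,
        # and the original invalid byte is not restored.
        s = decode_mutf8(b'a\xffb')
        assert s.encode('utf-8') != b'a\xffb'

        # New encode + plain decode: a character that happens to fall in the
        # (stand-in) escape range becomes a raw byte plain UTF-8 cannot read.
        t = 'x' + chr(0xEE00) + 'y'
        raw = encode_mutf8(t)                              # b'x\x80y'
        assert raw.decode('utf-8', errors='replace') != t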

    Now, case B is what I said I would not be trying to do, and that is
    replacing the UTF-8 conversions with a new conversion. But the
    consequences of that can be determined. In the long run it actually
    reduces the risks of over-escaping and self-normalization. The major
    'problem' that most people brought up is that it threatens to introduce
    invalid sequences into UTF-8, which would mean that all UTF-8 readers
    would need to start handling them. Perhaps. If they knew how, it wouldn't
    be that hard anyway. But then again, what about the period when they
    don't, and what if they decide never to? Well, does it really matter
    whether they got the data directly from a corrupted source or from an
    application that managed to preserve and reconstruct it? So, it is not
    introducing, it is preserving.

    It is a question of signalling or raising an exception. Some applications
    have no way of signalling an error. Signalling "as early as possible" is,
    in my opinion, an excuse in this case. Signalling should be done at the
    point where the user can make decisions and is able to fix the problem.
    And even at that point, you have users that do want the signalling and
    users that don't, and the latter are the majority. From the perspective of
    a standardizer that may seem unwise, but in real life usability prevails.
    Did you ever see an ls command on UNIX warn you about invalid sequences?
    Of course not; it would be completely unusable. In fact, many UTF-8
    decoders (or renderers) don't even use U+FFFD, they simply drop the
    sequence. Very bad. But no matter how you improve it, signalling will
    never be an option, not in ls, not while rendering. And U+FFFD is not a
    very good option either.
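
    The three behaviours side by side, for a mangled name. The first two
    lines use Python's standard error handlers; decode_mutf8 is the
    hypothetical sketch from earlier:

        data = b'report\xe2\x28\xa1.txt'          # corrupted UTF-8
        data.decode('utf-8', errors='ignore')     # 'report(.txt' - silently dropped
        data.decode('utf-8', errors='replace')    # 'report\ufffd(\ufffd.txt' - lossy
        decode_mutf8(data)                        # every byte preserved, displayable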

    > If you want to strictly limit the case where escaping of
    > valid characters
    > will happen, the best option you have in Unicode is to use
    > PUAs which are
    > the least likely to happen in original strings (of characters and
    > non-characters), in absence of an explicit agreement.

    Assigning new characters is then even better.

    >
    > Note that a Transfer-Encoding-Syntax, to be usable, requires
    > an explicit
    > mutual agreement to allow the conversion in either direction.

    That explicit agreement is one of the things I am trying to avoid. It can be
    avoided, and that is the intent of standards.

    But I am not so sure this should be called a TES after all. It has often
    been suggested or implied that what I do is completely internal and
    enclosed. But that is not true. I started by storing the filenames in
    UTF-16. But, eventually, the filenames can be displayed on Windows, or
    created in a Windows filesystem (with a few additional restrictions
    compared to displaying, but only ones that had already existed before).

    > PUA, or it may even ignore "silently" these PUAs in the
    > rendered graphic,
    > signaling elsewhere to the user that not all characters could
    > be rendered

    I would say, "may, but *only* if it signals". And the same goes for
    invalid sequences. But it is not done that way, far too often. Lots of
    this will need to be fixed. By using U+FFFD? There is a better way: use
    128 new characters. You can look at all of this from the other end. First,
    allow (and provide a means for) renderers to display an invalid UTF-8
    sequence (for example in an ls command). A useful thing. The rest comes
    naturally.
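
    An illustration of that first step, again with the hypothetical sketch
    from above; Python's bytes-based directory listing stands in for what an
    ls implementation does internally:

        import os

        # List raw (bytes) filenames and display every one of them, with
        # invalid sequences shown via escape codepoints instead of being
        # dropped or replaced.
        for name in os.listdir(b'.'):
            print(decode_mutf8(name))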

    > There are tons of existing TES used everyday in many
    > applications, and none
    > of them required the allocation of distinct codepoints for
    > the encoded
    > strings they generate. Why do you want new characters for
    > this mapping? It's
    > not necessary as demonstrated by all the other existing TES...

    Four reasons:
    1 - Display. Having new characters (or escape codepoints with a defined
    appearance) allows the text to remain visually similar. The length of the
    text is preserved in many cases, words are easier to read (or deduce), and
    line breaks cause fewer problems. All pretty similar to how mixed-encoding
    environments have behaved all this time. No other escaping technique can
    provide this. BTW, U+FFFD can, but it is lossy.
    2 - Other escaping techniques do not retain the usual assumption that
    UTF-16 is at most twice as big (in bytes) as UTF-8, or MUTF-8 (see the
    sketch after this list). That can lead to bugs and increased memory
    consumption.
    3 - The PUA solution works well, but has some inherent risks, and cannot
    be standardized.
    4 - Anyone who encounters the same problem that I have encountered might
    devise a new escaping technique, adding a few kilos to those tons.
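
    A quick check of the size assumption in reason 2, using the hypothetical
    conversion from earlier: one escape codepoint per invalid byte keeps the
    UTF-16 form within twice the byte length, while a textual escape such as
    "%FF" does not:

        raw = b'a\xffb\xffc\xffd\xff'      # 8 bytes, 4 of them invalid
        escaped = decode_mutf8(raw)        # 8 BMP codepoints
        assert len(escaped.encode('utf-16-le')) <= 2 * len(raw)   # bound holds

        # A '%FF'-style escape yields 3 codepoints per invalid byte:
        textual = ''.join(chr(b) if b < 0x80 else '%%%02X' % b for b in raw)
        assert len(textual.encode('utf-16-le')) > 2 * len(raw)    # bound broken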

    It was impossible to afford assigning any codepoints for this, let alone
    128, in an SBCS. Nobody thought of it for MBCS. But those were dealing
    with conversions from SBCSs, which don't have invalid sequences and have
    very few unassigned positions. And they were able to preserve the invalid
    sequences. If UTF-8 were to replace them all, we wouldn't need this
    either, since UTF-8 also CAN preserve invalid sequences. Well, it would be
    nice if they could be displayed and collated, but perhaps even that would
    succeed, since there would be no other UTFs and the many-to-one issue
    would not exist. The problem of invalid sequences is a Unicode problem.
    Not addressing it will not make it go away.

    Lars


