Re: [icu-support] complete binary/utf mapping

From: Doug Ewell (dewell@roadrunner.com)
Date: Fri Sep 14 2007 - 18:00:21 CDT


    Philippe Verdy <verdy underscore p at wanadoo dot fr> wrote:

    > Given the very strict conformance requirements in UTF-8, this leaves
    > enough encoding space for such an extension without colliding with
    > standard UTF-8. So such an extension is clearly possible in many ways,
    > as long as the UTF-8 strict conformance requirements are kept, without
    > breaking compatibility with UTF-8 itself.

    It's logically impossible to extend a specification that explicitly
    leaves no room for extension, such as UTF-8, and simultaneously "keep
    the conformance requirements" of that specification.

    > What this means is that UTF-8 does not need to be extended itself, and
    > it should not: the extension will be private, and would need to be
    > labelled differently.

    If it's truly private, then there should be no need to label it at all.
    People who want to use it can just send and receive it according to
    their private agreement, and make sure it never leaks out into the wild.
    That's what should have happened with CESU-8.

    > However, I would not use UTF-8 for such thing. If one wants to
    > represent invalid UTF-8 sequences unchanged, the best thing that can
    > be done is just to do NOTHING about these sequences, but relabel the
    > text with a distinct charset identifier like "invalid-UTF-8" which
    > will just allow all valid UTF-8 sequences plus some binary sequences
    > of bytes which are not valid UTF-8, and in such extension charset,
    > treat each byte as if it was a non-Unicode codepoint, like U+110000
    > plus the value of each byte.

    This might seem like a good idea to someone with no background in
    Unicode, and I would expect a newcomer to this list to propose a
    mechanism like this.

    The problem is that this has already been proposed and tried, and the
    result is always the same: the new "invalid UTF-8" or "extended UTF-8"
    format looks too much like real UTF-8, and causes problems when people
    try to generate or parse it with conformant UTF-8 tools, as they
    invariably will. Character-set labels don't help much, especially not
    private ones; parsers that don't recognize a label will attempt to
    figure out the encoding heuristically.
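
    For concreteness, here is roughly what the decode side of such a
    scheme would have to look like. This is only my own sketch, not
    anything Philippe specified; the names and the exact arithmetic are
    illustrative:

        # Sketch only: decode valid UTF-8 sequences normally, and map
        # each byte of an invalid sequence to the out-of-range value
        # 0x110000 + byte.  Nothing here is standard.
        def decode_invalid_utf8(data: bytes) -> list[int]:
            out = []
            i = 0
            while i < len(data):
                for length in (4, 3, 2, 1):
                    chunk = data[i:i + length]
                    try:
                        ch = chunk.decode('utf-8')
                    except UnicodeDecodeError:
                        continue
                    if len(ch) != 1:
                        continue
                    out.append(ord(ch))      # a real scalar value
                    i += length
                    break
                else:
                    # Not the start of any valid sequence: invent a
                    # "codepoint" beyond U+10FFFF for the raw byte.
                    out.append(0x110000 + data[i])
                    i += 1
            return out

    Note that the result cannot even be held in an ordinary Unicode
    string: 0x110000 + byte is not a Unicode scalar value, so every
    conformant API between this decoder and the application will reject
    it or mangle it.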

    In any case, the suggestion to label invalid UTF-8 assumes that the
    text can be labeled at all. Not all text is in XML or HTML or some
    other markup language with a place to put such a label. Extended
    formats are a Bad Thing. New encodings, if any, should not "look
    like" any other encoding; in particular, the representation of U+FEFF
    must be different, so that signature detection cannot mistake one
    encoding for the other.

    Nobody has yet answered the question of *what good* it would do to
    preserve binary garbage in a UTF-8 or UTF-8-like text stream. Do we
    impose a similar requirement on other Unicode text encodings, or other
    non-Unicode text encodings?

    > Transcoding it to strict UTF-32 would be impossible but transcoding it
    > to "invalid-UTF-32" would be extremely basic. Transcoding it to UTF-16
    > would also be impossible but could be made using sequences forbidden
    > in standard UTF-16, such as an unpaired surrogate.

    See above. These are eminently possible, eminently bad ideas.
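
    To see just how possible, here is one way the unpaired-surrogate
    trick could be written. Again, this is my own sketch; the choice of
    mapping byte B to the lone low surrogate U+DC00 + B is arbitrary:

        # Sketch only: smuggle raw bytes through a 16-bit stream as
        # deliberately ill-formed, unpaired low surrogates.
        def bytes_to_invalid_utf16(data: bytes) -> bytes:
            units = ''.join(chr(0xDC00 + b) for b in data)
            # 'surrogatepass' is needed precisely because no conformant
            # UTF-16 encoder will emit an unpaired surrogate.
            return units.encode('utf-16-le', 'surrogatepass')

    Easy to write, and any strict UTF-16 decoder will, quite correctly,
    refuse to read the result.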

    > Document parsers using these "invalid-UTF-8" or "invalid-UTF-16" or
    > "invalid-UTF-32" charsets should then need specific character parsers
    > to recognize the invalid sequences and treat them as distinctive
    > objects (similar but not equivalent to valid characters).

    Then let the private users of these encodings build these parsers and
    use them "in their own backyard," as Roman Czyborra once put it. Don't
    expose them to the rest of us.

    > What is clear is that these documents won't be portable across systems
    > in a heterogeneous system, but such use is still possible locally and
    > their acceptation, use or non use, is left to applications and
    > developers, as long as they don't pretend that the output of the
    > programs accepting these documents is not tagged as UTF-8 or other
    > standard UTF if it still contains the detected invalid sequences after
    > internal processing of the accepted input document.

    See above. Don't do it.

    Here is how bad it is to create an "extended" version of a standard
    encoding: Ten years ago, ACAP invented a format called "Multi-Lingual
    String Format (MLSF)" that used invalid UTF-8-like sequences to encode
    language tags. They wrote an Internet-Draft [1] specifying the format,
    and they were careful to fill the I-D with clear warnings like this:

    "Note that MLSF is not compatible with UTF-8. A program which uses MLSF
    MUST downconvert it to UTF-8 prior to using it in a context where UTF-8
    is required. Sample code for this down conversion is included in
    Appendix B."

    Despite this, the approach was considered so bad that Unicode quickly
    devised an alternative, conformant mechanism to accomplish the same
    thing -- Plane 14 tags -- and then *immediately* discouraged the use of
    this new mechanism except in the most extraordinary of circumstances.
    Even today, 10 years later, a group that has a use case for Plane 14
    tags very similar to that of ACAP is being told that "using them is
    still a really bad idea" [2]. As bad as that sounds, Plane 14 tags are
    still VASTLY BETTER than a new UTF-8 lookalike encoding.
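
    For comparison, here is roughly what the conformant alternative
    looks like (again a sketch of my own). A Plane 14 language tag is a
    run of ordinary, assigned codepoints: U+E0001 LANGUAGE TAG followed
    by tag characters that shadow printable ASCII at U+E0020..U+E007E,
    not an ill-formed byte sequence masquerading as UTF-8:

        # Sketch only: spell a language tag with Plane 14 tag characters.
        def plane14_language_tag(tag: str) -> str:
            # Only printable ASCII (0x20..0x7E) has a tag counterpart.
            assert all(0x20 <= ord(c) <= 0x7E for c in tag)
            return '\U000E0001' + ''.join(
                chr(0xE0000 + ord(c)) for c in tag)

        tagged = plane14_language_tag('en-US') + 'Hello, world'

    Every code unit of that is well-formed in UTF-8, UTF-16 and UTF-32,
    which is exactly why it is merely discouraged rather than forbidden.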

    Don't do it.

    [1] http://xml.coverpages.org/draft-ietf-acap-mlsf-01.txt
    [2] http://www1.ietf.org/mail-archive/web/ltru/current/msg08497.html

    --
    Doug Ewell * Fullerton, California, USA * RFC 4645 * UTN #14
    http://users.adelphia.net/~dewell/
    http://www1.ietf.org/html.charters/ltru-charter.html
    http://www.alvestrand.no/mailman/listinfo/ietf-languages
    

