RE: [icu-support] complete binary/utf mapping

From: Philippe Verdy
Date: Wed Sep 12 2007 - 06:00:29 CDT


    Mark E. Shoulson wrote:
    > Doug Ewell wrote:
    > > I'll see if I can find the thread where we talked about that, years ago.
    > >
    > > Somebody wanted to build that capability into an extension to UTF-8,
    > > so it could faithfully represent invalid garbage. We were never able
    > > to get him to work through what he wanted to do with the garbage thus
    > > preserved.
    > >
    > Is there an obvious reason we couldn't just treat the garbage UTF-8 as a
    > string of 8-bit characters (might be part of a binary file or something)
    > and base-64 encode them? That'll definitely preserve round-trippedness.

    Given UTF-8's very strict conformance requirements, there is enough
    unused encoding space for such an extension without colliding with
    standard UTF-8. Such an extension is clearly possible in many ways, as
    long as the strict UTF-8 conformance requirements are kept and
    compatibility with UTF-8 itself is not broken.

    What this means is that UTF-8 itself does not need to be extended, and
    it should not be: the extension would be private, and would need to be
    labelled with its own distinct charset name.

    However, I would not use UTF-8 for such a thing. If one wants to
    represent invalid UTF-8 sequences unchanged, the best thing to do is
    NOTHING about these sequences, but to relabel the text with a distinct
    charset identifier like "invalid-UTF-8", which would allow all valid
    UTF-8 sequences plus those binary byte sequences that are not valid
    UTF-8, and, in such an extension charset, treat each invalid byte as if
    it were a non-Unicode code point, such as U+110000 plus the value of
    that byte.
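    As a minimal sketch of this idea (hypothetical function names, and with
    the assumed mapping byte -> 0x110000 + byte), one could decode such
    text into a list of plain integers, since no standard string type can
    hold code point values above U+10FFFF:

```python
def decode_invalid_utf8(data: bytes) -> list[int]:
    """Decode to a list of code point values; bytes that are not part of
    any valid UTF-8 sequence become the pseudo-code point 0x110000 + byte."""
    out, i = [], 0
    while i < len(data):
        for length in (1, 2, 3, 4):
            chunk = data[i:i + length]
            try:
                ch = chunk.decode("utf-8")   # strict: rejects overlongs,
            except UnicodeDecodeError:       # surrogates, truncations, etc.
                continue
            out.append(ord(ch))
            i += length
            break
        else:
            out.append(0x110000 + data[i])   # non-Unicode pseudo-code point
            i += 1
    return out

def encode_invalid_utf8(cps: list[int]) -> bytes:
    """Inverse mapping: pseudo-code points go back to their raw byte."""
    return b"".join(
        bytes([cp - 0x110000]) if cp > 0x10FFFF else chr(cp).encode("utf-8")
        for cp in cps)
```

    The round trip is exact: every invalid byte survives as its own
    pseudo-code point, and re-encoding restores the original byte string.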

    Transcoding it to strict UTF-32 would be impossible, but transcoding it
    to "invalid-UTF-32" would be trivial. Transcoding it to UTF-16 would
    likewise be impossible, but could be done using sequences forbidden in
    standard UTF-16, such as unpaired surrogates.
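    The unpaired-surrogate approach is, incidentally, exactly what Python's
    surrogateescape error handler (PEP 383) does: each undecodable byte
    0x80..0xFF is represented as the lone low surrogate U+DC00 + byte,
    which no well-formed UTF-16 text can contain, and which round-trips
    back to the original byte:

```python
raw = b"valid \xc3\xa9 then bad \xff\xfe"
text = raw.decode("utf-8", errors="surrogateescape")

# The valid parts decode normally; the bad bytes become lone surrogates.
assert text[:8] == "valid \xe9 "
assert [hex(ord(c)) for c in text[-2:]] == ["0xdcff", "0xdcfe"]

# Round trip: re-encoding restores the original byte sequence exactly.
assert raw == text.encode("utf-8", errors="surrogateescape")
```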

    Document parsers using these "invalid-UTF-8", "invalid-UTF-16" or
    "invalid-UTF-32" charsets would then need specific character parsers to
    recognize the invalid sequences and treat them as distinctive objects
    (similar, but not equivalent, to valid characters). How these document
    parsers treat these pseudo-characters is left to implementations, and
    Unicode does not need to be updated. It is then up to applications to
    decide how to treat these objects, exactly as it is left to
    applications to decide what to do with documents that are not correctly
    encoded in the standard UTF they are tagged with.

    What is clear is that these documents won't be portable across
    heterogeneous systems, but such use is still possible locally, and
    their acceptance or rejection is left to applications and developers,
    as long as the output of a program accepting these documents is not
    tagged as UTF-8 (or another standard UTF) while it still contains the
    detected invalid sequences after internal processing of the accepted
    input document.

    As long as such an application can correctly signal to the user that
    the input documents were accepted despite containing invalid sequences,
    this remains safe. (It is not safe, however, if the input document was
    explicitly tagged with a standard UTF charset name and the application
    accepts and processes it silently, producing valid UTF without
    signaling the interpretation caveat to the user.)

    For example an input filter could be invoked with:

            $ someFilter -inputcharset "UTF-8" -outputcharset "UTF-8" \
                < someDocument.txt > result.UTF-8.txt

    It should not generate the expected output (it should signal an error
    instead) if there are invalid UTF-8 sequences in the input document,
    but the same filter program could be built to accept:

            $ someFilter -inputcharset "x-invalid-UTF-8" -outputcharset "UTF-8" \
                < someDocument.txt > result.UTF-8.txt

    with the same input document, and produce perfectly valid standard
    UTF-8 output in "result.UTF-8.txt", because it does not pretend that
    "someDocument.txt" is a standard UTF-8 text (so this filter is still
    fully conformant to Unicode, which does not dictate how filters should
    treat other charsets that are not standard UTFs).

    Unicode does not say how the non-standard "extension" charset should be
    named. In fact many schemes are possible, and each one defines a new,
    separate charset; my opinion is that such "extension" charsets should
    completely avoid containing "UTF" in their names, to avoid confusion.

    This archive was generated by hypermail 2.1.5 : Wed Sep 12 2007 - 06:03:04 CDT