RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)

From: Lars Kristan (lars.kristan@hermes.si)
Date: Sat Dec 11 2004 - 11:34:19 CST


    Kenneth Whistler wrote:
    > Further, as it turns out that Lars is actually asking for
    > "standardizing" corrupt UTF-8, a notion that isn't going to
    > fly even two feet, I think the whole idea is going to be
    > a complete non-starter.

    Technically, I am not asking for anything. I am just trying to discuss an
    approach which I think can be used to solve certain problems, and this
    approach does not need to be conformant at this point. If someone finds it
    worthwhile to make it conformant, even better, but for now that is
    irrelevant to the discussion, unless it is proven that the approach cannot
    be made conformant (by changing or amending the standard) because I have
    missed an important fact. So far, I have not seen such a proof.

    But suppose I am asking, and therefore proposing; it would be several
    separate items:

    1 - To assign codepoints for 128 (or 256) new surrogates(*), used for:
    1.1 - Representing unassigned values when converting from an encoding to
    Unicode (optional).
    1.2 - Representing invalid sequences when interpreting UTF-8 (optional).
    The use of these would not be mandatory. Existing handling would remain an
    option and could be preserved wherever it suits the needs at hand, or
    changed where the new behavior is beneficial.

    Representation of these codepoints in UTF-8 would be as per current
    standard.
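
    To make item 1.2 concrete, here is a minimal sketch in Python of decoding
    UTF-8 with 128 escape codepoints. The block starting at U+A400 is an
    assumption chosen only for illustration; the proposal deliberately leaves
    the actual block open (see item 4). Internally the sketch leans on
    Python's 'surrogateescape' error handler (PEP 383), which parks each byte
    of an invalid sequence at U+DC80..U+DCFF.

```python
# BASE is a hypothetical block start, for illustration only;
# the proposal does not fix the actual codepoint block.
BASE = 0xA400

def decode_with_escapes(data: bytes) -> str:
    # 'surrogateescape' (PEP 383) maps each byte of an invalid sequence
    # to U+DC80..U+DCFF; remap those into the escape block. Bytes that are
    # part of valid UTF-8 decode normally.
    tmp = data.decode('utf-8', 'surrogateescape')
    return ''.join(
        chr(BASE + ord(c) - 0xDC80) if 0xDC80 <= ord(c) <= 0xDCFF else c
        for c in tmp
    )

# A lone 0xE9 byte is not valid UTF-8, so it becomes one escape codepoint:
assert decode_with_escapes(b'caf\xe9') == 'caf' + chr(BASE + 0x69)
```

    Note that invalid bytes are always in the range 0x80..0xFF (pure ASCII
    bytes always decode), which is why 128 codepoints suffice, matching the
    "128 (or 256)" above.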

    2 - An alternative conversion from Unicode, to, say, UTF-8E (UTF-8E is
    _NOT_ Unicode(*)).
    This conversion would reconstruct the original byte sequence from a
    Unicode string obtained by 1.2. The conversion pair is intended for use on
    platform or interface boundaries, if and where it is determined to be
    suitable. For example, interfacing a UNIX filesystem with a UTF-8 pipe
    would require UTF-8E<=>UTF-8 conversion; interfacing a UNIX filesystem
    with a Windows filesystem would require UTF-8E<=>UTF-16 conversion.
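
    A sketch of the Unicode-to-UTF-8E direction, again assuming a
    hypothetical escape block at U+A400 chosen only for illustration: every
    ordinary codepoint is emitted as regular UTF-8, while each escape
    codepoint is turned back into the single raw byte it stood for, so the
    original byte sequence is reconstructed exactly.

```python
# Hypothetical escape block start, for illustration only; the proposal
# does not fix the actual block.
BASE = 0xA400

def encode_utf8e(s: str) -> bytes:
    # Ordinary codepoints become regular UTF-8; each of the 128 escape
    # codepoints becomes the raw byte 0x80..0xFF it originally stood for.
    out = bytearray()
    for c in s:
        cp = ord(c)
        if BASE <= cp < BASE + 128:
            out.append(0x80 + (cp - BASE))  # restore the original byte
        else:
            out.extend(c.encode('utf-8'))
    return bytes(out)

# A string holding 'a', the escape for raw byte 0xFF, then 'b':
assert encode_utf8e('a' + chr(BASE + 0x7F) + 'b') == b'a\xffb'
```

    Composing the decoding in 1.2 with this conversion returns the exact
    original bytes, which is the roundtrip property the proposal relies on.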

    (*) If proposal #2 were not accepted, the codepoints in proposal #1 would
    actually not be surrogates, but simply codepoints and nothing else. Even
    if proposal #2 is accepted, it is still not clear whether they should
    really be called surrogates, since they would convert among all UTFs just
    like any other codepoint; only their representation in UTF-8E would
    differ. Note that UTF-8E is not Unicode, but would be standardized in
    Unicode. If the U in UTF is a problem, then any other name can be chosen.
    Consider it a working name and be aware of what it is and is not.

    3 - If the UTC cannot agree that the BMP should be used for proposal #1, I
    would advise against a decision to assign non-BMP codepoints for the
    purpose. I believe less damage would be done by postponing the decision
    than by making a wrong one. It is not just a matter of how much disk space
    or bandwidth is used. For example, if both filesystems have a
    256-character limit for a filename, the limitations are consistent (at
    least in one direction) if the BMP is used, since each escape codepoint
    then occupies a single UTF-16 code unit, and not if any other plane is
    used.

    4 - If neither of the proposals is accepted, it would be beneficial if the
    UTC would manage to preserve at least one suitable block of 256 codepoints
    intact (for example U+A4xx or U+ABxx) to facilitate a future decision.

    Lars Kristan



    This archive was generated by hypermail 2.1.5 : Sat Dec 11 2004 - 11:39:32 CST