RE: Roundtripping in Unicode

From: Lars Kristan (lars.kristan@hermes.si)
Date: Tue Dec 14 2004 - 08:38:55 CST

    Arcane Jill wrote:
    > I've been following this thread for a while, and I've pretty
    Thanks for bearing with me. And I hope my response will not discourage you
    from continuing to do so. That is, until I am banned from the list for
    heresy.

    > much got the
    > hang of the issues here. To summarize:
    >
    > Unix filenames consist of an arbitrary sequence of octets,
    > excluding 0x00
    > and 0x2F. How they are /displayed/ to any given user depends
    > on that user's
    > locale setting. In this scenario, two users with different
    > locale settings
    > will see different filenames for the same file, but they will
    > still be able
    > to access the file via the filename that they see. These two
    > filenames will
    > be spelt identically in terms of octets, but (apparently)
    > differently when
    > viewed in terms of characters.
    >
    > At least, that's how it was until the UTF-8 locale came along. If we
    I think such problems were already present with Shift-JIS. But I have
    already stated once why this was not noticed, and will not repeat myself
    unless explicitly asked to do so.

    > consider only one-byte-per-character encodings, then any
    > octet sequence is
    > "valid" in any locale. But UTF-8 introduces the possibility
    > that an octet
    > sequence might be "invalid" - a new concept for Unix. So if
    > you change your
    > locale to UTF-8, then suddenly, some files created by other
    > users might
    > appear to you to have invalid filenames (though they would
    > still appear
    > valid when viewed by the file's creator).
    >
    > A specific example: if a file F is accessed by two different
    > users, A and B,
    > of whom A has set their locale to Latin-1, and B has set
    > their locale to
    > UTF-8, then the filename may appear to be valid to user A,
    > but invalid to
    > user B.
    >
    > Lars is saying (and he's probably right, because he knows
    > more about Unix
    > than I) that user B does not necessarily have the right to
    > change the actual
    > octet sequence which is the filename of F, just to make it
    > appear valid to
    > user B, because doing so would stop a lot of things working
    > for user A (for
    > instance, A might have created the file, the filename might
    > be hardcoded in
    > a script, etc.). So Lars takes a Unix-like approach, saying
    > "retain the
    > actual octet sequence, but feel free to try to display and
    > manipulate it as
    > if it were some UTF-8-like encoding in which all octet
    > sequences are valid".
    > And all this seems to work fine for him, until he tries to
    > roundtrip to
    > UTF-16 and back.
    >
    > I'm not sure why anyone's arguing about this though -
    > Philippe's suggestion
    > seems to be the perfect solution which keeps everyone happy. So...
    Well, it doesn't. The rest of my comments will show you why.

    >
    > ...allow me to construct a specific example of what Philippe
    > suggested only
    > generally:
    >
    > DEFINITION - "NOT-Unicode" is the character repertoire
    > consisting of the
    > whole of Unicode, and 128 additional characters representing
    > integers in the
    > range 0x80 to 0xFF.
    As long as we agree that the codepoints used to store the NOT-Unicode data
    are valid Unicode codepoints. You noticed yourself that NOT-Unicode should
    roundtrip through UTF-16. Only valid Unicode codepoints can be safely passed
    through UTF-16.
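
    A quick Python illustration of that last point, assuming a strict UTF-16
    codec is in use: any valid scalar value round-trips, while a lone surrogate
    is rejected outright.

        # "Only valid Unicode codepoints can be safely passed through UTF-16":
        # a strict UTF-16 codec round-trips any scalar value but refuses to
        # carry a lone surrogate at all.
        assert "\uE080".encode("utf-16-le").decode("utf-16-le") == "\uE080"

        try:
            "\uDC80".encode("utf-16-le")      # unpaired low surrogate
        except UnicodeEncodeError as err:
            print("rejected:", err.reason)    # 'surrogates not allowed'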

    >
    > OBSERVATION - Unicode is a subset of NOT-Unicode
    But unfortunately data can pass from NOT-Unicode to Unicode. Some people
    think that this is terribly bad. One would think that storing NOT-UTF-8 in
    NOT-UTF-16 would prevent data from crossing the boundary, but that is not
    so.

    >
    > DEFINITION - "NOT-UTF-8" is a bidirectional encoding between
    > a NOT-Unicode
    > character stream and an octet stream, defined as follows: if
    > a NOT-Unicode
    > character is a Unicode character then its encoding is the
    > UTF-8 encoding of
    > that character; else the NOT-Unicode character must represent
    > an integer, in
    > which case its encoding is itself. To decode, assume the next
    > NOT-Unicode
    > character is a Unicode character and attempt to decode from
    > the octet stream
    > using UTF-8; if this fails then the NOT-Unicode character is
    > an integer, in
    > which case read one single octet from the stream and return it.
    More or less. You have not defined how to return the octet. It must be
    returned as a valid Unicode codepoint. And if a Unicode character is
    decoded, one must check whether it is one of the codepoints used for this
    purpose and escape it. But only when decoding NOT-UTF-8; decoding from UTF-8
    remains unchanged.
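
    A minimal Python sketch of the decoder with this refinement, assuming a
    hypothetical block of 128 escape codepoints (U+EE80..U+EEFF in the Private
    Use Area is used below purely as a placeholder; no such range is actually
    assigned) and assuming that a genuine character from that range is escaped
    by escaping its own UTF-8 octets, which is only one possible reading:

        # Sketch of NOT-UTF-8 with the refinement above: un-decodable octets map
        # to valid codepoints, and genuine occurrences of those codepoints are
        # escaped. U+EE80..U+EEFF is a placeholder range, not an assigned one.
        ESCAPE_BASE = 0xEE00                     # escape for octet b is chr(ESCAPE_BASE + b)

        def escape(b: int) -> str:
            return chr(ESCAPE_BASE + b)

        def not_utf8_decode(octets: bytes) -> str:
            out, i = [], 0
            while i < len(octets):
                ch = None
                for n in range(1, 5):            # a UTF-8 sequence is 1..4 octets long
                    try:
                        ch = octets[i:i + n].decode("utf-8")
                        break
                    except UnicodeDecodeError:
                        pass
                if ch is None:                   # invalid octet: keep it as an escape character
                    out.append(escape(octets[i]))
                    i += 1
                elif 0x80 <= ord(ch) - ESCAPE_BASE <= 0xFF:
                    # A real character from the escape range: escape its own
                    # UTF-8 octets so that encoding back stays unambiguous.
                    out.extend(escape(b) for b in ch.encode("utf-8"))
                    i += n
                else:                            # ordinary Unicode character
                    out.append(ch)
                    i += n
            return "".join(out)

        def not_utf8_encode(text: str) -> bytes:
            out = bytearray()
            for ch in text:
                if 0x80 <= ord(ch) - ESCAPE_BASE <= 0xFF:
                    out.append(ord(ch) - ESCAPE_BASE)   # escape character -> original octet
                else:
                    out += ch.encode("utf-8")
            return bytes(out)

        raw = b"ok \xc3\xa9, stray \x80\xfe octets"     # valid UTF-8 mixed with garbage
        assert not_utf8_encode(not_utf8_decode(raw)) == raw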

    >
    > OBSERVATION - All possible octet sequences are valid NOT-UTF-8.
    Yes, that's the sanity check, because this is what we wanted to get.

    >
    > OBSERVATION - NOT-Unicode characters which are Unicode
    > characters will be
    > encoded identically in UTF-8 and NOT-UTF-8
    Unfortunately not so, because you started from the wrong assumption that
    the NOT-UTF-8 data will not be stored in valid codepoints. But the fact that
    this observation does not hold is not really a problem.

    >
    > OBSERVATION - NOT-Unicode characters which are not Unicode
    > characters cannot
    > be represented in UTF-8
    They should be. Being able to pass the NOT-Unicode characters to UTF-16 is
    just the most difficult part. If you pass data to a UTF-16 application, you
    have no way of knowing whether it will choose to convert the data to UTF-32
    or UTF-8 for a certain portion of processing before returning the changed or
    unchanged result, again in UTF-16. NOT-Unicode characters must be
    representable in all UTF formats. Hence, they need to be valid Unicode
    codepoints.
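
    A small sketch of that scenario in Python: a hypothetical escape character
    (any valid scalar value; U+EE80 is used purely as a placeholder) handed over
    in UTF-16 survives an internal detour through UTF-32 and UTF-8 unchanged.
    An unpaired surrogate would be rejected at the very first hop, as shown
    earlier.

        # Hand data over in UTF-16, let the receiver detour through UTF-32 and
        # UTF-8, and take it back in UTF-16: a valid codepoint survives intact.
        escape_char = "\uEE80"                   # placeholder only, not an assigned escape
        utf16 = escape_char.encode("utf-16-le")
        utf32 = utf16.decode("utf-16-le").encode("utf-32-le")
        utf8  = utf32.decode("utf-32-le").encode("utf-8")
        back  = utf8.decode("utf-8").encode("utf-16-le")
        assert back == utf16                     # unchanged after the whole chain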

    >
    > DEFINITION - "NOT-UTF-16" is a bidirectional encoding between
    > a NOT-Unicode
    > character stream and a 16-bit word stream, defined as follows: if a
    > NOT-Unicode character is a Unicode character then its
    > encoding is the UTF-16
    > encoding of that character; else the NOT-Unicode character
    > must represent an
    > integer, in which case its encoding is 0xDC00 plus the
    > integer. To decode,
    > if the next 16-bit word is in the range 0xDC80 to 0xDCFF then the
    > NOT-Unicode character is the integer whose value is (word16 -
    > 0xDC00), else
    > the NOT-Unicode character is the Unicode character obtained
    > by decoding as
    > if UTF-16.
    I think this is what is called the UTF-8B conversion. It satisfies all the
    requirements except for the fact that it uses unpaired surrogates, which are
    not valid codepoints.

    >
    > OBSERVATION - Roundtripping is possible in the direction NOT-UTF-8 ->
    > NOT-UTF-16 -> NOT-UTF-8
    Yes, this is close to what we need. We need NOT-UTF-8 -> UTF-16 ->
    NOT-UTF-8. We just need to agree that instead of 0xDC00 some other range
    must be used.
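
    To make the quoted NOT-UTF-16 definition and its roundtrip concrete, here
    is a rough Python sketch of the NOT-UTF-8 -> NOT-UTF-16 -> NOT-UTF-8
    direction. It uses the 0xDC00-based range exactly as defined above; a
    scheme based on valid codepoints would differ only in the range chosen. The
    function names are made up for the sketch.

        # Sketch of NOT-UTF-16 as defined above: an octet stream becomes a
        # stream of 16-bit words; an octet that does not decode as UTF-8
        # becomes 0xDC00 + octet, i.e. a lone low surrogate 0xDC80..0xDCFF.

        def not_utf16_from_octets(octets: bytes) -> list[int]:
            words, i = [], 0
            while i < len(octets):
                ch = None
                for n in range(1, 5):            # try to decode one UTF-8 sequence
                    try:
                        ch = octets[i:i + n].decode("utf-8")
                        break
                    except UnicodeDecodeError:
                        pass
                if ch is None:
                    words.append(0xDC00 + octets[i])      # invalid octet -> 0xDC80..0xDCFF
                    i += 1
                else:
                    cu = ch.encode("utf-16-le")           # the character's UTF-16 code units
                    words += [int.from_bytes(cu[j:j + 2], "little")
                              for j in range(0, len(cu), 2)]
                    i += n
            return words

        def octets_from_not_utf16(words: list[int]) -> bytes:
            # Assumes input produced by not_utf16_from_octets.
            out, i = bytearray(), 0
            while i < len(words):
                w = words[i]
                if 0xDC80 <= w <= 0xDCFF:                 # escaped octet
                    out.append(w - 0xDC00)
                    i += 1
                elif 0xD800 <= w <= 0xDBFF:               # surrogate pair -> supplementary char
                    cp = 0x10000 + ((w - 0xD800) << 10) + (words[i + 1] - 0xDC00)
                    out += chr(cp).encode("utf-8")
                    i += 2
                else:                                     # BMP character
                    out += chr(w).encode("utf-8")
                    i += 1
            return bytes(out)

        raw = b"caf\xc3\xa9 plus stray \x80\xff octets"
        assert octets_from_not_utf16(not_utf16_from_octets(raw)) == raw

    For what it is worth, this octet <-> 0xDC00+octet mapping is essentially
    what Python 3 later shipped as its 'surrogateescape' error handler (PEP 383).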

    > This would appear to solve Lars' problem, and because the
    > three encodings,
    > NOT-UTF-8, NOT-UTF-16 and NOT-UTF-32, don't claim to be UTFs,
    > no-one need
    > get upset.
    >
    > I /think/ that will work.

    So, no, unfortunately it doesn't work. I proposed this solution two years
    ago. And it was also proposed many years ago by other people. It has two
    problems:

    1 - Using unpaired surrogates introduces a danger of corrupting the
    NOT-Unicode data: if an unvalidated UTF-16 string ends with an unpaired high
    surrogate and is concatenated with a NOT-UTF-16 string that begins with an
    unpaired low surrogate representing a NOT-Unicode character, the two code
    units merge into a well-formed surrogate pair and the escaped octet is lost.
    Choosing a valid codepoint sequence instead of unpaired low surrogates
    avoids that risk (no matter how unlikely it is).
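
    A minimal sketch of that failure mode in Python, with invented code unit
    values and plain concatenation of 16-bit code unit sequences:

        left = [0x0041, 0xD800]           # "A" plus a stray unpaired high surrogate
        right = [0xDC80, 0x0042]          # escaped octet 0x80 (NOT-UTF-16), then "B"

        joined = left + right             # naive 16-bit concatenation

        # Any later pass that reads the words as UTF-16 sees the well-formed
        # pair <D800 DC80>, i.e. the single character U+10080; the escaped
        # octet has silently merged into it and can no longer be recovered.
        as_bytes = b"".join(w.to_bytes(2, "little") for w in joined)
        assert as_bytes.decode("utf-16-le") == "A\U00010080B"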

    2 - If I wanted to use this approach, I would be limited to applications
    that adopt it, that is, applications that at least do not validate unpaired
    low surrogates. Currently, the Unicode Standard defines unpaired surrogates
    as invalid data. A Unicode compliant application may (not 'must', at least
    in my opinion) reject such data at any time. Changing such a fundamental
    directive is a problem on its own, and I cannot blame the UTC for not
    considering it, especially since, due to (1), it is not a good solution
    anyway. Even if it were considered and accepted, it would take ages before
    applications obeyed it. Until then, I could not use them. If an approach
    that uses valid codepoints is adopted, it can be used as soon as the
    codepoints are defined. No existing application needs to (nor should) change
    its behavior unless it starts using the new conversion itself; merely
    receiving UTF-8 data that was obtained via this conversion by some other
    application does not count.

    Lars


