RE: Roundtripping in Unicode

From: Arcane Jill (arcanejill@ramonsky.com)
Date: Tue Dec 14 2004 - 05:32:12 CST

  • Next message: Lars Kristan: "Validity and properties of U+FFFD (was RE: Roundtripping in Unicode)"

    I've been following this thread for a while, and I've pretty much got the
    hang of the issues here. To summarize:

    Unix filenames consist of an arbitrary sequence of octets, excluding 0x00
    and 0x2F. How they are /displayed/ to any given user depends on that user's
    locale setting. In this scenario, two users with different locale settings
    will see different filenames for the same file, but they will still be able
    to access the file via the filename that they see. These two filenames will
    be spelt identically in terms of octets, but (apparently) differently when
    viewed in terms of characters.

    At least, that's how it was until the UTF-8 locale came along. If we
    consider only one-byte-per-character encodings, then any octet sequence is
    "valid" in any locale. But UTF-8 introduces the possibility that an octet
    sequence might be "invalid" - a new concept for Unix. So if you change your
    locale to UTF-8, then suddenly, some files created by other users might
    appear to you to have invalid filenames (though they would still appear
    valid when viewed by the file's creator).

    A specific example: if a file F is accessed by two different users, A and B,
    of whom A has set their locale to Latin-1, and B has set their locale to
    UTF-8, then the filename may appear to be valid to user A, but invalid to
    user B.
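    To make that concrete, here is a quick sketch in Python (the filename
    octets are my own invented example):

```python
# A filename as raw octets: "café" encoded in Latin-1.
# 0xE9 is a perfectly good Latin-1 'é' but is not valid UTF-8.
name = b"caf\xe9"

# User A (Latin-1 locale) sees a valid name:
print(name.decode("latin-1"))        # café

# User B (UTF-8 locale) cannot even decode it:
try:
    name.decode("utf-8")
except UnicodeDecodeError as err:
    print("invalid in UTF-8:", err)
```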

    Lars is saying (and he's probably right, because he knows more about Unix
    than I) that user B does not necessarily have the right to change the actual
    octet sequence which is the filename of F, just to make it appear valid to
    user B, because doing so would stop a lot of things working for user A (for
    instance, A might have created the file, the filename might be hardcoded in
    a script, etc.). So Lars takes a Unix-like approach, saying "retain the
    actual octet sequence, but feel free to try to display and manipulate it as
    if it were some UTF-8-like encoding in which all octet sequences are valid".
    And all this seems to work fine for him, until he tries to roundtrip to
    UTF-16 and back.
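    The failure mode is easy to show: if the invalid octet is replaced by
    U+FFFD on the way to UTF-16, the original octet sequence is gone for
    good (a Python sketch, using the same invented filename):

```python
raw = b"caf\xe9"                      # the octets on disk
s = raw.decode("utf-8", "replace")    # lossy: 0xE9 becomes U+FFFD
assert s == "caf\ufffd"
# Re-encoding does not reproduce the original octets:
assert s.encode("utf-8") != raw       # U+FFFD encodes as EF BF BD
```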

    I'm not sure why anyone's arguing about this, though - Philippe's suggestion
    seems to be the perfect solution which keeps everyone happy. So...

    ...allow me to construct a specific example of what Philippe suggested only
    in general terms:

    DEFINITION - "NOT-Unicode" is the character repertoire consisting of the
    whole of Unicode, and 128 additional characters representing integers in the
    range 0x80 to 0xFF.

    OBSERVATION - Unicode is a subset of NOT-Unicode

    DEFINITION - "NOT-UTF-8" is a bidirectional encoding between a NOT-Unicode
    character stream and an octet stream, defined as follows: if a NOT-Unicode
    character is a Unicode character then its encoding is the UTF-8 encoding of
    that character; else the NOT-Unicode character must represent an integer, in
    which case its encoding is itself. To decode, assume the next NOT-Unicode
    character is a Unicode character and attempt to decode from the octet stream
    using UTF-8; if this fails then the NOT-Unicode character is an integer, in
    which case read one single octet from the stream and return it.

    OBSERVATION - All possible octet sequences are valid NOT-UTF-8.

    OBSERVATION - NOT-Unicode characters which are Unicode characters will be
    encoded identically in UTF-8 and NOT-UTF-8

    OBSERVATION - NOT-Unicode characters which are not Unicode characters cannot
    be represented in UTF-8
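    The NOT-UTF-8 definition can be sketched directly in Python. I represent
    a NOT-Unicode character as either a one-character string (a Unicode
    character) or an int in 0x80..0xFF (an integer character); the function
    names are mine:

```python
def not_utf8_decode(octets: bytes) -> list:
    """Octet stream -> NOT-Unicode characters (str or int)."""
    chars, i = [], 0
    while i < len(octets):
        # Attempt to read one UTF-8-encoded character (1 to 4 octets).
        for n in (1, 2, 3, 4):
            try:
                chars.append(octets[i:i + n].decode("utf-8"))
                i += n
                break
            except UnicodeDecodeError:
                continue
        else:
            # UTF-8 decoding failed: the character is the integer itself.
            chars.append(octets[i])
            i += 1
    return chars

def not_utf8_encode(chars) -> bytes:
    """NOT-Unicode characters -> octet stream."""
    out = bytearray()
    for c in chars:
        out += bytes([c]) if isinstance(c, int) else c.encode("utf-8")
    return bytes(out)
```

    With these, any octet sequence decodes without error, and the
    NOT-UTF-8 -> characters -> NOT-UTF-8 direction roundtrips exactly.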

    DEFINITION - "NOT-UTF-16" is a bidirectional encoding between a NOT-Unicode
    character stream and a 16-bit word stream, defined as follows: if a
    NOT-Unicode character is a Unicode character then its encoding is the UTF-16
    encoding of that character; else the NOT-Unicode character must represent an
    integer, in which case its encoding is 0xDC00 plus the integer. To decode,
    if the next 16-bit word is in the range 0xDC80 to 0xDCFF then the
    NOT-Unicode character is the integer whose value is (word16 - 0xDC00), else
    the NOT-Unicode character is the Unicode character obtained by decoding as
    if UTF-16.

    OBSERVATION - Roundtripping is possible in the direction NOT-UTF-8 ->
    NOT-UTF-16 -> NOT-UTF-8

    OBSERVATION - NOT-Unicode characters which are Unicode characters will be
    encoded identically in UTF-16 and NOT-UTF-16

    OBSERVATION - NOT-Unicode characters which are not Unicode characters cannot
    be represented in UTF-16
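    NOT-UTF-16 can be sketched the same way (16-bit words as Python ints;
    NOT-Unicode characters represented as in my NOT-UTF-8 sketch, with str
    items assumed to be single characters):

```python
def not_utf16_encode(chars) -> list:
    """NOT-Unicode characters -> 16-bit words."""
    words = []
    for c in chars:
        if isinstance(c, int):            # integer character 0x80..0xFF
            words.append(0xDC00 + c)      # lands in 0xDC80..0xDCFF
        else:
            cp = ord(c)
            if cp < 0x10000:
                words.append(cp)
            else:                         # supplementary: surrogate pair
                cp -= 0x10000
                words.append(0xD800 + (cp >> 10))
                words.append(0xDC00 + (cp & 0x3FF))
    return words

def not_utf16_decode(words) -> list:
    """16-bit words -> NOT-Unicode characters."""
    chars, i = [], 0
    while i < len(words):
        w = words[i]
        if 0xDC80 <= w <= 0xDCFF:         # integer character
            chars.append(w - 0xDC00)
            i += 1
        elif 0xD800 <= w <= 0xDBFF:       # high surrogate: pair follows
            lo = words[i + 1]
            chars.append(chr(0x10000 + ((w - 0xD800) << 10) + (lo - 0xDC00)))
            i += 2
        else:
            chars.append(chr(w))
            i += 1
    return chars
```

    Note that the test for an integer character (0xDC80..0xDCFF) is made
    before the ordinary surrogate-pair handling, which is what keeps the
    two cases disjoint.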

    DEFINITION - "NOT-UTF-32" is a bidirectional encoding between a NOT-Unicode
    character stream and a 32-bit word stream, defined as follows: if a
    NOT-Unicode character is a Unicode character then its encoding is the UTF-32
    encoding of that character; else the NOT-Unicode character must represent an
    integer, in which case its encoding is 0x0000DC00 plus the integer. To
    decode, if the next 32-bit word is in the range 0x0000DC80 to 0x0000DCFF
    then the NOT-Unicode character is the integer whose value is (word32 -
    0x0000DC00), else the NOT-Unicode character is the Unicode character
    obtained by decoding as if UTF-32.

    OBSERVATION - Roundtripping is possible in the directions NOT-UTF-8 ->
    NOT-UTF-32 -> NOT-UTF-8 and NOT-UTF-16 -> NOT-UTF-32 -> NOT-UTF-16

    OBSERVATION - NOT-Unicode characters which are Unicode characters will be
    encoded identically in UTF-32 and NOT-UTF-32

    OBSERVATION - NOT-Unicode characters which are not Unicode characters cannot
    be represented in UTF-32
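    NOT-UTF-32 is simpler still, since every NOT-Unicode character becomes
    exactly one 32-bit word (same representation as in my earlier sketches):

```python
def not_utf32_encode(chars) -> list:
    """NOT-Unicode characters -> 32-bit words."""
    return [0x0000DC00 + c if isinstance(c, int) else ord(c) for c in chars]

def not_utf32_decode(words) -> list:
    """32-bit words -> NOT-Unicode characters."""
    return [w - 0xDC00 if 0xDC80 <= w <= 0xDCFF else chr(w) for w in words]
```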

    This would appear to solve Lars' problem, and because the three encodings,
    NOT-UTF-8, NOT-UTF-16 and NOT-UTF-32, don't claim to be UTFs, no-one need
    get upset.
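    For what it's worth, the whole scheme can be exercised with Python's
    "surrogateescape" error handler, which maps invalid octets 0x80..0xFF
    to the lone surrogates U+DC80..U+DCFF on decode and back to the same
    octets on encode - precisely the NOT-UTF-8/NOT-UTF-16 mapping above
    (octets invented for the example):

```python
raw = b"caf\xe9 \xc3\xa9"            # mixed Latin-1 and UTF-8 octets
s = raw.decode("utf-8", "surrogateescape")
assert s == "caf\udce9 \xe9"         # 0xE9 -> U+DCE9; C3 A9 -> U+00E9
# The original octet sequence roundtrips exactly:
assert s.encode("utf-8", "surrogateescape") == raw
```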

    I /think/ that will work.
    Jill



    This archive was generated by hypermail 2.1.5 : Tue Dec 14 2004 - 05:37:10 CST