RE: Roundtripping in Unicode

From: Arcane Jill (arcanejill@ramonsky.com)
Date: Tue Dec 14 2004 - 05:32:12 CST

  • Next message: Lars Kristan: "Validity and properties of U+FFFD (was RE: Roundtripping in Unicode)"

    I've been following this thread for a while, and I've pretty much got the
    hang of the issues here. To summarize:

    Unix filenames consist of an arbitrary sequence of octets, excluding 0x00
    and 0x2F. How they are /displayed/ to any given user depends on that user's
    locale setting. In this scenario, two users with different locale settings
    will see different filenames for the same file, but they will still be able
    to access the file via the filename that they see. These two filenames will
    be spelt identically in terms of octets, but (apparently) differently when
    viewed in terms of characters.

    At least, that's how it was until the UTF-8 locale came along. If we
    consider only one-byte-per-character encodings, then any octet sequence is
    "valid" in any locale. But UTF-8 introduces the possibility that an octet
    sequence might be "invalid" - a new concept for Unix. So if you change your
    locale to UTF-8, then suddenly, some files created by other users might
    appear to you to have invalid filenames (though they would still appear
    valid when viewed by the file's creator).

    A specific example: if a file F is accessed by two different users, A and B,
    of whom A has set their locale to Latin-1, and B has set their locale to
    UTF-8, then the filename may appear to be valid to user A, but invalid to
    user B.
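    To make that concrete, here is a quick sketch in Python (the filename
    octets are my own invented example):

```python
# A filename as raw octets: "café" encoded in Latin-1.
# 0xE9 is a perfectly good Latin-1 'é' but is not valid UTF-8.
name = b"caf\xe9"

# User A (Latin-1 locale) sees a valid name:
print(name.decode("latin-1"))        # café

# User B (UTF-8 locale) cannot even decode it:
try:
    name.decode("utf-8")
except UnicodeDecodeError as err:
    print("invalid in UTF-8:", err)
```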

    Lars is saying (and he's probably right, because he knows more about Unix
    than I) that user B does not necessarily have the right to change the actual
    octet sequence which is the filename of F, just to make it appear valid to
    user B, because doing so would stop a lot of things working for user A (for
    instance, A might have created the file, the filename might be hardcoded in
    a script, etc.). So Lars takes a Unix-like approach, saying "retain the
    actual octet sequence, but feel free to try to display and manipulate it as
    if it were some UTF-8-like encoding in which all octet sequences are valid".
    And all this seems to work fine for him, until he tries to roundtrip to
    UTF-16 and back.
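    The failure mode is easy to show: if the invalid octet is replaced by
    U+FFFD on the way to UTF-16, the original octet sequence is gone for
    good (a Python sketch, using the same invented filename):

```python
raw = b"caf\xe9"                      # the octets on disk
s = raw.decode("utf-8", "replace")    # lossy: 0xE9 becomes U+FFFD
assert s == "caf\ufffd"
# Re-encoding does not reproduce the original octets:
assert s.encode("utf-8") != raw       # U+FFFD encodes as EF BF BD
```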

    I'm not sure why anyone's arguing about this, though - Philippe's suggestion
    seems to be the perfect solution which keeps everyone happy. So...

    ...allow me to construct a specific example of what Philippe suggested only
    in general terms:

    DEFINITION - "NOT-Unicode" is the character repertoire consisting of the
    whole of Unicode, and 128 additional characters representing integers in the
    range 0x80 to 0xFF.

    OBSERVATION - Unicode is a subset of NOT-Unicode

    DEFINITION - "NOT-UTF-8" is a bidirectional encoding between a NOT-Unicode
    character stream and an octet stream, defined as follows: if a NOT-Unicode
    character is a Unicode character then its encoding is the UTF-8 encoding of
    that character; else the NOT-Unicode character must represent an integer, in
    which case its encoding is itself. To decode, assume the next NOT-Unicode
    character is a Unicode character and attempt to decode from the octet stream
    using UTF-8; if this fails then the NOT-Unicode character is an integer, in
    which case read one single octet from the stream and return it.

    OBSERVATION - All possible octet sequences are valid NOT-UTF-8.

    OBSERVATION - NOT-Unicode characters which are Unicode characters will be
    encoded identically in UTF-8 and NOT-UTF-8

    OBSERVATION - NOT-Unicode characters which are not Unicode characters cannot
    be represented in UTF-8
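    The NOT-UTF-8 definition can be sketched directly in Python. I represent
    a NOT-Unicode character as either a one-character string (a Unicode
    character) or an int in 0x80..0xFF (an integer character); the function
    names are mine:

```python
def not_utf8_decode(octets: bytes) -> list:
    """Octet stream -> NOT-Unicode characters (str or int)."""
    chars, i = [], 0
    while i < len(octets):
        # Attempt to read one UTF-8-encoded character (1 to 4 octets).
        for n in (1, 2, 3, 4):
            try:
                chars.append(octets[i:i + n].decode("utf-8"))
                i += n
                break
            except UnicodeDecodeError:
                continue
        else:
            # UTF-8 decoding failed: the character is the integer itself.
            chars.append(octets[i])
            i += 1
    return chars

def not_utf8_encode(chars) -> bytes:
    """NOT-Unicode characters -> octet stream."""
    out = bytearray()
    for c in chars:
        out += bytes([c]) if isinstance(c, int) else c.encode("utf-8")
    return bytes(out)
```

    With these, any octet sequence decodes without error, and the
    NOT-UTF-8 -> characters -> NOT-UTF-8 direction roundtrips exactly.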

    DEFINITION - "NOT-UTF-16" is a bidirectional encoding between a NOT-Unicode
    character stream and a 16-bit word stream, defined as follows: if a
    NOT-Unicode character is a Unicode character then its encoding is the UTF-16
    encoding of that character; else the NOT-Unicode character must represent an
    integer, in which case its encoding is 0xDC00 plus the integer. To decode,
    if the next 16-bit word is in the range 0xDC80 to 0xDCFF then the
    NOT-Unicode character is the integer whose value is (word16 - 0xDC00), else
    the NOT-Unicode character is the Unicode character obtained by decoding as
    if UTF-16.

    OBSERVATION - Roundtripping is possible in the direction NOT-UTF-8 ->
    NOT-UTF-16 -> NOT-UTF-8

    OBSERVATION - NOT-Unicode characters which are Unicode characters will be
    encoded identically in UTF-16 and NOT-UTF-16

    OBSERVATION - NOT-Unicode characters which are not Unicode characters cannot
    be represented in UTF-16
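    NOT-UTF-16 can be sketched the same way (16-bit words as Python ints;
    NOT-Unicode characters represented as in my NOT-UTF-8 sketch, with str
    items assumed to be single characters):

```python
def not_utf16_encode(chars) -> list:
    """NOT-Unicode characters -> 16-bit words."""
    words = []
    for c in chars:
        if isinstance(c, int):            # integer character 0x80..0xFF
            words.append(0xDC00 + c)      # lands in 0xDC80..0xDCFF
        else:
            cp = ord(c)
            if cp < 0x10000:
                words.append(cp)
            else:                         # supplementary: surrogate pair
                cp -= 0x10000
                words.append(0xD800 + (cp >> 10))
                words.append(0xDC00 + (cp & 0x3FF))
    return words

def not_utf16_decode(words) -> list:
    """16-bit words -> NOT-Unicode characters."""
    chars, i = [], 0
    while i < len(words):
        w = words[i]
        if 0xDC80 <= w <= 0xDCFF:         # integer character
            chars.append(w - 0xDC00)
            i += 1
        elif 0xD800 <= w <= 0xDBFF:       # high surrogate: pair follows
            lo = words[i + 1]
            chars.append(chr(0x10000 + ((w - 0xD800) << 10) + (lo - 0xDC00)))
            i += 2
        else:
            chars.append(chr(w))
            i += 1
    return chars
```

    Note that the test for an integer character (0xDC80..0xDCFF) is made
    before the ordinary surrogate-pair handling, which is what keeps the
    two cases disjoint.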

    DEFINITION - "NOT-UTF-32" is a bidirectional encoding between a NOT-Unicode
    character stream and a 32-bit word stream, defined as follows: if a
    NOT-Unicode character is a Unicode character then its encoding is the UTF-32
    encoding of that character; else the NOT-Unicode character must represent an
    integer, in which case its encoding is 0x0000DC00 plus the integer. To
    decode, if the next 32-bit word is in the range 0x0000DC80 to 0x0000DCFF
    then the NOT-Unicode character is the integer whose value is (word32 -
    0x0000DC00), else the NOT-Unicode character is the Unicode character
    obtained by decoding as if UTF-32.

    OBSERVATION - Roundtripping is possible in the directions NOT-UTF-8 ->
    NOT-UTF-32 -> NOT-UTF-8 and NOT-UTF-16 -> NOT-UTF-32 -> NOT-UTF-16

    OBSERVATION - NOT-Unicode characters which are Unicode characters will be
    encoded identically in UTF-32 and NOT-UTF-32

    OBSERVATION - NOT-Unicode characters which are not Unicode characters cannot
    be represented in UTF-32
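    NOT-UTF-32 is simpler still, since every NOT-Unicode character becomes
    exactly one 32-bit word (same representation as in my earlier sketches):

```python
def not_utf32_encode(chars) -> list:
    """NOT-Unicode characters -> 32-bit words."""
    return [0x0000DC00 + c if isinstance(c, int) else ord(c) for c in chars]

def not_utf32_decode(words) -> list:
    """32-bit words -> NOT-Unicode characters."""
    return [w - 0xDC00 if 0xDC80 <= w <= 0xDCFF else chr(w) for w in words]
```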

    This would appear to solve Lars' problem, and because the three encodings,
    NOT-UTF-8, NOT-UTF-16 and NOT-UTF-32, don't claim to be UTFs, no-one need
    get upset.
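    For what it's worth, the whole scheme can be exercised with Python's
    "surrogateescape" error handler, which maps invalid octets 0x80..0xFF
    to the lone surrogates U+DC80..U+DCFF on decode and back to the same
    octets on encode - precisely the NOT-UTF-8/NOT-UTF-16 mapping above
    (octets invented for the example):

```python
raw = b"caf\xe9 \xc3\xa9"            # mixed Latin-1 and UTF-8 octets
s = raw.decode("utf-8", "surrogateescape")
assert s == "caf\udce9 \xe9"         # 0xE9 -> U+DCE9; C3 A9 -> U+00E9
# The original octet sequence roundtrips exactly:
assert s.encode("utf-8", "surrogateescape") == raw
```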

    I /think/ that will work.
    Jill



    This archive was generated by hypermail 2.1.5 : Tue Dec 14 2004 - 05:37:10 CST