From: Lars Kristan (lars.kristan@hermes.si)
Date: Tue Dec 14 2004 - 08:38:55 CST
Arcane Jill wrote:
> I've been following this thread for a while, and I've pretty
Thanks for bearing with me. And I hope my response will not discourage you
from continuing to do so. That is, until I am banned from the list for
heresy.
> much got the
> hang of the issues here. To summarize:
>
> Unix filenames consist of an arbitrary sequence of octets,
> excluding 0x00
> and 0x2F. How they are /displayed/ to any given user depends
> on that user's
> locale setting. In this scenario, two users with different
> locale settings
> will see different filenames for the same file, but they will
> still be able
> to access the file via the filename that they see. These two
> filenames will
> be spelt identically in terms of octets, but (apparently)
> differently when
> viewed in terms of characters.
>
> At least, that's how it was until the UTF-8 locale came along. If we
I think such problems were already present with Shift-JIS. But I have already
explained once why this was not noticed, and I will not repeat myself unless
explicitly asked to do so.
> consider only one-byte-per-character encodings, then any
> octet sequence is
> "valid" in any locale. But UTF-8 introduces the possibility
> that an octet
> sequence might be "invalid" - a new concept for Unix. So if
> you change your
> locale to UTF-8, then suddenly, some files created by other
> users might
> appear to you to have invalid filenames (though they would
> still appear
> valid when viewed by the file's creator).
>
> A specific example: if a file F is accessed by two different
> users, A and B,
> of whom A has set their locale to Latin-1, and B has set
> their locale to
> UTF-8, then the filename may appear to be valid to user A,
> but invalid to
> user B.
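The A/B scenario above can be reproduced in a few lines. This is only a sketch: the filename "café" and its octets are made up for illustration, and Python's codecs stand in for the two users' locales.

```python
# The filename as stored on disk: "café" as user A typed it under Latin-1.
# (Illustrative octets only; any octet >= 0x80 that is not well-formed
# UTF-8 would show the same effect.)
name = b"caf\xe9"

# User A (Latin-1): every octet maps to a character, so decoding cannot fail.
as_latin1 = name.decode("latin-1")

# User B (UTF-8): 0xE9 is a lead byte whose continuation bytes never arrive,
# so the same octets are "invalid" for this user.
try:
    name.decode("utf-8")
    valid_for_b = True
except UnicodeDecodeError:
    valid_for_b = False

print(ascii(as_latin1), valid_for_b)   # 'caf\xe9' False
```

The octets on disk are identical in both cases; only the interpretation differs.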
>
> Lars is saying (and he's probably right, because he knows
> more about Unix
> than I) that user B does not necessarily have the right to
> change the actual
> octet sequence which is the filename of F, just to make it
> appear valid to
> user B, because doing so would stop a lot of things working
> for user A (for
> instance, A might have created the file, the filename might
> be hardcoded in
> a script, etc.). So Lars takes a Unix-like approach, saying
> "retain the
> actual octet sequence, but feel free to try to display and
> manipulate it as
> if it were some UTF-8-like encoding in which all octet
> sequences are valid".
> And all this seems to work fine for him, until he tries to
> roundtrip to
> UTF-16 and back.
>
> I'm not sure why anyone's arguing about this though -
> Phillipe's suggestion
> seems to be the perfect solution which keeps everyone happy. So...
Well, it doesn't. The rest of my comments will show you why.
>
> ...allow me to construct a specific example of what Phillipe
> suggested only
> generally:
>
> DEFINITION - "NOT-Unicode" is the character repertoire
> consisting of the
> whole of Unicode, and 128 additional characters representing
> integers in the
> range 0x80 to 0xFF.
As long as we agree that the codepoints used to store the NOT-Unicode data
are valid unicode codepoints. You noticed yourself that NOT-Unicode should
roundtrip through UTF-16. Only valid Unicode codepoints can be safely passed
through UTF-16.
>
> OBSERVATION - Unicode is a subset of NOT-Unicode
But unfortunately data can pass from NOT-Unicode to Unicode. Some people
think that this is terribly bad. One would think that storing NOT-UTF-8
in NOT-UTF-16 would prevent data from crossing the boundary, but that is not
so.
>
> DEFINITION - "NOT-UTF-8" is a bidirectional encoding between
> a NOT-Unicode
> character stream and an octet stream, defined as follows: if
> a NOT-Unicode
> character is a Unicode character then its encoding is the
> UTF-8 encoding of
> that character; else the NOT-Unicode character must represent
> an integer, in
> which case its encoding is itself. To decode, assume the next
> NOT-Unicode
> character is a Unicode character and attempt to decode from
> the octet stream
> using UTF-8; if this fails then the NOT-Unicode character is
> an integer, in
> which case read one single octet from the stream and return it.
More or less. You have not defined how to return the octet. It must be
returned as a valid Unicode codepoint. And if a Unicode character is
decoded, one must check whether it is one of the codepoints used for this
purpose and, if so, escape it. But only when decoding NOT-UTF-8. Decoding
from UTF-8 remains unchanged.
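A minimal sketch of the decoding rule just described, in Python. The escape range U+F780..U+F7FF (inside the Private Use Area) is purely an assumption for illustration; no particular range is named in this message. The function names are hypothetical too.

```python
# ASSUMPTION: octet b (0x80..0xFF) is escaped as codepoint ESC_BASE + b.
# U+F780..U+F7FF is a made-up choice, not a proposed assignment.
ESC_BASE = 0xF700
ESC_FIRST, ESC_LAST = ESC_BASE + 0x80, ESC_BASE + 0xFF


def decode_not_utf8(data: bytes) -> str:
    """Decode any octet sequence into valid Unicode codepoints only."""
    out = []
    # surrogateescape is used only as a convenient way to find the octets
    # that are not well-formed UTF-8; they come back as U+DC80..U+DCFF.
    for ch in data.decode("utf-8", "surrogateescape"):
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:
            # Invalid octet: return it as a valid escape codepoint.
            out.append(chr(ESC_BASE + (cp - 0xDC00)))
        elif ESC_FIRST <= cp <= ESC_LAST:
            # A genuine character from the escape range: escape each of
            # its UTF-8 octets so that the original octets can be restored.
            out.extend(chr(ESC_BASE + b) for b in ch.encode("utf-8"))
        else:
            out.append(ch)
    return "".join(out)


def encode_not_utf8(text: str) -> bytes:
    """Inverse direction: escape codepoints become single octets again."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if ESC_FIRST <= cp <= ESC_LAST:
            out.append(cp - ESC_BASE)
        else:
            out.extend(ch.encode("utf-8"))
    return bytes(out)


raw = b"caf\xe9 \xef\x9e\x80 caf\xc3\xa9"   # invalid octet, U+F780, valid "é"
text = decode_not_utf8(raw)
assert encode_not_utf8(text) == raw                     # octets round-trip
assert text.encode("utf-16-le").decode("utf-16-le") == text  # strict UTF-16 is fine
```

The round trip is guaranteed only in the direction octets -> string -> octets. A string that already contains escape codepoints spelling out well-formed UTF-8 will not come back unchanged, which is why the escaping applies only when decoding NOT-UTF-8.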
>
> OBSERVATION - All possible octet sequences are valid NOT-UTF-8.
Yes, that's the sanity check, because this is what we wanted to get.
>
> OBSERVATION - NOT-Unicode characters which are Unicode
> characters will be
> encoded identically in UTF-8 and NOT-UTF-8
Unfortunately not so, because you started with the wrong assumption that
NOT-UTF-8 data will not be stored in valid codepoints. But the fact that
this observation is not true is not really a problem.
>
> OBSERVATION - NOT-Unicode characters which are not Unicode
> characters cannot
> be represented in UTF-8
They should be. Being able to pass the NOT-Unicode characters to UTF-16 is
just the most difficult part. If you pass data to a UTF-16 application, you
have no way of knowing whether it will choose to convert the data to UTF-32
or UTF-8 for a certain portion of processing before returning the changed or
unchanged result, again in UTF-16. NOT-Unicode characters must be
representable in all UTF formats. Hence, they need to be valid Unicode
codepoints.
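To illustrate the point, here is a small sketch that pushes a string through UTF-16, UTF-32 and UTF-8 using strict codecs, the way an unknown application might internally. The escape codepoint U+F7E9 is a hypothetical choice, and `utf_chain` is just an illustrative helper.

```python
def utf_chain(text: str) -> str:
    """Pass text through UTF-16 -> UTF-32 -> UTF-8 -> UTF-16, strictly."""
    for codec in ("utf-16-le", "utf-32-le", "utf-8", "utf-16-le"):
        text = text.encode(codec).decode(codec)
    return text


valid_escape = "caf\uf7e9"     # octet 0xE9 stored in a valid codepoint (assumed range)
lone_surrogate = "caf\udce9"   # octet 0xE9 stored as an unpaired low surrogate

survived = utf_chain(valid_escape) == valid_escape

try:
    utf_chain(lone_surrogate)
    surrogate_survived = True
except UnicodeEncodeError:
    # A strict converter is entitled to refuse the unpaired surrogate.
    surrogate_survived = False

print(survived, surrogate_survived)   # True False
```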
>
> DEFINITION - "NOT-UTF-16" is a bidirectional encoding between
> a NOT-Unicode
> character stream and a 16-bit word stream, defined as follows: if a
> NOT-Unicode character is a Unicode character then its
> encoding is the UTF-16
> encoding of that character; else the NOT-Unicode character
> must represent an
> integer, in which case its encoding is 0xDC00 plus the
> integer. To decode,
> if the next 16-bit word is in the range 0xDC80 to 0xDCFF then the
> NOT-Unicode character is the integer whose value is (word16 -
> 0xDC00), else
> the NOT-Unicode character is the Unicode character obtained
> by decoding as
> if UTF-16.
I think this is called UTF-8B conversion. It satisfies all the requirements
except for the fact that it uses unpaired surrogates, which are not valid
codepoints.
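For readers who want to try the mapping: Python's `surrogateescape` error handler happens to use the same arithmetic as the quoted definition (octet b becomes U+DC00 + b), so it can serve as a quick sketch of the round trip. The filename is made up.

```python
raw = b"caf\xe9.txt"           # not well-formed UTF-8

# Undecodable octet 0xE9 becomes the unpaired low surrogate U+DCE9.
text = raw.decode("utf-8", errors="surrogateescape")
print(ascii(text))             # 'caf\udce9.txt'

# Encoding with the same handler restores the original octets.
restored = text.encode("utf-8", errors="surrogateescape")
print(restored == raw)         # True
```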
>
> OBSERVATION - Roundtripping is possible in the direction NOT-UTF-8 ->
> NOT-UTF-16 -> NOT-UTF-8
Yes, this is close to what we need. We need NOT-UTF-8 -> UTF-16 ->
NOT-UTF-8. We just need to agree that instead of 0xDC00 some other range
must be used.
> This would appear to solve Lars' problem, and because the
> three encodings,
> NOT-UTF-8, NOT-UTF-16 and NOT-UTF-32, don't claim to be UTFs,
> no-one need
> get upset.
>
> I /think/ that will work.
So, no, unfortunately it doesn't work. I proposed this solution two years
ago. And it was also proposed many years ago by other people. It has two
problems:
1 - Using unpaired surrogates introduces a danger of corrupting the
NOT-Unicode data. This happens when an unvalidated UTF-16 string that ends
with an unpaired high surrogate is concatenated with a NOT-UTF-16 string
that begins with an unpaired low surrogate representing a NOT-Unicode
character: the two combine into a valid surrogate pair. Choosing a valid
codepoint sequence instead of unpaired low surrogates avoids that risk (no
matter how unlikely it is).
2 - If I wanted to use this approach, I would be limited to applications
that adopt it, that is, applications that at least do not reject unpaired
low surrogates. Currently, the Unicode standard defines unpaired surrogates
as invalid data. A Unicode-compliant application may (not 'must', at least
in my opinion) reject such data at any time. Changing such a fundamental
directive is a problem of its own, and I cannot blame the UTC for not
considering it, especially since, due to (1), it is not a good solution
anyway. Even if it were considered and accepted, it would take ages before
applications obeyed it. Until then, I cannot use them. If an approach that
uses valid codepoints is adopted, it can be used as soon as the codepoints
are defined. No existing application needs to (nor should) change its
behavior unless it starts using the new conversion itself. In particular, it
does not need to change if it simply receives UTF-8 data that was obtained
via this conversion by some other application.
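Both problems can be seen with a few 16-bit code units. This is only a sketch: the values are made up, and `units_to_text` is a hypothetical helper that decodes leniently so the unpaired surrogates can be inspected at all.

```python
import struct


def units_to_text(units):
    """Decode a list of 16-bit code units, tolerating unpaired surrogates."""
    data = struct.pack("<%dH" % len(units), *units)
    return data.decode("utf-16-le", errors="surrogatepass")


left = [0x0041, 0xD800]    # "A" + stray high surrogate (unvalidated UTF-16)
right = [0xDC80, 0x0042]   # escaped octet 0x80 + "B" (NOT-UTF-16)

# Problem 1: after concatenation the two surrogates form a valid pair.
joined = units_to_text(left + right)
print(ascii(joined))       # 'A\U00010080B'

# The escaped octet can no longer be found in the result.
escaped = [ord(c) - 0xDC00 for c in joined if 0xDC80 <= ord(c) <= 0xDCFF]
print(escaped)             # []

# Problem 2: a strict UTF-16 decoder may reject the unpaired low surrogate.
try:
    struct.pack("<2H", *right).decode("utf-16-le")
    accepted = True
except UnicodeDecodeError:
    accepted = False
print(accepted)            # False
```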
Lars