From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Tue Dec 14 2004 - 16:38:13 CST
From: "Marcin 'Qrczak' Kowalczyk" <qrczak@knm.org.pl>
> Lars Kristan <lars.kristan@hermes.si> writes:
>
>> Hmmmmm, here lies the catch. According to UTC, you need to keep
>> processing the UNIX filenames as BINARY data. And, also according
>> to UTC, any UTF-8 function is allowed to reject invalid sequences.
>> Basically, you are not supposed to use strcpy to process filenames.
>
> No: strcpy passes raw bytes, it does not interpret them according to
> the locale. It's not "an UTF-8 function".
Correct: strcpy() and wcscpy() handle "string" instances, but not all string
instances are plain text, so they need not obey UTF encoding rules
(they only obey the convention of null termination, with no
restriction on the string length, which is measured in char or wchar_t code
units, not as a number of Unicode characters).
This is true for the whole standard C/C++ string library, as well as in
Java (String and Character objects, or the "native" char datatype), and in
almost all string-handling libraries of common programming languages.
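To illustrate the point about strcpy(), here is a minimal sketch in C. The names copy_filename() and is_valid_utf8() are hypothetical helpers written for this example, not library functions; the validity check follows the well-formed byte ranges of UTF-8 as I understand them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: true if the NUL-terminated byte string s is
   well-formed UTF-8 (no overlong forms, no surrogates, nothing above
   U+10FFFF). */
static bool is_valid_utf8(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;

    while (*p) {
        unsigned char b = *p++;
        size_t n;                           /* continuation bytes expected */
        unsigned char lo = 0x80, hi = 0xBF; /* range of first continuation */

        if (b < 0x80) {
            continue;                       /* ASCII */
        } else if (b >= 0xC2 && b <= 0xDF) {
            n = 1;
        } else if (b >= 0xE0 && b <= 0xEF) {
            n = 2;
            if (b == 0xE0) lo = 0xA0;       /* reject overlong forms */
            if (b == 0xED) hi = 0x9F;       /* reject surrogates */
        } else if (b >= 0xF0 && b <= 0xF4) {
            n = 3;
            if (b == 0xF0) lo = 0x90;       /* reject overlong forms */
            if (b == 0xF4) hi = 0x8F;       /* reject values > U+10FFFF */
        } else {
            return false;                   /* stray continuation or bad lead */
        }

        for (size_t i = 0; i < n; i++, p++) {
            /* A terminating NUL also fails this test, so we never overrun. */
            if (*p < lo || *p > hi)
                return false;
            lo = 0x80;
            hi = 0xBF;
        }
    }
    return true;
}

/* Hypothetical helper: copies a filename with strcpy(). The bytes are
   passed through untouched; no encoding is ever consulted. */
static void copy_filename(char *dst, size_t dst_size, const char *src)
{
    assert(strlen(src) < dst_size);
    strcpy(dst, src);
}
```

Calling copy_filename() on a Latin-1 name such as "caf\xE9.txt" yields a byte-identical copy, even though is_valid_utf8() rejects those same bytes. Only the second function is "a UTF-8 function".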
A "locale" defined as "UTF-8" will run into lots of problems because of
the various ways applications behave when faced with encoding "errors"
encountered in filenames: exceptions thrown that abort the program,
substitution by "?" or U+FFFD causing the wrong files to be accessed, or some
files not being processed because their name was considered "invalid" although
they were actually created by a user of another locale...
Filenames are identifiers coded as strings, not as plain-text (even if most
of these filename strings are plain-text).
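The "wrong files" risk from substitution can be sketched in a few lines of C. This is only an illustration under a simplifying assumption: the hypothetical substitute_non_ascii() below treats every non-ASCII byte as an error and replaces it with "?", which is cruder than what a real converter would do, but the collision it shows is the same.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical lossy conversion: every byte outside ASCII becomes '?'.
   The terminating NUL is copied as well (note the <= in the loop). */
static void substitute_non_ascii(char *dst, size_t dst_size, const char *src)
{
    size_t len = strlen(src);

    assert(len < dst_size);
    for (size_t i = 0; i <= len; i++) {
        unsigned char b = (unsigned char)src[i];
        dst[i] = (b < 0x80) ? (char)b : '?';
    }
}

/* Returns nonzero if two distinct filenames become identical once the
   substitution has been applied, i.e. the identifier is no longer unique. */
static int collide_after_substitution(const char *a, const char *b)
{
    char sa[256], sb[256];

    substitute_non_ascii(sa, sizeof sa, a);
    substitute_non_ascii(sb, sizeof sb, b);
    return strcmp(a, b) != 0 && strcmp(sa, sb) == 0;
}
```

For instance, the two distinct names "file\xE9" and "file\xE8" both become "file?", so a program that opens the substituted name can no longer tell which file was meant.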
The solution is then to use a locale based on a "relaxed version of UTF-8"
(some spoke about defining "NOT-UTF-8" and "NOT-UTF-16" encodings to allow
any sequence of code units, but nobody has thought about how to make
"NOT-UTF-8" and "NOT-UTF-16" fully and mutually reversible; now add
"NOT-UTF-32" to this nightmare and you will see that "NOT-UTF-32" needs to
encode 2^32 distinct NOT-Unicode code points, and that they must map
bijectively to exactly all 2^32 sequences possible in NOT-UTF-16 and
NOT-UTF-8; I have not found a solution to this problem, and I don't know
whether such a solution even exists; if it does, it is likely to be quite
complex...).