Re: Roundtripping in Unicode

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Tue Dec 14 2004 - 16:38:13 CST


    From: "Marcin 'Qrczak' Kowalczyk" <qrczak@knm.org.pl>
    > Lars Kristan <lars.kristan@hermes.si> writes:
    >
    >> Hmmmmm, here lies the catch. According to UTC, you need to keep
    >> processing the UNIX filenames as BINARY data. And, also according
    >> to UTC, any UTF-8 function is allowed to reject invalid sequences.
    >> Basically, you are not supposed to use strcpy to process filenames.
    >
    > No: strcpy passes raw bytes, it does not interpret them according to
    > the locale. It's not "an UTF-8 function".

    Correct: strcpy() and wcscpy() handle "string" instances, but not all
    string instances are plain text, so they need not obey the UTF encoding
    rules (they only obey the convention of null termination, with no
    restriction on the string length, which is measured in char or wchar_t
    code units, not in Unicode characters).
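
    As a rough illustration of the difference between copying code units and
    interpreting them, here is a minimal Python sketch. The filename and the
    helper is_valid_utf8 are hypothetical, and a Python bytes copy merely
    stands in for what strcpy() does in C:

```python
# Hypothetical filename written under an ISO-8859-1 locale ("café.txt").
# The lone 0xE9 byte does not form a valid UTF-8 sequence.
raw_name = b"caf\xe9.txt"

# A plain byte copy, like strcpy() in C, never inspects the encoding.
copied = bytes(bytearray(raw_name))


def is_valid_utf8(data: bytes) -> bool:
    """Return True only if data decodes as strict UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


copy_is_exact = copied == raw_name    # the copy preserves every byte
copy_is_utf8 = is_valid_utf8(copied)  # yet the bytes are not valid UTF-8
print(copy_is_exact, copy_is_utf8)    # prints: True False
```

    The copy succeeds precisely because it treats the name as an opaque
    sequence of code units; only the decoding step can reject it.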

    This is true for the whole standard C/C++ string libraries, as well as in
    Java (String and Character objects or the primitive char datatype), and in
    almost all string handling libraries of common programming languages.

    A locale defined as "UTF-8" will experience many problems because of
    the various ways applications behave when faced with encoding "errors"
    in filenames: exceptions thrown that abort the program, substitution by
    "?" or U+FFFD that causes the wrong files to be accessed, or files left
    unprocessed because their names were considered "invalid" although they
    were actually created by some user of another locale...
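
    The failure modes listed above can be sketched in a few lines of Python.
    The two filenames and the helper names are made up for the example, and
    Python's "replace" error handler is only one possible substitution policy:

```python
# Two distinct hypothetical filenames created under an ISO-8859-1 locale.
name_a = b"r\xe9sum\xe9.txt"
name_b = b"r\xe8sum\xe8.txt"


def lossy_decode(name: bytes) -> str:
    """Decode as UTF-8, substituting U+FFFD for each invalid byte."""
    return name.decode("utf-8", errors="replace")


def strict_decode_fails(name: bytes) -> bool:
    """Return True if strict UTF-8 decoding raises an exception."""
    try:
        name.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


shown_a = lossy_decode(name_a)
shown_b = lossy_decode(name_b)

# Substitution makes two different files indistinguishable...
collide = shown_a == shown_b
# ...and the substituted name no longer identifies the original file.
roundtrip_ok = shown_a.encode("utf-8") == name_a
# A strict decoder instead rejects the name outright.
rejected = strict_decode_fails(name_a)

print(collide, roundtrip_ok, rejected)  # prints: True False True
```

    Whichever policy an application picks, the name it ends up holding is no
    longer a reliable identifier for the file on disk.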

    Filenames are identifiers coded as strings, not as plain-text (even if most
    of these filename strings are plain-text).

    The solution is then to use a locale based on a "relaxed version of UTF-8"
    (some spoke about defining "NOT-UTF-8" and "NOT-UTF-16" encodings to allow
    any sequence of code units, but nobody has thought about how to make
    "NOT-UTF-8" and "NOT-UTF-16" mutually fully reversible; now add "NOT-UTF-32"
    to this nightmare and you will see that "NOT-UTF-32" needs to encode 2^32
    distinct NOT-Unicode code points, and that they must map bijectively to
    exactly all 2^32 sequences possible in NOT-UTF-16 and NOT-UTF-8; I have not
    found a solution to this problem, and I don't know whether such a solution
    even exists; if it does, it should be quite complex...).
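
    To make the reversibility problem concrete, here is a small Python sketch
    of one relaxed decoding. It uses Python's "surrogateescape" error handler,
    which is not the scheme discussed in this thread, only an existing
    mechanism that happens to show both sides of the issue; relaxed_decode and
    relaxed_encode are hypothetical wrapper names:

```python
def relaxed_decode(raw: bytes) -> str:
    """Decode UTF-8, mapping each invalid byte 0xNN to the lone surrogate U+DCNN."""
    return raw.decode("utf-8", errors="surrogateescape")


def relaxed_encode(text: str) -> bytes:
    """Encode UTF-8, mapping each lone surrogate U+DCNN back to the byte 0xNN."""
    return text.encode("utf-8", errors="surrogateescape")


# Direction 1: arbitrary bytes -> string -> bytes is lossless.
raw_name = b"caf\xe9.txt"  # hypothetical ISO-8859-1 filename
text = relaxed_decode(raw_name)
bytes_roundtrip = relaxed_encode(text) == raw_name

# Direction 2: arbitrary string -> bytes -> string is NOT lossless.
# These two escape code points encode to the bytes C3 A9, which together
# form the valid UTF-8 sequence for U+00E9.
odd_text = "\udcc3\udca9"
back = relaxed_decode(relaxed_encode(odd_text))
text_roundtrip = back == odd_text

print(bytes_roundtrip, text_roundtrip)  # prints: True False
```

    The mapping is reversible from the 8-bit side only: some sequences of
    escape code points collapse into ordinary characters on the way back,
    so the two relaxed forms are not in bijection.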


