RE: Roundtripping in Unicode

From: Lars Kristan (lars.kristan@hermes.si)
Date: Wed Dec 15 2004 - 09:15:38 CST

  • Next message: Lars Kristan: "RE: Roundtripping Solved"

    Kenneth Whistler wrote:
    > Lars said:
    >
    > > According to UTC, you need to keep processing
    > > the UNIX filenames as BINARY data. And, also according to
    > UTC, any UTF-8
    > > function is allowed to reject invalid sequences. Basically,
    > you are not
    > > supposed to use strcpy to process filenames.
    >
    > This is a very misleading set of statements.
    Perhaps deliberately so.

    >
    > First of all, the UTC has not taken *any* position on the
    > processing of UNIX filenames.
    At this point, I won't make any statement about whether the UTC should or
    need not do so.
    Let me just ask if it is appropriate to discuss such issues on this list?

    >
    > It is erroneous to imply that the UTC has indicated that "you
    > are not supposed to use strcpy to process filenames."
    As long as explanations about validation aren't misinterpreted by some
    people. Is there a thorough explanation anywhere in the standard of where
    and how to apply validation?

    >
    > Any process *interpreting* a UTF-8 code unit sequence as
    > characters can and should recognize invalid sequences, but
    > that is a different matter.
    OK, strcpy does not need to interpret UTF-8. But strchr probably should. Or
    is it that strchr is for opaque strings and mbschr is for UTF-8 strings?
    Then strchr should remain as is and be used for processing filenames.
    Hopefully you do not need to search for Unicode characters in a filename,
    and strchr-ing for '/' is all you need. But then all languages are supposed
    to provide functions for processing opaque strings in addition to their
    Unicode functions. Or, alternatively, they need to carefully define how
    string functions should process invalid sequences, if that can be done at
    all.

    But sooner or later you need to incorporate the filename into some UTF-8
    text. An error report, for example. You then need to program the
    boundaries quite carefully.

    Not to mention the cost to maintain existing programs. I think it makes
    sense to keep looking for other solutions.

    >
    > If I pass the byte stream <0x80 0xFF 0x80 0xFF 0x80 0xFF> to
    > a process claiming conformance to UTF-8 and ask it to interpret
    > that as Unicode characters, it should tell me that it is
    > garbage. *How* it tells me that it is garbage is a matter of
    > API design, code design, and application design.
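    A minimal structural check along those lines might look like this in C
    (simplified, and my own sketch: it rejects bad lead and continuation
    bytes and the most common overlong lead bytes, but not every overlong or
    surrogate encoding that the standard also forbids). It rejects the byte
    stream <0x80 0xFF 0x80 0xFF 0x80 0xFF> because 0x80 and 0xFF are not
    valid lead bytes.

```c
#include <stdbool.h>
#include <stddef.h>

/* Structural UTF-8 check: classify each lead byte, then verify that
   the expected number of continuation bytes (10xxxxxx) follows. */
static bool is_valid_utf8(const unsigned char *p, size_t n) {
    size_t i = 0;
    while (i < n) {
        unsigned char b = p[i];
        size_t len;
        if (b < 0x80)                              len = 1;
        else if ((b & 0xE0) == 0xC0 && b >= 0xC2)  len = 2;  /* C0/C1 overlong */
        else if ((b & 0xF0) == 0xE0)               len = 3;
        else if ((b & 0xF8) == 0xF0 && b <= 0xF4)  len = 4;  /* above U+10FFFF */
        else return false;                         /* invalid lead byte */
        if (i + len > n) return false;             /* truncated sequence */
        for (size_t j = 1; j < len; j++)
            if ((p[i + j] & 0xC0) != 0x80) return false;
        i += len;
    }
    return true;
}
```

    For example, is_valid_utf8 returns false for the six-byte stream above
    and true for well-formed text such as "abc\xC3\xA9".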

    What are stdin, stdout and argv (command line parameters) when a process is
    running in a UTF-8 locale? Binary? Opaque strings? UTF-8?

    > Unicode did not invent the notion of conformance to character
    > encoding standards. What is new about Unicode is that it has
    > *3* interoperable character encoding forms, not just one, and
    > all of them are unusual in some way, because they are designed
    > for a very, very large encoded character repertoire, and
    > involve multibyte and/or non-byte code unit representations.

    The difference is that far more people will be faced with such problems.

    Lars



    This archive was generated by hypermail 2.1.5 : Wed Dec 15 2004 - 09:23:52 CST