RE: Roundtripping in Unicode

From: Kenneth Whistler (kenw@sybase.com)
Date: Tue Dec 14 2004 - 14:27:12 CST

    Lars said:

    > According to UTC, you need to keep processing
    > the UNIX filenames as BINARY data. And, also according to UTC, any UTF-8
    > function is allowed to reject invalid sequences. Basically, you are not
    > supposed to use strcpy to process filenames.

    This is a very misleading set of statements.

    First of all, the UTC has not taken *any* position on the
    processing of UNIX filenames. That is an implementation issue
    outside the scope of what the UTC normally deals with, and I
    doubt that it will take a position on the issue.

    It is erroneous to imply that the UTC has indicated that "you
    are not supposed to use strcpy to process filenames." It has
    done nothing of the kind, and I don't know of any reason why
    anyone should think otherwise. I certainly use strcpy to process
    filenames, UTF-8 or not, and expect that nearly every implementer
    on the list has done so, too.
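
    A minimal sketch of why that is unproblematic, in C: strcpy moves
    bytes up to and including the terminating NUL and never decodes
    them, so a filename that is not valid UTF-8 comes out byte for
    byte as it went in. The helper name copy_filename and its size
    check are hypothetical, added only for illustration.

```c
#include <string.h>

/* Hypothetical helper: copy a filename as an opaque, NUL-terminated
   byte string. Returns 0 on success, -1 if dst is too small. */
int copy_filename(char *dst, size_t dst_size, const char *src)
{
    size_t len = strlen(src);      /* counts bytes, not characters */

    if (len + 1 > dst_size)
        return -1;                 /* no room for the bytes plus the NUL */

    strcpy(dst, src);              /* byte copy; nothing is interpreted */
    return 0;
}
```

    Passing the bytes <0x80 0xFF 0x61> followed by a NUL yields an
    identical copy, even though those bytes are not valid UTF-8.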

    Any process *interpreting* a UTF-8 code unit sequence as
    characters can and should recognize invalid sequences, but
    that is a different matter.

    If I pass the byte stream <0x80 0xFF 0x80 0xFF 0x80 0xFF> to
    a process claiming conformance to UTF-8 and ask it to interpret
    that as Unicode characters, it should tell me that it is
    garbage. *How* it tells me that it is garbage is a matter of
    API design, code design, and application design.
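
    One possible shape for such a check, as a sketch only: a C
    function that reports whether a byte sequence is well-formed
    UTF-8, following the byte ranges in the Unicode Standard's table
    of well-formed UTF-8 byte sequences. The name utf8_is_valid and
    the choice to return a plain 0/1 flag are assumptions; returning
    an error offset or substituting U+FFFD would be equally
    legitimate designs.

```c
#include <stddef.h>

/* Hypothetical validator: returns 1 if s[0..len) is well-formed UTF-8,
   0 otherwise. */
int utf8_is_valid(const unsigned char *s, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char b = s[i];
        size_t n;                            /* trailing bytes expected */
        unsigned char lo = 0x80, hi = 0xBF;  /* range of first trailing byte */

        if (b <= 0x7F) { i++; continue; }    /* ASCII */
        else if (b >= 0xC2 && b <= 0xDF) n = 1;
        else if (b == 0xE0) { n = 2; lo = 0xA0; }   /* excludes overlong forms */
        else if (b >= 0xE1 && b <= 0xEC) n = 2;
        else if (b == 0xED) { n = 2; hi = 0x9F; }   /* excludes surrogates */
        else if (b >= 0xEE && b <= 0xEF) n = 2;
        else if (b == 0xF0) { n = 3; lo = 0x90; }   /* excludes overlong forms */
        else if (b >= 0xF1 && b <= 0xF3) n = 3;
        else if (b == 0xF4) { n = 3; hi = 0x8F; }   /* excludes > U+10FFFF */
        else return 0;   /* 0x80..0xC1 and 0xF5..0xFF never start a sequence */

        if (len - i - 1 < n)
            return 0;                        /* sequence is truncated */
        if (s[i + 1] < lo || s[i + 1] > hi)
            return 0;
        for (size_t k = 2; k <= n; k++)
            if (s[i + k] < 0x80 || s[i + k] > 0xBF)
                return 0;

        i += n + 1;
    }
    return 1;
}
```

    Called on <0x80 0xFF 0x80 0xFF 0x80 0xFF>, this returns 0 at the
    very first byte, since 0x80 cannot begin a UTF-8 sequence.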

    But there is *nothing* new here.

    If I pass the byte stream <0x80 0xFF 0x80 0xFF 0x80 0xFF> to
    a process claiming conformance to Shift-JIS and ask it to interpret
    that as JIS characters, it should tell me that it is
    garbage. *How* it tells me that it is garbage is a matter of
    API design, code design, and application design.

    Unicode did not invent the notion of conformance to character
    encoding standards. What is new about Unicode is that it has
    *3* interoperable character encoding forms, not just one, and
    all of them are unusual in some way, because they are designed
    for a very, very large encoded character repertoire, and
    involve multibyte and/or non-byte code unit representations.

    > Well, I just hope no one will listen to them and modify strcpy and strchr to
    > validate the data when running in UTF-8 locale and start signalling
    > something (really, where and how?!). The two statements from UTC don't make
    > sense when put together. Unless we are really expected to start building
    > everything from scratch.

    This is bogus. The UTC has never asked anyone to modify strcpy
    and strchr. What anyone implementing UTF-8 using a C runtime
    library (or similar set of functions) has to do is completely
    comparable to what they have to do for supporting any other
    multibyte character encoding on such systems. If your system
    handles euc-kr, euc-tw, and/or euc-jp correctly, then adding
    UTF-8 support is comparable, in principle and in practice.
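
    As an illustration of how little the byte-oriented C functions
    need to change, here is a sketch that uses strrchr (the
    search-from-the-end sibling of strchr) to find the last path
    component of a filename. It relies on a property of UTF-8: the
    bytes 0x00..0x7F occur only as ASCII characters and never inside
    a multibyte sequence, so a search for '/' cannot match in the
    middle of a character. The helper name last_component is
    hypothetical.

```c
#include <string.h>

/* Hypothetical helper: return a pointer to the final component of a
   '/'-separated path, treating the path as plain bytes. */
const char *last_component(const char *path)
{
    const char *slash = strrchr(path, '/');  /* unmodified C library search */

    return slash ? slash + 1 : path;         /* no slash: whole string */
}
```

    The same call behaves identically on ASCII, on valid UTF-8, and
    on byte strings that are not valid UTF-8 at all; no validation is
    involved in locating the separator.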

    --Ken


