Re: Subject: Re: 32'nd bit & UTF-8

From: Kenneth Whistler (kenw@sybase.com)
Date: Thu Jan 20 2005 - 22:18:36 CST

    Hans Aberg continued:

    > The quote by me above should be:
    >
    > The UTF-8 requirement of processes to ignore the BOM.

    You still don't have it right.

    What the Unicode Standard requires of a process interpreting
    a UTF-8 data stream is that:

       If it encounters the byte 0x61, it interprets that as
       U+0061 LATIN SMALL LETTER A, and not as a Chinese character.
       
       If it encounters the byte sequence <0xEF 0xBB 0xBF>, it
       interprets that as U+FEFF, and not as a question mark or
       the Hebrew letter beth.
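
As a rough illustration of the two cases above, here is a minimal Python sketch. The helper name `decode_utf8` is made up for this example. It relies on Python's plain "utf-8" codec, which decodes <0xEF 0xBB 0xBF> to U+FEFF and leaves it in the result.

```python
def decode_utf8(data):
    """Decode UTF-8 bytes and return each code point as a 'U+XXXX' label."""
    # The plain "utf-8" codec keeps a leading U+FEFF in the decoded text;
    # it only interprets the bytes, it does not decide what to do with them.
    return ["U+%04X" % ord(ch) for ch in data.decode("utf-8")]


print(decode_utf8(b"\x61"))          # ['U+0061']
print(decode_utf8(b"\xef\xbb\xbf"))  # ['U+FEFF']
```

What happens to the U+FEFF after decoding is up to the process that consumes the text.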
       
    A process searching for the letter 't' may properly be implemented
    to ignore 'a's.

    A process concatenating strings may properly be implemented to
    ignore initial U+FEFF characters interpreted as byte order marks.

    It depends on what your process is attempting to accomplish.
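
One way a concatenating process might ignore initial U+FEFF characters is sketched below. This is only an illustration: the function name `concat_ignoring_bom` is hypothetical, and the sketch assumes the inputs are already decoded strings.

```python
BOM = "\ufeff"  # U+FEFF, treated as a byte order mark only in initial position


def concat_ignoring_bom(parts):
    """Concatenate strings, dropping one leading U+FEFF from each part."""
    # U+FEFF anywhere other than the start of a part is left untouched.
    return "".join(p[1:] if p.startswith(BOM) else p for p in parts)


print(concat_ignoring_bom(["\ufeffab", "\ufeffcd", "ef"]))  # abcdef
```

A process with a different purpose, such as a byte-exact copy, would have no reason to drop anything.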

    >
    > The problem is that UNIX processes cannot handle this, and trying to make
    > them handle it would screw up the way they work.

    Yes, we all know that trying to support UTF-8 on a Unix system
    if the UTF-8 strings are all prepended with <0xEF 0xBB 0xBF> creates
    havoc.

    Well, guess what, nobody is recommending or requiring that anybody
    do so in Unix systems. Why? Because it creates havoc.

    The problem for Unix systems is properly isolating and abstracting
    their contact points with Windows systems originating UTF-8
    strings with prepended BOMs, and then dealing with them correctly,
    just as they may have to deal with other text conventions, including
    CRLF from Windows systems or CR-delimited files from MacOS systems.

    If you can't do that, well, yes, you're hosed.
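
A filter at such a contact point could look roughly like the sketch below. It is an assumption about what one layer might do, not a description of any particular Unix tool, and `normalize_incoming` is a hypothetical name. It works on raw bytes, which is safe for UTF-8 because the bytes 0x0D and 0x0A never occur inside a multibyte sequence.

```python
import codecs


def normalize_incoming(data):
    """Strip a leading UTF-8 BOM and convert CRLF or CR line ends to LF."""
    if data.startswith(codecs.BOM_UTF8):      # the bytes <0xEF 0xBB 0xBF>
        data = data[len(codecs.BOM_UTF8):]
    # Handle CRLF first so a CRLF pair does not turn into two line feeds.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


print(normalize_incoming(b"\xef\xbb\xbfone\r\ntwo\rthree\n"))
# b'one\ntwo\nthree\n'
```

Everything downstream of the filter then sees only the text conventions it already handles.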

    > So the UNIX processes are not UTF-8 conformant, and cannot easily be made to
    > be that. Do you agree now?

    No. It misstates the problem to claim that
    Unix processes are not UTF-8 conformant. In fact they handle
    UTF-8 perfectly fine, if the data is constrained to appropriate
    subsets of Unicode characters and follows appropriate text
    conventions.

    Your job is to ensure that your Unix system doesn't choke on
    UTF-8 data using text conventions that it can't handle. For that
    you put in place the appropriate layers, abstractions and
    filters to do the job right. I'm willing to bet that your
    Unix system doesn't do too well, either, if you try piping
    a pdf file to a terminal window.

    --Ken


