Subject: Re: 32'nd bit & UTF-8

From: Hans Aberg (haberg@math.su.se)
Date: Thu Jan 20 2005 - 23:08:33 CST

    On 2005/01/21 05:18, Kenneth Whistler at kenw@sybase.com wrote:

    >> The quote by me above should be:
    >>
    >> The UTF-8 requirement that processes ignore the BOM.
    >
    > You still don't have it right.
    >
    > What the Unicode Standard requires of a process interpreting
    > a UTF-8 data stream is that:
    >
    > If it encounters the byte 0x61, it interprets that as
    > U+0061 LATIN SMALL LETTER A, and not as a Chinese character.
    >
    > If it encounters the byte sequence <0xEF 0xBB 0xBF>, it
    > interprets that as U+FEFF, and not as a question mark or
    > the Hebrew letter beth.

    This I have understood.
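
    For concreteness, the two interpretations above can be checked with a
    minimal decoder along these lines (a sketch of my own, not from the
    standard: it handles only sequences of up to three bytes and does not
    validate continuation bytes):

    #include <stdio.h>

    /* Decode one UTF-8 sequence; store its byte length in *len and
       return the code point, or -1 on an unsupported lead byte. */
    static long utf8_decode(const unsigned char *s, int *len)
    {
        if (s[0] < 0x80) {                   /* 1 byte: U+0000..U+007F */
            *len = 1;
            return s[0];
        }
        if ((s[0] & 0xE0) == 0xC0) {         /* 2-byte sequence */
            *len = 2;
            return ((long)(s[0] & 0x1F) << 6) | (s[1] & 0x3F);
        }
        if ((s[0] & 0xF0) == 0xE0) {         /* 3-byte sequence */
            *len = 3;
            return ((long)(s[0] & 0x0F) << 12)
                 | ((long)(s[1] & 0x3F) << 6)
                 |  (long)(s[2] & 0x3F);
        }
        *len = 1;
        return -1;
    }

    int main(void)
    {
        const unsigned char a[]   = { 0x61 };
        const unsigned char bom[] = { 0xEF, 0xBB, 0xBF };
        int n;
        printf("U+%04lX\n", (unsigned long)utf8_decode(a, &n));   /* U+0061 */
        printf("U+%04lX\n", (unsigned long)utf8_decode(bom, &n)); /* U+FEFF */
        return 0;
    }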

    > A process searching for the letter 't' may properly be implemented
    > to ignore 'a's.
    >
    > A process concatenating strings may properly be implemented to
    > ignore initial U+FEFF characters interpreted as byte order marks.
    >
    > It depends on what your process is attempting to accomplish.

    I think you have a problem here in the formulation. Or perhaps add an
    example showing how, say, BOM and non-BOM strings may be concatenated,
    along the lines of the sketch below.
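
    Something like this, I imagine (a minimal sketch of my own, in C; the
    helper names skip_bom and concat_utf8 are made up, and error handling
    is minimal):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Return a pointer past one leading UTF-8-encoded BOM, if present. */
    static const char *skip_bom(const char *s)
    {
        if ((unsigned char)s[0] == 0xEF &&
            (unsigned char)s[1] == 0xBB &&
            (unsigned char)s[2] == 0xBF)
            return s + 3;
        return s;
    }

    /* Concatenate a and b, dropping a leading BOM on b, so that no
       U+FEFF ends up in the middle of the result. Caller frees. */
    static char *concat_utf8(const char *a, const char *b)
    {
        const char *tail = skip_bom(b);
        char *r = malloc(strlen(a) + strlen(tail) + 1);
        if (r) {
            strcpy(r, a);
            strcat(r, tail);
        }
        return r;
    }

    int main(void)
    {
        char *s = concat_utf8("foo", "\xEF\xBB\xBF" "bar");
        if (s) {
            printf("%s\n", s);   /* prints "foobar", not "foo<U+FEFF>bar" */
            free(s);
        }
        return 0;
    }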

    You have here a complicated, but essentially vacuous statement: as you
    present it here, a process may ignore whatever it wants, whenever it
    wants. The BOM is no different. I could decide to use any other UTF-8
    sequence to achieve the same effect. Right?

    >> The problem is that UNIX processes cannot handle this, and trying to make
    >> them handle it would screw up the way they work.
    >
    > Yes, we all know that trying to support UTF-8 on a Unix system
    > if the UTF-8 strings are all prepended with <0xEF 0xBB 0xBF> creates
    > havoc.
    >
    > Well, guess what, nobody is recommending or requiring that anybody
    > do so in Unix systems. Why? Because it creates havoc.
    >
    > The problem for Unix systems is properly isolating and abstracting
    > its contact points with Windows systems originating UTF-8
    > strings with prepended BOMs, and then dealing with them correctly,
    > just as it may have to deal with other text conventions, including
    > CRLF from Windows systems or CR-delimited files from MacOS systems.
    >
    > If you can't do that, well, yes, you're hosed.

    Yes, it is essentially an inter-platform or file-format issue. That is
    why it is confusing to have the BOM issue mentioned in the Unicode
    standard.
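
    The kind of isolating layer you describe might look like the following
    minimal sketch (my own, not anything from the standard): a Unix-style
    filter that strips one leading UTF-8 BOM and normalizes CRLF and bare
    CR line endings to LF before the bytes reach byte-oriented tools.

    #include <stdio.h>

    int main(void)
    {
        int c = getchar();

        /* Drop one leading <0xEF 0xBB 0xBF>, if present. */
        if (c == 0xEF) {
            int c2 = getchar();
            int c3 = getchar();
            if (c2 == 0xBB && c3 == 0xBF) {
                c = getchar();           /* BOM found: skip it */
            } else {                     /* not a BOM: pass bytes on */
                putchar(c);
                if (c2 != EOF)
                    putchar(c2);
                c = c3;
            }
        }

        /* Normalize line endings: CRLF -> LF, bare CR -> LF. */
        while (c != EOF) {
            if (c == '\r') {
                putchar('\n');
                c = getchar();
                if (c == '\n')
                    c = getchar();       /* swallow the LF of CRLF */
                continue;
            }
            putchar(c);
            c = getchar();
        }
        return 0;
    }

    One would run it at the point where text from a Windows system enters
    the pipeline, say "filter < windows.txt | grep foo", and leave
    everything downstream untouched.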

    >> So the UNIX processes are not UTF-8 conformant, and cannot easily be
    >> made so. Do you agree now?
    >
    > No. It is incorrectly stating the problem to claim that
    > Unix processes are not UTF-8 conformant. In fact they handle
    > UTF-8 perfectly fine, if the data is constrained to appropriate
    > subsets of Unicode characters and follows appropriate text
    > conventions.
    >
    > Your job is to ensure that your Unix system doesn't choke on
    > UTF-8 data using text conventions that it can't handle. For that
    > you put in place the appropriate layers, abstractions and
    > filters to do the job right. I'm willing to bet that your
    > Unix system doesn't do too well, either, if you try piping
    > a pdf file to a terminal window.

    As you present it here, the Unicode standard's mention of the BOM just
    confuses the issue, especially the formulation
       but its presence does not affect conformance to the UTF-8 encoding scheme
    If I decide to use a character other than the BOM as a marker, or any
    other combination of encoded characters, would that be illegal?
    Apparently not. So you have put in a hard-to-interpret, vacuous
    statement.

      Hans Aberg


