Subject: Re: 32'nd bit & UTF-8

From: Hans Aberg (haberg@math.su.se)
Date: Thu Jan 20 2005 - 23:08:33 CST

    On 2005/01/21 05:18, Kenneth Whistler at kenw@sybase.com wrote:

    >> The quote by me above should be:
    >>
    >> The UTF-8 requirement that processes ignore the BOM.
    >
    > You still don't have it right.
    >
    > What the Unicode Standard requires of a process interpreting
    > a UTF-8 data stream is that:
    >
    > If it encounters the byte 0x61, it interprets that as
    > U+0061 LATIN SMALL LETTER A, and not as a Chinese character.
    >
    > If it encounters the byte sequence <0xEF 0xBB 0xBF>, it
    > interprets that as U+FEFF, and not as a question mark or
    > the Hebrew letter beth.

    This I have understood.
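
    For concreteness, the two interpretations above can be checked with a
    minimal decoder along these lines (a sketch of my own, not from the
    standard: it handles only sequences of up to three bytes and does not
    validate continuation bytes):

    #include <stdio.h>

    /* Decode one UTF-8 sequence; store its byte length in *len and
       return the code point, or -1 on an unsupported lead byte. */
    static long utf8_decode(const unsigned char *s, int *len)
    {
        if (s[0] < 0x80) {                   /* 1 byte: U+0000..U+007F */
            *len = 1;
            return s[0];
        }
        if ((s[0] & 0xE0) == 0xC0) {         /* 2-byte sequence */
            *len = 2;
            return ((long)(s[0] & 0x1F) << 6) | (s[1] & 0x3F);
        }
        if ((s[0] & 0xF0) == 0xE0) {         /* 3-byte sequence */
            *len = 3;
            return ((long)(s[0] & 0x0F) << 12)
                 | ((long)(s[1] & 0x3F) << 6)
                 |  (long)(s[2] & 0x3F);
        }
        *len = 1;
        return -1;
    }

    int main(void)
    {
        const unsigned char a[]   = { 0x61 };
        const unsigned char bom[] = { 0xEF, 0xBB, 0xBF };
        int n;
        printf("U+%04lX\n", (unsigned long)utf8_decode(a, &n));   /* U+0061 */
        printf("U+%04lX\n", (unsigned long)utf8_decode(bom, &n)); /* U+FEFF */
        return 0;
    }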

    > A process searching for the letter 't' may properly be implemented
    > to ignore 'a's.
    >
    > A process concatenating strings may properly be implemented to
    > ignore initial U+FEFF characters interpreted as byte order marks.
    >
    > It depends on what your process is attempting to accomplish.

    I think you have a problem here in the formulation. Or perhaps add an
    example showing how, say, BOM and non-BOM strings may be concatenated,
    along the lines of the sketch below.
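
    Something like this, I imagine (a minimal sketch of my own, in C; the
    helper names skip_bom and concat_utf8 are made up, and error handling
    is minimal):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Return a pointer past one leading UTF-8-encoded BOM, if present. */
    static const char *skip_bom(const char *s)
    {
        if ((unsigned char)s[0] == 0xEF &&
            (unsigned char)s[1] == 0xBB &&
            (unsigned char)s[2] == 0xBF)
            return s + 3;
        return s;
    }

    /* Concatenate a and b, dropping a leading BOM on b, so that no
       U+FEFF ends up in the middle of the result. Caller frees. */
    static char *concat_utf8(const char *a, const char *b)
    {
        const char *tail = skip_bom(b);
        char *r = malloc(strlen(a) + strlen(tail) + 1);
        if (r) {
            strcpy(r, a);
            strcat(r, tail);
        }
        return r;
    }

    int main(void)
    {
        char *s = concat_utf8("foo", "\xEF\xBB\xBF" "bar");
        if (s) {
            printf("%s\n", s);   /* prints "foobar", not "foo<U+FEFF>bar" */
            free(s);
        }
        return 0;
    }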

    You have here a complicated, but essentially vacuous statement: as you
    present it here, a process may ignore whatever it wants, whenever it
    wants. The BOM is no different. I could decide to use any other UTF-8
    sequence to achieve the same effect. Right?

    >> The problem is that UNIX processes cannot handle this, and trying to make
    >> them handle it would screw up the way they work.
    >
    > Yes, we all know that trying to support UTF-8 on a Unix system
    > if the UTF-8 strings are all prepended with <0xEF 0xBB 0xBF> creates
    > havoc.
    >
    > Well, guess what, nobody is recommending or requiring that anybody
    > do so in Unix systems. Why? Because it creates havoc.
    >
    > The problem for Unix systems is properly isolating and abstracting
    > its contact points with Windows systems originating UTF-8
    > strings with prepended BOMs, and then dealing with them correctly,
    > just as it may have to deal with other text conventions, including
    > CRLF from Windows systems or CR-delimited files from MacOS systems.
    >
    > If you can't do that, well, yes, you're hosed.

    Yes, it is essentially an inter-platform or file-format issue. That is
    why it is confusing to have the BOM issue mentioned in the Unicode
    standard.
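
    The kind of isolating layer you describe might look like the following
    minimal sketch (my own, not anything from the standard): a Unix-style
    filter that strips one leading UTF-8 BOM and normalizes CRLF and bare
    CR line endings to LF before the bytes reach byte-oriented tools.

    #include <stdio.h>

    int main(void)
    {
        int c = getchar();

        /* Drop one leading <0xEF 0xBB 0xBF>, if present. */
        if (c == 0xEF) {
            int c2 = getchar();
            int c3 = getchar();
            if (c2 == 0xBB && c3 == 0xBF) {
                c = getchar();           /* BOM found: skip it */
            } else {                     /* not a BOM: pass bytes on */
                putchar(c);
                if (c2 != EOF)
                    putchar(c2);
                c = c3;
            }
        }

        /* Normalize line endings: CRLF -> LF, bare CR -> LF. */
        while (c != EOF) {
            if (c == '\r') {
                putchar('\n');
                c = getchar();
                if (c == '\n')
                    c = getchar();       /* swallow the LF of CRLF */
                continue;
            }
            putchar(c);
            c = getchar();
        }
        return 0;
    }

    One would run it at the point where text from a Windows system enters
    the pipeline, say "filter < windows.txt | grep foo", and leave
    everything downstream untouched.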

    >> So the UNIX processes are not UTF-8 conformant, and cannot easily be
    >> made so. Do you agree now?
    >
    > No. It is incorrectly stating the problem to claim that
    > Unix processes are not UTF-8 conformant. In fact they handle
    > UTF-8 perfectly fine, if the data is constrained to appropriate
    > subsets of Unicode characters and follows appropriate text
    > conventions.
    >
    > Your job is to ensure that your Unix system doesn't choke on
    > UTF-8 data using text conventions that it can't handle. For that
    > you put in place the appropriate layers, abstractions and
    > filters to do the job right. I'm willing to bet that your
    > Unix system doesn't do too well, either, if you try piping
    > a pdf file to a terminal window.

    As you present it here, the Unicode standard's mention of the BOM just
    confuses the issue, especially the formulation
       but its presence does not affect conformance to the UTF-8 encoding scheme
    If I decide to use a character other than the BOM as a marker, or any
    other combination of encoded characters, would that be illegal?
    Apparently not. So you have put in a hard-to-interpret, vacuous
    statement.

      Hans Aberg


