Re: Subject: Re: 32'nd bit & UTF-8

From: Kenneth Whistler (kenw@sybase.com)
Date: Thu Jan 20 2005 - 22:18:36 CST

Next message: Arcane Jill: "Re: 32'nd bit & UTF-8"

Previous message: Hans Aberg: "Re: Subject: Re: 32'nd bit & UTF-8"
Maybe in reply to: Arcane Jill: "Subject: Re: 32'nd bit & UTF-8"
Next in thread: Hans Aberg: "Re: Subject: Re: 32'nd bit & UTF-8"
Reply: Hans Aberg: "Re: Subject: Re: 32'nd bit & UTF-8"

Hans Aberg continued:

> The quote by me above should be:
>
> The UTF-8 requirement of processes to ignore the BOM.

You still don't have it right.

What the Unicode Standard requires of a process interpreting
a UTF-8 data stream is that:

   If it encounters the byte 0x61, it interprets that as
   U+0061 LATIN SMALL LETTER A, and not as a Chinese character.

   If it encounters the byte sequence <0xEF 0xBB 0xBF>, it
   interprets that as U+FEFF, and not as a question mark or
   the Hebrew letter beth.
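
As a rough illustration of those two cases (a sketch only, using
Python's built-in "utf-8" codec; the helper name code_points is
made up for this example), the bytes map to code points like so:

```python
def code_points(data: bytes) -> list[str]:
    """Decode UTF-8 bytes and list the resulting code points as U+XXXX."""
    # The plain "utf-8" codec interprets <EF BB BF> as U+FEFF and keeps it;
    # it does not silently drop the character.
    return [f"U+{ord(ch):04X}" for ch in data.decode("utf-8")]


print(code_points(b"\x61"))          # ['U+0061']
print(code_points(b"\xef\xbb\xbf"))  # ['U+FEFF']
```

Interpreting the bytes correctly is the whole requirement here; what
the process then does with U+FEFF is a separate question.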

A process searching for the letter 't' may properly be implemented
to ignore 'a's.

A process concatenating strings may properly be implemented to
ignore initial U+FEFF characters interpreted as byte order marks.

It depends on what your process is attempting to accomplish.
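
For instance, a concatenation routine that chooses to ignore an
initial U+FEFF might look roughly like this. It is a minimal sketch,
not anything the standard prescribes, and the function name
concat_ignoring_bom is hypothetical:

```python
BOM = "\ufeff"  # U+FEFF, acting as a byte order mark when it is string-initial


def concat_ignoring_bom(parts):
    """Join already-decoded strings, dropping one leading U+FEFF from each."""
    # Only an initial U+FEFF is treated as a BOM; any later occurrence
    # inside a string is left untouched.
    return "".join(p[1:] if p.startswith(BOM) else p for p in parts)


print(concat_ignoring_bom(["\ufeffabc", "\ufeffdef"]))  # abcdef
```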

>
> The problem is that UNIX processes cannot handle this, and trying to make
> them handle it would screw up the way they work.

Yes, we all know that trying to support UTF-8 on a Unix system
if the UTF-8 strings are all prepended with <0xEF 0xBB 0xBF> creates
havoc.

Well, guess what, nobody is recommending or requiring that anybody
do so in Unix systems. Why? Because it creates havoc.

The problem for Unix systems is properly isolating and abstracting
their contact points with Windows systems that originate UTF-8
strings with prepended BOMs, and then dealing with them correctly,
just as they may have to deal with other text conventions, including
CRLF from Windows systems or CR-delimited files from MacOS systems.

If you can't do that, well, yes, you're hosed.
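
A filter at such a contact point can be quite small. The following is
a sketch under stated assumptions: the name normalize_incoming is
invented for illustration, the input is assumed to be well-formed
UTF-8, and the target convention is assumed to be "no BOM, LF line
ends". Whether to strip or keep the BOM is a policy choice for the
receiving system.

```python
UTF8_BOM = b"\xef\xbb\xbf"


def normalize_incoming(data: bytes) -> bytes:
    """Strip a leading UTF-8 BOM and convert CRLF / bare CR line ends to LF."""
    if data.startswith(UTF8_BOM):
        data = data[len(UTF8_BOM):]
    # Working on raw bytes is safe in UTF-8: 0x0D and 0x0A never occur
    # inside a multi-byte sequence. CRLF is replaced first so that it
    # does not turn into two line feeds.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


print(normalize_incoming(b"\xef\xbb\xbfone\r\ntwo\rthree\n"))
# b'one\ntwo\nthree\n'
```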

> So the UNIX processes are not UTF-8 conformant, and cannot easily be made to
> be that. Do you agree now?

No. It is incorrectly stating the problem to claim that
Unix processes are not UTF-8 conformant. In fact they handle
UTF-8 perfectly fine, if the data is constrained to appropriate
subsets of Unicode characters and follows appropriate text
conventions.

Your job is to ensure that your Unix system doesn't choke on
UTF-8 data using text conventions that it can't handle. For that
you put in place the appropriate layers, abstractions and
filters to do the job right. I'm willing to bet that your
Unix system doesn't do too well, either, if you try piping
a pdf file to a terminal window.

--Ken

This archive was generated by hypermail 2.1.5 : Thu Jan 20 2005 - 22:20:19 CST