RE: UTF-8 'BOM' (was RE: Subject: Re: 32'nd bit & UTF-8)

From: Lars Kristan (lars.kristan@hermes.si)
Date: Thu Jan 20 2005 - 09:29:15 CST

Next message: Lars Kristan: "RE: 32'nd bit & UTF-8"

Previous message: Lars Kristan: "RE: Subject: Re: 32'nd bit & UTF-8"
Maybe in reply to: Lars Kristan: "UTF-8 'BOM' (was RE: Subject: Re: 32'nd bit & UTF-8)"
Next in thread: Hans Aberg: "Re: UTF-8 'BOM' (was RE: Subject: Re: 32'nd bit & UTF-8)"

Philippe Verdy replied:
>
> > * On UNIX, files are LF delimited.
> > * On UNIX, fopen is always binary, text mode is rare and even then very
> >   close to binary.
> > * On UNIX, text mode often doesn't apply, so UNIX is by default in binary
> >   mode.
> > * On UNIX, filenames are presented and treated as opaque strings.
> > * On UNIX, filenames are therefore case sensitive.
> > * On UNIX, if a BOM is used in UTF-8 streams, most everything breaks.
> > * On UNIX, command line and scripting is very strong, the vulnerability to
> >   UTF-8 BOM is very high.
>
> All these are Unix deficiencies.

And strengths as well.
* The size of UNIX files equals the number of bytes in the string(s)
written. Seeks are simple and efficient.
* You can't mess up things by forgetting to set the binary mode.
* You can have filenames in more than one encoding on a system. A deficiency
now, but very useful at the time.
* Case sensitivity can be ambiguous in different languages, can it not? Even
more so when an old system is fed with Unicode characters introduced later
on.
* UNIX does not drop data when an odd encoding creeps in. At least not until
processing it heavily.
And I hope strong scripting is not a deficiency.
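
To illustrate the first point in the list above, here is a minimal Python sketch. The helper name written_size is made up for this example; it only assumes that opening a file in binary mode ("wb") performs no newline translation, which holds on any platform.

```python
import os
import tempfile


def written_size(data: bytes) -> int:
    """Write data in binary mode and return the resulting file size.

    Hypothetical helper for illustration only.
    """
    fd, path = tempfile.mkstemp()
    try:
        # "wb" means no newline translation: bytes in, same bytes out.
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        return os.path.getsize(path)
    finally:
        os.remove(path)


payload = "line one\nline two\n".encode("utf-8")
# The file size matches the number of bytes written, so a byte offset
# computed in memory is also a valid seek offset in the file.
same_size = written_size(payload) == len(payload)
```

With a translating text mode, each LF could become CRLF on some systems and the two numbers would no longer agree, which is what makes seeking awkward there.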

> If there had been such simple attribute, the filesystems could have
> been tuned up to support some transparent transformation, such as the
> elimination of CRLF/CR/LF conventions, the transparent filtering of
> BOMs according to the application preference.

The filesystem seems like a central place to address it, yet this is a simple
but wrong solution. Pipes need the meta tag also. So, it's the run time
library that needs to address it. Hence, interoperability problems between
old and new programs.

> Those working on a revision of POSIX and ANSI C/C++ should now consider
> the need for reliable and portable (documented and approved by a formal
> standard) ways to support meta-data on filesystems across multiple
> OSes. This would really help the IT community.

Maybe. I am not sure it would work. Nor that it can be implemented and put
into action soon enough.

UNIX has an approach which doesn't choke on mixed 8-bit data. It delays the
problem, often all the way to display time. Sometimes nobody cares and
no action is required. That's user friendly. Microsoft (effectively) defines
user friendly as proper display. At the cost of often choking early in the
process, dropping data, popping up dialogs...

Most UNIX users are used to dealing with more than one encoding. Which means
they can survive the transition. Once most of the data (and filenames) is in
UTF-8, UNIX will be just fine with its current approach. Even better than at
the time of numerous encodings. Except for the BOM. And, yes, it will work
only if UTF-8 becomes the prevailing format.
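
A rough Python sketch of why the BOM is the exception. The function names looks_like_script and strip_bom are invented for this example, and the "#!" test is only a simplified stand-in for what a UNIX kernel does when it executes a script; it is not taken from any particular implementation.

```python
import codecs

BOM = codecs.BOM_UTF8  # the three bytes EF BB BF


def looks_like_script(data: bytes) -> bool:
    """Simplified stand-in for the '#!' check done on executable scripts."""
    return data.startswith(b"#!")


def strip_bom(data: bytes) -> bytes:
    """Drop a single leading UTF-8 BOM, if present (hypothetical helper)."""
    return data[len(BOM):] if data.startswith(BOM) else data


script = b"#!/bin/sh\necho hello\n"
with_bom = BOM + script

# A byte-oriented tool like cat simply concatenates, so the second BOM
# ends up in the middle of the stream as a stray U+FEFF.
joined = with_bom + with_bom
stray_boms = joined.decode("utf-8-sig").count("\ufeff")
```

Here looks_like_script(with_bom) is false because the first bytes are no longer "#!", and the "utf-8-sig" codec removes only the leading BOM, leaving the one in the middle of the concatenated data.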

Such an approach would be very simple and efficient. Some flavors will
definitely adopt it. We'll have to wait and see if they can get away with
it. Oh, BTW, they will need to deal with invalid sequences. Unicode should
provide the means for them to do so. Just like it allows the BOM in UTF-8.
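
As an illustration of what "dealing with invalid sequences" can look like, here is a small Python sketch. It uses Python's "surrogateescape" error handler, which is one existing mechanism and not something this message or Unicode itself prescribes; the function name roundtrip and the sample data are made up for the example.

```python
def roundtrip(raw: bytes) -> bytes:
    """Decode possibly invalid UTF-8 and encode it back without losing bytes.

    Invalid bytes are mapped to lone surrogates (U+DC80..U+DCFF) on decode
    and mapped back to the original bytes on encode.
    """
    text = raw.decode("utf-8", errors="surrogateescape")
    return text.encode("utf-8", errors="surrogateescape")


# Mixed data: valid UTF-8 followed by the same word in Latin-1.
mixed = "caf\u00e9".encode("utf-8") + b" " + "caf\u00e9".encode("latin-1")

# Lossless: the odd bytes survive until someone actually needs to display them.
preserved = roundtrip(mixed) == mixed

# Lossy alternative: replacing invalid bytes with U+FFFD drops the original data.
replaced = mixed.decode("utf-8", errors="replace").encode("utf-8")
```

The first variant matches the UNIX habit of carrying the data through untouched; the second is the "proper display" choice, at the cost of not being able to recover the original bytes.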

Lars

This archive was generated by hypermail 2.1.5 : Thu Jan 20 2005 - 09:30:22 CST