Re: Subject: Re: 32'nd bit & UTF-8

From: Hans Aberg (haberg@math.su.se)
Date: Wed Jan 19 2005 - 18:48:53 CST

Next message: Hans Aberg: "Re: 32'nd bit & UTF-8"

Previous message: Eric Muller: "Re: Forms for invisible ZWJ (and ZWNJ)"
In reply to: Oliver Christ: "RE: Subject: Re: 32'nd bit & UTF-8"
Next in thread: Hans Aberg: "Re: Subject: Re: 32'nd bit & UTF-8"

On 2005/01/19 22:33, Oliver Christ at oli@trados.com wrote:

> I don't see a big difference between the UTF16 BOMs and the UTF8 one.
> All signal that the file's encoding is Unicode, and specify which
> "variant" is actually used.

The problem is that UNIX systems do not use file contents to indicate
file encoding. A leading BOM breaks scripts and other essential OS data; see
<http://www.cl.cam.ac.uk/~mgk25/unicode.html>.
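
As a rough illustration of the script problem, here is a minimal Python
sketch. It assumes a simplified model in which a script is recognized only
when its first two bytes are "#!"; the helper name starts_with_shebang is
hypothetical and stands in for whatever check the system actually performs.

```python
import codecs

def starts_with_shebang(data: bytes) -> bool:
    """Simplified, assumed model: a script is recognized only if its
    very first two bytes are '#!'."""
    return data[:2] == b"#!"

# A plain script, and the same script with a UTF-8 BOM (EF BB BF) prepended.
script = b"#!/bin/sh\necho hello\n"
script_with_bom = codecs.BOM_UTF8 + script

print(starts_with_shebang(script))           # the plain file passes the check
print(starts_with_shebang(script_with_bom))  # the BOM pushes '#!' to offset 3
```

The three BOM bytes are invisible in most editors, so the two files look
the same to a user even though a byte-level check treats them differently.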

> It should also be relatively simple to pipe any input through e.g. GNU's
> recode for encoding normalization to UTF16 or whatever so that only one
> module (the recoder) needs to be aware of BOMs (and/or "sniffing"
> heuristics). The stream models in Java and .Net implement exactly that.

The problem is deeper than that, as it affects system software. You are
effectively asking for UNIX platforms to be adapted to handle MS OS
problems. That is not fair.

> Hans Aberg added:
>
>> It is clear that the use of a BOM in UTF-8 should properly be
>> viewed as a file format, and not a character encoding format.
>
> That's not clear to me. I find UTF8 BOMs at the beginning of e.g. an
> .html or .csv file pretty useful, equally useful to { 0xFE 0xFF } or {
> 0xFF 0xFE } at the beginning of a file. I don't think it helps when
> 'file' would report such files as "UTF8 encoded text written by Notepad
> or .Net". But maybe I misunderstood your comment.

It is a file format, in part because if one singles out a subsegment of
such a file, one cannot tell which encoding it is in. Different file formats
use different leading markers. If Unicode had supplied an escape character
for file formats, then that could have been used for special file formats,
such as "UTF-8 text" or "MS text". A plain text UTF-8 file would then not
have any such marker. Thus, the UNIX operating systems would not have to be
entirely rewritten in order to accommodate UTF-8. If one discovers a file
with the marker "UTF-8 text", then one could supply a program to handle it,
as one does in WWW browsers. So the BOM is useful to some as a file contents
marker, but a major hurdle to others. But Unicode should not hurt anyone.
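
A program of the kind meant here could be as small as the following Python
sketch. It treats the UTF-8 BOM as a leading file marker and removes it, so
that tools expecting plain text see an unmarked file. The function name
strip_utf8_bom is hypothetical, and the sketch assumes the whole file fits
in memory as a byte string.

```python
import codecs

def strip_utf8_bom(data: bytes) -> bytes:
    """Remove one leading UTF-8 BOM (EF BB BF) if present.
    Only the very start of the data is examined."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

marked = codecs.BOM_UTF8 + "plain text".encode("utf-8")
plain = strip_utf8_bom(marked)

# A subsegment taken from the middle of the marked file carries no marker
# at all, which is why the BOM behaves like a file format header.
segment = marked[5:]
print(plain == b"plain text")
print(segment.startswith(codecs.BOM_UTF8))
```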

Hans Aberg


This archive was generated by hypermail 2.1.5 : Wed Jan 19 2005 - 18:49:46 CST