Subject: Re: 32'nd bit & UTF-8

From: Marcin 'Qrczak' Kowalczyk (qrczak@knm.org.pl)
Date: Wed Jan 19 2005 - 16:57:56 CST


    "Oliver Christ" <oli@trados.com> writes:

    > Which is just the same for any other BOM or an encoding specification in
    > HTML's META element (which is much worse as you need to read quite some
    > content before you know the encoding in which to actually read).

    As Hans Aberg said, a BOM is usable in a file format (even if
    inconvenient), but it makes little sense at the level of an encoding,
    because "beginning of text stream" is an ambiguous concept.

    Consider a program which reads a list of filenames to process from a
    file, such as tar with its -X / --exclude-from option. Should it
    support the case where the file starts with a BOM? Note that it
    currently doesn't recode the filenames at all, because a filename is
    technically an almost arbitrary sequence of bytes. If a user edits
    the list of files, the text editor inserts a BOM, and tar fails to
    exclude a file because its real filename doesn't begin with a BOM,
    whose fault is it?
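    For illustration, here is a minimal sketch (not tar's actual code;
    is_excluded and the filenames are hypothetical) of why a byte-exact
    comparison fails once an editor has prepended a UTF-8 BOM to the
    list entry:

        /* Exclusion entries are matched against filenames as raw bytes:
           no recoding, no BOM handling. */
        #include <stdio.h>
        #include <string.h>

        static int is_excluded(const char *filename, const char *entry)
        {
            return strcmp(filename, entry) == 0;
        }

        int main(void)
        {
            const char *real_name = "secret.txt";
            /* The same entry as saved by an editor that prepends
               a UTF-8 BOM (the bytes EF BB BF). */
            const char *list_entry = "\xEF\xBB\xBFsecret.txt";

            printf("%d\n", is_excluded(real_name, list_entry)); /* 0 */
            return 0;
        }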

    The same question applies to fgrep with a list of patterns to search
    for in a text file (one pattern per line). If the pattern list starts
    with a BOM, does the user want to search for a BOM, or is it a marker
    to be stripped?

    The diff program compares two text files and produces a text file
    which describes the differences in a precise format, suitable for
    applying the differences to one of the files to obtain the other
    (it's suitable only for text files). The format includes lines of
    the original files prefixed by characters like a space, a plus sign
    or a minus sign. What should it do with BOMs? If it treats them like
    any other character, they will be put in the middle of lines, after
    the prefix character. But files with differences are meant to be
    human-readable, not only machine-readable. What should a text editor
    do with a BOM in the middle of a line? And if diff stripped the BOMs,
    it would lose information: how should it describe the differences
    between two files which differ only in the presence of a BOM?
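    For illustration (assuming diff treats the BOM as an ordinary
    character, with <BOM> standing for the bytes EF BB BF), comparing a
    file that starts with a BOM against the same file without one might
    produce something like:

        1c1
        < <BOM>first line
        ---
        > first line

    The BOM now sits after the "<" prefix, i.e. in the middle of a line
    of the diff output.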

    Unix programs tend to treat a BOM the same way as a CR before an LF.
    If the programmer took care to make the program recognize the Windows
    convention, then it will understand the file (though it will not
    necessarily recreate the CR on output); by default, without special
    support, a CR is treated as a strange whitespace character. Internet
    protocols which specify CR before LF are of course supported, but
    file formats based on text files generally use LF only. Similarly
    for the BOM: in most programs, where it doesn't just become harmless
    naturally, it will be treated as a strange character at the beginning.
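    As a minimal sketch of what such special support amounts to (the
    function read_text_line is hypothetical, not taken from any real
    program), a tolerant line reader strips a trailing CR and, on the
    first line only, a leading UTF-8 BOM:

        #include <stdio.h>
        #include <string.h>

        /* Reads one line, dropping "\n" or "\r\n"; if first_line is
           nonzero, also drops a leading UTF-8 BOM (EF BB BF). */
        static char *read_text_line(char *buf, int size, FILE *f,
                                    int first_line)
        {
            if (!fgets(buf, size, f))
                return NULL;
            size_t len = strlen(buf);
            if (len > 0 && buf[len - 1] == '\n')
                buf[--len] = '\0';
            if (len > 0 && buf[len - 1] == '\r')
                buf[--len] = '\0';
            if (first_line && strncmp(buf, "\xEF\xBB\xBF", 3) == 0)
                memmove(buf, buf + 3, len - 3 + 1);
            return buf;
        }

        int main(void)
        {
            char line[4096];
            for (int first = 1;
                 read_text_line(line, sizeof line, stdin, first);
                 first = 0)
                puts(line);
            return 0;
        }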

    > I don't see a big difference between the UTF16 BOMs and the UTF8 one.
    > All signal that the file's encoding is Unicode, and specify which
    > "variant" is actually used.

    UTF-16 is not used as a format for text files on Unix because it's
    incompatible with ASCII. A C compiler cannot support a UTF-8 BOM in
    the same way it might support a UTF-16 BOM (I mean when reading C
    source), because a C compiler doesn't accept UTF-16 source at all.

    UTF-16 is used inside Java, inside some databases, and inside some
    library APIs (e.g. Qt). I have *never* met a UTF-16-encoded
    standalone file, while UTF-8 is common and is becoming more and more
    common today.

    C APIs generally either assume the locale's default encoding (e.g.
    localized error messages returned by strerror), or use UTF-8 (e.g.
    Gtk+), or use wchar_t, which is UTF-32 on Unix (e.g. the wide
    character variant of the curses library). UTF-32 lives only in
    memory; it never appears in files.
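    As a minimal sketch of the wchar_t case (assuming a UTF-8 locale
    such as en_US.UTF-8 is available on the system), one multibyte
    UTF-8 character decodes to a single UTF-32 code point in memory:

        #include <locale.h>
        #include <stdio.h>
        #include <stdlib.h>

        int main(void)
        {
            setlocale(LC_CTYPE, "en_US.UTF-8");
            const char *utf8 = "z\xC5\x82oty";  /* "zloty" with l-stroke */
            wchar_t wide[16];
            size_t n = mbstowcs(wide, utf8, 16);
            /* The two UTF-8 bytes C5 82 become the single code point
               U+0142 in wide[1]. */
            printf("%zu wide characters; wide[1] = U+%04X\n",
                   n, (unsigned)wide[1]);
            return 0;
        }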

    -- 
       __("<         Marcin Kowalczyk
       \__/       qrczak@knm.org.pl
        ^^     http://qrnik.knm.org.pl/~qrczak/
    

