Re: UTF-8 BOM (Re: Charset declaration in HTML)

From: Steven Atreju <snatreju_at_googlemail.com>
Date: Fri, 13 Jul 2012 16:04:44 +0200

Philippe Verdy <verdy_p_at_wanadoo.fr> wrote:

 |2012/7/12 Steven Atreju <snatreju_at_googlemail.com>:
 |> UTF-8 is a bytestream, not multioctet(/multisequence).
 |Not even. UTF-8 is a text-stream, not made of arbitrary sequences of
 |bytes. It has a lot of internal semantics and constraints. Some things
 |are very meaningful, some play absolutely no role at all and could
 |even be discarded from digital signature schemes (this includes
 |ignoring BOMs wherever they are, and ignoring the encoding effectively
 |used in checksum algorithms, whose first step will be to uniformize
 |and canonicalize the encoding into a single internal form before
 |processing).
 |The effective binary encoding of text streams should NOT play any
 |semantic role (all UTFs should completely be equivalent on the text
 |interface, the bytestream low level is definitely not suitable for
 |handling text and should not play any role in any text parser or
 |collator).

I don't understand what you are saying here.
UTF-8 is a data interchange format, a text-encoding.
It is not a filetype!

A BOM is a byte-order-mark, used to signal different host endiannesses.
There are BOM-less UTF-16{LE,BE} and UTF-32{LE,BE}. A BOM is
necessarily not an encoding-indicator. Encoding a byteorder mark
in a byte-oriented data stream must necessarily be born from
either a misunderstanding, laziness or ignorance. But if it is
part of the content, it is part of the content, and thus belongs
to it. You cannot simply truncate user data content at will?!
And automatically??
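
A minimal sketch of what looking for a BOM at the start of a byte
stream involves, assuming Python 3 and the BOM constants from its
standard codecs module. The function name leading_bom and the labels
are made up for illustration. Note that a match only says the first
bytes equal a BOM sequence; it does not prove the encoding.

```python
import codecs

# Known BOM byte sequences. The UTF-32 LE BOM (FF FE 00 00) begins with
# the UTF-16 LE BOM (FF FE), so the UTF-32 entries are tested first.
# Caveat: FF FE 00 00 could also be UTF-16 LE text starting with U+0000.
BOMS = [
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
]


def leading_bom(data: bytes):
    """Return (label, bom_length) if data starts with a known BOM, else None."""
    for bom, label in BOMS:
        if data.startswith(bom):
            return label, len(bom)
    return None
```

The caller still has to decide whether those leading bytes are a
signature to drop or content to keep, which is the question above.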

 |The history is a lot
 |different, and text files have always used another paradigm, based on
 |line records. End of lines initially were not really control
 |characters. And even today the Unix-style end of lines (as advertized
 |on other systems now with the C language) is not using the
 |international standard (CR+LF, which was NOT a Microsoft creation for
 |DOS or Windows).

CR+LF seems to originate in teletypewriters (my restricted
knowledge, sorry). CR+LF is used in a lot of internet protocols.
Unix has used \n (U+000A) to indicate End-Of-Line in text files
for a long time. This seems logical to me, because there is no
cursor to transport to the left margin of the screen, which was
the purpose of a Carriage-Return (unless the content of the text
file is about to be interpreted by a terminal directly, but for
that the terminal must be so configured; POSIX:
http://pubs.opengroup.org/onlinepubs/9699919799/, Base
Definitions, 11. General Terminal Interface).

 |Maybe you would think that "cat utf8file1.txt utf8file2.txt >
 |utf8file.txt" would create problems. For plain text-files, this is no
 |longer a problem, even if there are extra BOMs in the middle, playing
 |as no-ops.
 |Now try "cat utf8file1.txt utf16file2.txt > unknownfile.txt" and it
 |will not work. It will not work as well each time you'll have text
 |files using various SBCS or DBCS encodings (there's never been any
 |standard encoding in the Unix filesystem, simply because the
 |convention was never stored in it; previous filesystems DID have the
 |way to track the encoding by storing metadata; even NTFS could track
 |the encoding, without guessing it from the content).
 |Nothing in fact prohibits Unix to have support of filesystems
 |supporting out-of-band metadata. But for now, you have to assume that
 |the "cat" tool is only usable to concatenate binary sequences, in
 |arbitrary orders: it is not properly a tool to handle text files.

If there is a file, you can simply look at it. Use less(1) or any
other pager to view it as text, use hexdump(1) or od(1) or
whatever to view it in a different way. You can do that. It is
not that you can't do that -- no dialog will appear to state that
there is no application registered to handle a filetype; you look
at the content, at a glance. You can use cat(1) to concatenate
whatever files, and the result will be the exact concatenation of
the exact content of the files you've passed. And you can
concatenate as long as you want. For example, this mail is
written in a UTF-8 enabled vi(1) basically from 1986, in UTF-8
encoding («Schöne Überraschung, gelle?» -- works from my point of
view), and the next paragraph is inserted plain from a file from
1971 (http://minnie.tuhs.org/cgi-bin/utree.pl):

      K. Thompson

     D. M. Ritchie

    November 3, 1971
                                  INTRODUCTION

This manual gives complete descriptions of all the publicly available features
of UNIX.

This worked well, and there is no magic involved here, ASCII in
UTF-8, just fine. The metadata is in your head. (Nothing will
help otherwise!) For metadata, special file formats exist, e.g.,
SGML. Or text-based approaches, of which there are some, but i
can't remember one at a glance ;}. Anyway, such things are used
for long-term archiving of text data. Though
http://www.bitsavers.org/ has chosen a very different way for
historic data. Metadata in a filesystem is not really something
for me, and in the end.
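
The cat(1) point above can be sketched in a few lines, assuming
Python 3. The helper cat and the two byte strings are hypothetical
stand-ins for file contents, not real files. The sketch shows that
concatenation is byte-exact, that plain ASCII bytes are already valid
UTF-8, and that a BOM from a second file stays in the middle of the
result as U+FEFF when decoded with the plain "utf-8" codec.

```python
import codecs


def cat(*chunks: bytes) -> bytes:
    """Join byte strings exactly as given, the way cat(1) joins file contents."""
    return b"".join(chunks)


# Hypothetical stand-ins for two file contents; the second starts with a BOM.
first = "Schöne Überraschung\n".encode("utf-8")
second = codecs.BOM_UTF8 + b"This manual gives complete descriptions\n"

joined = cat(first, second)

# The plain "utf-8" codec keeps every U+FEFF as a character of the text.
text = joined.decode("utf-8")

# Pure ASCII bytes decode to the same characters under either codec.
ascii_bytes = b"K. Thompson, D. M. Ritchie, November 3, 1971"
same = ascii_bytes.decode("ascii") == ascii_bytes.decode("utf-8")
```

Whether the U+FEFF left in the middle counts as a harmless no-op or as
stray content is up to whatever reads the result.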

 |No-op codes are not a problem. They have always existed in all
 |terminal protocols, for various functions such as padding.

Yes, there is some meaningful content around. Many C sources
contain ^L to force a new page when printed on a line printer, for
example.

 |More and more tools are now aware of the BOM as a convenient way to
 |work reliably with various UTFs. Its absence meaning that the platform
 |default encoding, or the host default, or the encoding selected by the
 |user in his locale environment will be used.
 |BOMs are in fact most useful in contexts where the storage or
 |transmission platform does not allow storing out of band metadata
 |about the encoding. It is extremely small, it does not impact the
 |performance.

A BOM is a byteorder mark. And as such simply overcomes a
weakness in the definition of the multioctet UTF formats, and that
is the missing definition of the used byteorder. Since network
protocols were carefully designed already 40 years ago, taking
care of such issues (the "network" byteorder is BE, but some
protocols *define* a different one), someone has failed to dig
deep enough before something else happened. I will *not*, as a
human who makes a lot of errors himself, cheer «win-win situation»
for something which in fact tries to turn a ridiculous miss into
something win-win-win anything. That's simply not what it is.
These are text-encodings, not file formats.
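
A small sketch of the byte-order point, assuming Python 3 and its
standard codec names ("utf-16-be", "utf-16-le", "utf-8"). The variable
names are only for illustration. The same character gets two different
byte sequences in UTF-16 depending on byte order, while UTF-8 has
exactly one.

```python
import codecs

ch = "\u00dc"  # U+00DC, LATIN CAPITAL LETTER U WITH DIAERESIS

# Multioctet code units: the two bytes of one unit can go either way round.
utf16_be = ch.encode("utf-16-be")  # 00 DC
utf16_le = ch.encode("utf-16-le")  # DC 00

# Byte-oriented: one defined sequence, no byte order to signal.
utf8 = ch.encode("utf-8")          # C3 9C

# U+FEFF itself, encoded in each form, gives the familiar BOM bytes.
bom_be = "\ufeff".encode("utf-16-be")   # FE FF
bom_le = "\ufeff".encode("utf-16-le")   # FF FE
bom_utf8 = "\ufeff".encode("utf-8")     # EF BB BF
```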

 |The BOM should now even be completely ignorable in all contexts,
 |including in the middle of combining sequences.

This goes very well for Unicode text content, but i'm not so sure
on the data storage side.

 |This solution would solve many problems to maximize the
 |interoperability (there does not exist a universal interoperability
 |solution that can solve all problems, but at least the UCS with its
 |standardized UTFs are solving many). Effective solutions that solve
 |problems much more often than what it would create with old legacy
 |applications (most of them being updatable by updating/upgrading the
 |same software). The old legacy solutions will become then something
 |only needed by some geeks, and instead of blicking them when they
 |persist in maintaining them, it will be more valuable for them to
 |isolate those contents and offer them via a proxying conversion
 |filter.
 |
 |BOMs are then not a problem, but a solution (which is not the only one
 |but helps filling the gap when other solutions are not usable or
 |available).

BOMs have never been a problem in at least binary data. They are
an effective solution on old little endian processors which leave
the task of swapping bytes into the correct order to the server,
and so for the client. There is no old legacy solution. Unicode
and UTF-8 ('think this one exclusively on Unix, since
byte-oriented) will surely be rather alone in a decade or two in
respect to text content. This will be true, and that was clear
before a UTF-8 BOM made it into the standard. Again, i will
*not* cheer win-win, *these* BOMs are simply ridiculous and
unnecessary. Different to a lot of control characters which had
their time or still work today.

It is clear that this discussion of mine is completely useless,
the BOMs are real, they have been placed in the standard, so go
and see how you can deal with it. It'll end up with radically
mincing user data on the input side, and then

  $ cat utf8withbom > output-withoutbom

will be something very common. One project of mine contains a
file cacert.pem.
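
A minimal sketch of what such input-side stripping does to the data,
assuming Python 3 with its standard codecs and hashlib modules. The
helper strip_leading_bom and the sample content are hypothetical; the
PEM-style line only stands in for a file whose exact bytes matter.

```python
import codecs
import hashlib


def strip_leading_bom(data: bytes) -> bytes:
    """Drop one leading UTF-8 BOM, if present; everything else is untouched."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# Hypothetical file content, standing in for "utf8withbom".
with_bom = codecs.BOM_UTF8 + b"-----BEGIN CERTIFICATE-----\n"
without_bom = strip_leading_bom(with_bom)

# The text looks the same afterwards, but the bytes and any checksum
# computed over them are different.
digest_before = hashlib.sha256(with_bom).hexdigest()
digest_after = hashlib.sha256(without_bom).hexdigest()

# The "utf-8-sig" codec performs the same removal while decoding to text.
decoded = with_bom.decode("utf-8-sig")
```

Anything that compares or signs the raw bytes will see two different
files here, even though a pager shows the same text.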

 |2012/7/12 Julian Bradfield <jcb+unicode_at_inf.ed.ac.uk>:
 |> Nice rant, but actually this has never worked like that. You can't cat
 |> .csv files with headers, html files, images, movies, or countless
 |> other "just files" and get a meaningful result, and never have been
 |> able to.

I will not cheer.

Basically my fault: this issue has indeed been discussed to death.

  Steven
Received on Fri Jul 13 2012 - 09:06:26 CDT
