Re: UTF-8 BOM (Re: Charset declaration in HTML)

From: Philippe Verdy <>
Date: Fri, 13 Jul 2012 18:03:00 +0200

2012/7/13 Steven Atreju <>:
> Philippe Verdy <> wrote:
> |2012/7/12 Steven Atreju <>:
> |> UTF-8 is a bytestream, not multioctet(/multisequence).
> |Not even. UTF-8 is a text-stream, not made of arbitrary sequences of
> |bytes. It has a lot of internal semantics and constraints. Some things
> |are very meaningful, some play absolutely no role at all and could
> |even be discarded from digital signature schemes (this includes
> |ignoring BOMs wherever they are, and ignoring the encoding effectively
> |used in checksum algorithms, whose first step will be to uniformize
> |and canonicalize the encoding into a single internal form before
> |processing).
> |The effective binary encoding of text streams should NOT play any
> |semantic role (all UTFs should completely be equivalent on the text
> |interface, the bytestream low level is definitely not suitable for
> |handling text and should not play any role in any text parser or
> |collator).
> I don't understand what you are saying here.
> UTF-8 is a data interchange format, a text-encoding.
> It is not a filetype!

Not only! It is a format which is unambiguously bound to a text
filetype, even if this filetype may not be intended to be interpreted
by humans (e.g. program sources or rich text formats like HTML).

> A BOM is a byte-order-mark, used to signal different host endiannesses.[...]

I've been on this list long enough to know all this already. And I've
not contradicted this role. However, this is not prescriptive for
anything other than text filetypes (whatever they are). For example,
BOMs have absolutely no role in encoding binary images, even if they
include internal multibyte numeric fields.

> |The history is a lot
> |different, and text files have always used another paradigm, based on
> |line records. End of lines initially were not really control
> |characters. And even today the Unix-style end of lines (as advertised
> |on other systems now with the C language) is not using the
> |international standard (CR+LF, which was NOT a Microsoft creation for
> |DOS or Windows).
> CR+LF seems to originate in teletypewriters (my restricted
> knowledge, sorry). CR+LF is used in a lot of internet protocols.

> Unix uses \n U+000A to indicate End-Of-Line in text files for a
> long time.

It is a usage that is younger, and which became widespread only
because of the success of the C language and its adoption for
programming other systems. For a long time (and still today), ends of
lines/newlines have been encoded very differently. Even on the
earliest Unix terminals (most often based on VT-* protocols), the LF
character was only used to move the cursor down, not to start a new
paragraph, so Unix applications had to use a "termcap" database to
convert these newlines into visual ends of paragraph, by converting
them into CR+LF.

> |Maybe you would think that "cat utf8file1.txt utf8file2.txt
> |>utf8file.txt" would create problems. For plain text-files, this is no
> |longer a problem, even if there are extra BOMs in the middle, playing
> |as no-ops.
> |Now try "cat utf8file1.txt utf16file2.txt > unknownfile.txt" and it
> |will not work. It will also not work each time you have text
> |files using various SBCS or DBCS encodings (there's never been any
> |standard encoding in the Unix filesystem, simply because the
> |convention was never stored in it; previous filesystems DID have a
> |way to track the encoding by storing metadata; even NTFS could track
> |the encoding, without guessing it from the content).
> |Nothing in fact prohibits Unix from having support for filesystems
> |supporting out-of-band metadata. But for now, you have to assume that
> |the "cat" tool is only usable to concatenate binary sequences, in
> |arbitrary orders: it is not properly a tool to handle text files.
> If there is a file, you can simply look at it. Use less(1) or any
> other pager to view it as text, use hexdump(1) or od(1) or
> whatever to view it in a different way. You can do that. It is
> not that you can't do that -- no dialog will appear to state that
> there is no application registered to handle a filetype; you look
> at the content, at a glance. You can use cat(1) to concatenate
> whatever files, and the result will be the exact concatenation of
> the exact content of the files you've passed. And you can
> concatenate as long as you want. For example, this mail is
> written in an UTF-8 enabled vi(1) basically from 1986, in UTF-8
> encoding («Schöne Überraschung, gelle?» -- works from my point of
> view), and the next paragraph is inserted plain from a file from
> 1971 (
> K. Thompson
> D. M. Ritchie
> November 3, 1971
> This manual gives complete descriptions of all the publicly available features
> of UNIX.
> This worked well, and there is no magic involved here, ASCII in
> UTF-8, just fine. The metadata is in your head. (Nothing will
> help otherwise!) For metadata, special file formats exist, i.e.,
> SGML. Or text based approaches from which there are some, but i
> can't remember one at a glance ;}. Anyway, such things are used
> for long-time archiving textdata. Though the
> have chosen a very different way for
> historic data. Metadata in a filesystem is not really something
> for me, and in the end.
> |No-op codes are not a problem. They have always existed in all
> |terminal protocols, for various functions such as padding.
> Yes, there is some meaningful content around. Many C sources
> contain ^L to force a new page when printed on a line printer, for
> example.
> |More and more tools are now aware of the BOM as a convenient way to
> |work reliably with various UTFs. Its absence means that the platform
> |default encoding, or the host default, or the encoding selected by the
> |user in his locale environment will be used.
> |BOM's are in fact most useful in contexts where the storage or
> |transmission platform does not allow storing out of band metadata
> |about the encoding. It is extremely small, it does not impact the
> |performance.
> A BOM is a byteorder mark. And as such simply overcomes a
> weakness in the definition of the multioctet UTF formats, and that
> is the missing definition of the used byteorder. Since network
> protocols were carefully designed already 40 years ago, taking
> care of such issues (the "network" byteorder is BE, but some
> protocols *define* a different one), someone has failed to dig
> deep enough before something else happened. I will *not*, as a
> human who makes a lot of errors himself, cheer «win-win situation»
> for something which in fact tries to turn a ridiculous miss into
> something win-win-win anything. That's simply not what it is.
> These are text-encodings, not file formats.
> |The BOM should now even be completely ignorable in all contexts,
> |including in the middle of combining sequences.
> This goes very well for Unicode text content, but i'm not so sure
> on the data storage side.
> |This solution would solve many problems and maximize
> |interoperability (there does not exist a universal interoperability
> |solution that can solve all problems, but at least the UCS with its
> |standardized UTFs is solving many). Effective solutions solve
> |problems much more often than what they would create with old legacy
> |applications (most of them being updatable by updating/upgrading the
> |same software). The old legacy solutions will then become something
> |only needed by some geeks, and instead of blocking them when they
> |persist in maintaining them, it will be more valuable for them to
> |isolate those contents and offer them via a proxying conversion
> |filter.
> |
> |BOMs are then not a problem, but a solution (which is not the only one
> |but helps filling the gap when other solutions are not usable or
> |available).
> BOMs have never been a problem in at least binary data. They are
> an effective solution on old little endian processors which leave
> the task of swapping bytes into the correct order to the server,
> and so for the client. There is no old legacy solution. Unicode
> and UTF-8 (I think this one exclusively on Unix, since it is
> byte-oriented) will surely be rather alone in a decade or two in
> respect to text content. This will be true, and that was clear
> before an UTF-8 BOM made it into the standard. Again, i will
> *not* cheer win-win, *these* BOMs are simply ridiculous and
> unnecessary. Different to a lot of control characters which had
> their time or still work today.
> It is clear that this discussion of mine is completely useless,
> the BOMs are real, they have been placed in the standard, so go
> and see how you can deal with it. It'll end up with radically
> mincing user data on the input side, and then
> $ cat utf8withbom > output-withoutbom
> will be something very common. One project of mine contains a
> file cacert.pem.

For compatibility reasons, the "cat" tool will never be changed to do
that, as it is used in so many scripts that generate binary output.
But another tool, specifically made for text files, could do the trick
of converting BOMs, and even parse the encodings and reencode on the
fly, even if the input contents use distinct UTFs.
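The failure mode described above for "cat utf8file1.txt utf16file2.txt"
can be demonstrated in a few lines. This is only an illustration (Python
chosen for brevity; it is not part of the original discussion):

```python
# Naive byte-level concatenation (what cat does) mixes two
# incompatible binary encodings of the same kind of content: text.
utf8_part = "abc".encode("utf-8")
utf16_part = "déf".encode("utf-16")   # starts with a FF FE BOM

mixed = utf8_part + utf16_part        # what `cat a.txt b.txt` produces

# Decoding the result as UTF-8 mangles the UTF-16 half: the BOM and
# the 16-bit code units are not valid UTF-8 sequences.
decoded = mixed.decode("utf-8", errors="replace")
print(decoded)                        # the UTF-16 bytes become U+FFFD noise
```

The ASCII-only prefix survives, but everything after the UTF-16 BOM is
replaced by U+FFFD replacement characters, which is exactly why a
byte-level "cat" cannot be trusted with mixed UTFs.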

Such text-aware tools on Linux/Unix should include: more, less, page,
tail, od (except when used in binary mode with explicit flags), ... In
my opinion they should not require you to specify which UTF is used,
or which byte order is used internally, and they should recognize BOMs
wherever they are, even in the middle of an input, as a way to
autodetect a change of UTF binary form; extra BOMs (even if they don't
change the current UTF or byte order) should never cause any trouble
(even one in the middle of a combining sequence should not break it:
the combining characters encoded after the BOM should just be decoded
according to the UTF binary format indicated by that newly detected
BOM).
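A sketch of the BOM recognition such a tool would need; the function
name and codec labels are illustrative assumptions, not an existing
API. Note that the UTF-32 patterns must be tested before the UTF-16
ones, because FF FE is a prefix of the UTF-32-LE BOM FF FE 00 00:

```python
# BOM byte sequences mapped to codecs; longest patterns listed first,
# because the UTF-32-LE BOM (FF FE 00 00) begins with the UTF-16-LE
# BOM (FF FE) and would otherwise be misdetected.
_BOMS = [
    (b"\x00\x00\xfe\xff", "utf-32-be"),
    (b"\xff\xfe\x00\x00", "utf-32-le"),
    (b"\xef\xbb\xbf",     "utf-8"),
    (b"\xfe\xff",         "utf-16-be"),
    (b"\xff\xfe",         "utf-16-le"),
]

def detect_bom(buf, pos=0):
    """Return (codec, bom_length) if a BOM starts at buf[pos], else (None, 0)."""
    for bom, codec in _BOMS:
        if buf.startswith(bom, pos):
            return codec, len(bom)
    return None, 0
```

Calling detect_bom at any offset is what would let a tool notice a
mid-stream switch of UTF binary form, as proposed above.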

If you just want to restrict the tool to Unicode semantics (to avoid
the complications of the legacy encodings, which have their own mutual
incompatibilities), you could replace "cat" by something as short as
"ucat" (for Unicode-aware cat). And if this complicates your existing
shell scripts, you may define a shell alias for changing the name, or
play with the PATH environment variable.
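A minimal sketch of such a "ucat" (the name, like everything in this
sketch, is hypothetical): it strips a leading BOM from each input,
decodes according to the detected UTF (falling back to UTF-8), and
emits one plain BOM-less UTF-8 stream. It only handles a BOM at the
start of each file, not mid-stream switches:

```python
import sys

# BOM table; UTF-32 entries must precede UTF-16 ones, since FF FE is
# a prefix of the UTF-32-LE BOM FF FE 00 00.
_BOMS = [
    (b"\x00\x00\xfe\xff", "utf-32-be"),
    (b"\xff\xfe\x00\x00", "utf-32-le"),
    (b"\xef\xbb\xbf",     "utf-8"),
    (b"\xfe\xff",         "utf-16-be"),
    (b"\xff\xfe",         "utf-16-le"),
]

def ucat_bytes(raw):
    """Decode one file's raw bytes using its BOM (default UTF-8), BOM removed."""
    for bom, codec in _BOMS:
        if raw.startswith(bom):
            return raw[len(bom):].decode(codec)
    return raw.decode("utf-8")

def ucat(paths, out=sys.stdout):
    """Concatenate files of possibly different UTFs as one UTF-8 text stream."""
    for path in paths:
        with open(path, "rb") as f:
            out.write(ucat_bytes(f.read()))

if __name__ == "__main__":
    ucat(sys.argv[1:])
```

With something like this, "ucat utf8file1.txt utf16file2.txt >
out.txt" would produce coherent UTF-8 text where byte-level "cat"
produces garbage.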

Unix is the only kind of system that does not differentiate or
structure text formats correctly. Its legacy filesystems don't provide
any support for identifying filetypes (only the filenames *may*
indicate it, but other things could cause this information to be lost,
including filename truncations, or lack of filename information,
notably in I/O streams). Multiple unspecified encodings in Unix/Linux
have always been a problem when they imply that files will be
incorrectly interpreted according to the environment/locale of the
viewing user (which has no reason to change: this is a breach in the
highly recommended separation of layers).

And I still don't see why I/O streams in Linux cannot convey an
out-of-band metadata substream (even though there has always been
support for that in the kernel, using ioctls, on which all the basic
read/write operations are built, even in system drivers; they are just
a particular subset of ioctls for handling the default streams).
Metadata should also be a property of the volume on which any
filesystem is mounted and stores filenames (this was not the case in
early Unix filesystems). Before Unix, its ancestors always had
metadata streams for keeping information about file types, encodings,
security attributes and ACLs, processing limits, for controlling
archiving/backups/replication, and for information about access time
(e.g. the need to dynamically open a remote connection, which may
require prior human authorization, or manual handling to request some
admin to mount an indexed tape, hard disk, or stock of punch cards, or
to provide decryption keys, plus information for bookkeeping the
offline storage, whether the volume is currently available, or when it
will be available).

On Unix/Linux those metadata are represented using separate files
(which are also unstructured by nature), relying only on some naming
conventions in the directory hierarchy (and only if this hierarchy is
visible). So early Unix filesystems are completely agnostic: even the
directory entries do not clearly enforce any naming convention or
encoding, with the exception of two bytes (0x00, used by the C
language to terminate strings, and 0x2F, the hierarchy separator);
even the names "." and ".." are not strictly bound to navigating the
structure (unless they are explicitly bound within the filesystem
hierarchy as name entries). It remains valid in most Unix filesystems
to name a file with only a 0x01 byte, or just a LF byte, or using
escape sequences, or a BEL control which will suspend the output for
the time of playing a sound, or backspace/delete characters that hide
some parts of the text or hide some other files from the list, so that
a simple "ls" can corrupt your terminal session, change your terminal
mode, or attempt to access private user data by forging attacks
against the terminal protocol. Security is only offered by separating
users (but a limited number of users, as user ids were initially only
small integers, and qualified user names were not part of the security
system).

The absence of structure and semantics in the core definition of the
system has required writing additional layers on top of that core
system. This may be an advantage because it minimizes the support code
needed to write the kernel, but this advantage is lost when you add
layers on top of it (and when there's no real enforcement about how
these layers will cooperate or compete to perform their services on
top of their shared kernel layer, so additional layers are also
developed to implement a cooperation and interoperability system).
Received on Fri Jul 13 2012 - 11:05:09 CDT
