Re: UTF-8 BOM (Re: Charset declaration in HTML)

From: Steven Atreju <snatreju_at_googlemail.com>
Date: Thu, 12 Jul 2012 14:36:42 +0200

Leif Halvard Silli <xn--mlform-iua_at_xn--mlform-iua.no> wrote:

 |Steven Atreju, Thu, 12 Jul 2012 12:32:46 +0200:
 |
 |> In the meanwhile the UTF-8 BOM is in the standard and thus
 |> contradicts forty years of (well) good (Unix/POSIX) engineering
 |> and craftsmanship. Where a file is a file and everything is a
 |> file, holistically. Where small tools which do their thing well
 |> can be plugged together to achieve complex tasks. Unicode is
 |> very, very important. Really.
 |>
 |> In the future simple things like '$ cat File1 File2 > File3' will
 |> no longer work that easily.
 |
 |I guess you get the same problem with UTF-16 files also, then?
 |--
 |Leif Halvard Silli

UTF-8 is a bytestream, not multioctet(/multisequence). This is
a perfectly valid data interchange format (IMHO). The embedded
BOM in UTF-8 streams seems to serve the purpose of enabling
automatic encoding detection. To handle that, data inspection is
required, and also user-chosen locale settings (LC_CTYPE,
LC_COLLATE, ...) must be forcibly overridden. This _can_
be the wrong thing, can't it? Especially behind someone's back.
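
The "data inspection" mentioned above can be sketched in a few lines of
Python. This is only an illustration, not any particular tool's behaviour;
the helper name has_utf8_bom and the sample strings are made up for the
example.

```python
import codecs

def has_utf8_bom(data: bytes) -> bool:
    """Return True if the bytes start with the UTF-8 BOM (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)

# Hypothetical sample data: the same text with and without a leading BOM.
plain = "héllo\n".encode("utf-8")
marked = codecs.BOM_UTF8 + plain

print(has_utf8_bom(plain))    # False
print(has_utf8_bom(marked))   # True

# The plain "utf-8" codec keeps the BOM as U+FEFF in the decoded text,
# while "utf-8-sig" silently drops a leading BOM.
print(repr(marked.decode("utf-8")[0]))                      # '\ufeff'
print(marked.decode("utf-8-sig") == plain.decode("utf-8"))  # True
```

Note that the consumer has to look at the data itself before it can decide
how to decode it, which is the special-casing complained about here.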

I did like ISO 10646 more with respect to the clear 31-bit
statement, yes. UTF-16 is a multisequence, so that a character
can consist of multiple codepoints, which in turn can consist of
multiple UTF-16 code units. This is harder to handle than having
some UTF-32 integers around, where one integer transports one
codepoint. I don't really understand why one gives up the 1:1
relationship of codepoint<->storage, especially if that doesn't
gain a 1:1 relationship on the storage<->character side. Why not
UTF-8 directly, then? Solely MHO.
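
To make the codepoint<->storage point concrete, here is a small Python
sketch that counts the code units each encoding form needs for a single
codepoint. The helper name code_units and the sample characters are
chosen only for illustration.

```python
def code_units(ch: str) -> dict:
    """Count code units needed to store one codepoint in each encoding form."""
    return {
        "utf-8": len(ch.encode("utf-8")),            # 8-bit units
        "utf-16": len(ch.encode("utf-16-le")) // 2,  # 16-bit units
        "utf-32": len(ch.encode("utf-32-le")) // 4,  # 32-bit units
    }

# One ASCII letter, two BMP characters, one codepoint beyond the BMP.
for ch in ("A", "é", "€", "\U0001F600"):
    print(f"U+{ord(ch):04X}", code_units(ch))

# U+1F600 needs two UTF-16 units (a surrogate pair) but one UTF-32 unit.
# Even UTF-32 is not 1:1 with characters: "e" plus COMBINING ACUTE ACCENT
# is a single visible character made of two codepoints.
print(len("e\u0301"))  # 2
```

Only UTF-32 keeps one storage unit per codepoint; UTF-16 loses that as
soon as a codepoint lies outside the BMP, and none of the three is 1:1
with user-perceived characters.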

Nothing against UTF-32 as a memory representation from my side.
Or, if it's your real desire, UTF-16. For data interchange I
prefer bytes. Besides, it is pretty clear that the Unix/POSIX
tools have to be adjusted for real Unicode awareness
(normalization and combining and working on the result). Why is
there a need to embed completely useless information in a file?
You have to special-case this. Like running

  $ < nice-windows-file.txt iconv -f UTF-16 -t UTF-8 | some-work

or something. Stripping the BOM silently may change the checksum.
The UTF-8 BOM is horrible in normal data interchange. It may be OK for
XHTML or XML, where some standard uses a fallback encoding, but
then again. Ach. ¡Viva la Revolución!
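
Both complaints, the concatenation problem and the checksum problem, can
be reproduced with a short Python sketch. The functions cat and
strip_leading_bom and the sample contents are hypothetical stand-ins for
the real tools and files.

```python
import codecs
import hashlib

BOM = codecs.BOM_UTF8  # the three bytes EF BB BF

def cat(*parts: bytes) -> bytes:
    """Byte-wise concatenation, as in `cat File1 File2 > File3`."""
    return b"".join(parts)

def strip_leading_bom(data: bytes) -> bytes:
    """Drop a leading UTF-8 BOM if present; otherwise return data unchanged."""
    return data[len(BOM):] if data.startswith(BOM) else data

# Hypothetical file contents, each saved with a UTF-8 BOM.
file1 = BOM + b"first\n"
file2 = BOM + b"second\n"
file3 = cat(file1, file2)

# The second BOM now sits in the middle of the stream. A decoder that
# strips only a leading BOM leaves it behind as U+FEFF in the text.
print(file3.count(BOM))                 # 2
print(repr(file3.decode("utf-8-sig")))  # 'first\n\ufeffsecond\n'

# Silently stripping the BOM changes the bytes, and thus the checksum.
before = hashlib.sha256(file1).hexdigest()
after = hashlib.sha256(strip_leading_bom(file1)).hexdigest()
print(before == after)                  # False
```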

¡Hasta la Victoria Siempre!

  Steven
Received on Thu Jul 12 2012 - 07:39:36 CDT
