Re: UTF-8 BOM (Re: Charset declaration in HTML)

From: Steven Atreju <snatreju_at_googlemail.com>
Date: Fri, 13 Jul 2012 22:38:45 +0200

Philippe Verdy <verdy_p_at_wanadoo.fr> wrote:

 |2012/7/13 Steven Atreju <snatreju_at_googlemail.com>:
 |> Philippe Verdy <verdy_p_at_wanadoo.fr> wrote:
 |>
 |> |2012/7/12 Steven Atreju <snatreju_at_googlemail.com>:
 |> |> UTF-8 is a bytestream, not multioctet(/multisequence).
 |> |Not even. UTF-8 is a text-stream, not made of arbitrary sequences of
 |> |bytes. It has a lot of internal semantics and constraints.
 |> |The effective binary encoding of text streams should NOT play any
 |> |semantic role (all UTFs should completely be equivalent on the text
 |> |interface, the bytestream low level is definitely not suitable for
 |> |handling text and should not play any role in any text parser or
 |> |collator).
 |>
 |> I don't understand what you are saying here.
 |> UTF-8 is a data interchange format, a text-encoding.
 |> It is not a filetype!
 |
 |Not only! It is a format which is unambiguously bound to a text
 |filetype, even if this file type may not be intended to be interpreted
 |by humans (e.g. program sources or rich text formats like HTML).
 |
 |> A BOM is a byte-order-mark, used to signal different host endiannesses.[...]
 |
 |I've been on this list long enough to know all this already, and I've
 |not contradicted this role. However, this is not prescriptive for

Sure, I know the former, and I bet there has been a lot of discussion.

 |anything other than text file types (whatever they are). For example,
 |BOMs have absolutely no role in encoding binary images, even if they
 |include internal multibyte numeric fields.

Well, it boils down to that, doesn't it? If Unicode *defines* that
the so-called BOM is in fact a Unicode-indicating tag that MUST
be present, then it is very clear what has to happen for, say,
'$ cat tagless tagged > out' (in a UTF-8 environment). I don't
agree with that, though, for the reasons I tried to put into
English words, but that is solely my problem. Another approach
would be an explicit UTF-8-BOM charset. Or, of course,
deprecating the -BE/-LE versions.
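
To make the 'cat' case concrete, here is a minimal sketch of what
plain byte-wise concatenation does when the second file starts with
the UTF-8 signature (EF BB BF). Python is used purely for
illustration; the helper names concat and concat_strip_bom are
hypothetical, and dropping the signature from every part is only one
possible policy, not something Unicode prescribes.

```python
import codecs


def concat(*parts: bytes) -> bytes:
    """Byte-wise concatenation, which is all `cat a b > out` does."""
    return b"".join(parts)


def concat_strip_bom(*parts: bytes) -> bytes:
    """Concatenate, dropping a leading UTF-8 signature from each part.

    Assumption: every part is UTF-8 text, so a leading EF BB BF is a
    signature and not payload. This is one policy among several.
    """
    out = []
    for part in parts:
        if part.startswith(codecs.BOM_UTF8):
            part = part[len(codecs.BOM_UTF8):]
        out.append(part)
    return b"".join(out)


# Stand-ins for the two files from the shell example.
tagless = "abc\n".encode("utf-8")
tagged = codecs.BOM_UTF8 + "def\n".encode("utf-8")

naive = concat(tagless, tagged)

# The signature of the second file is now in the middle of the stream,
# where it decodes as an ordinary U+FEFF character.
print(repr(naive.decode("utf-8")))

# The "utf-8-sig" codec only strips a signature at the very start,
# so it does not remove the embedded one either.
print(repr(naive.decode("utf-8-sig")))

# A signature-aware concatenation yields clean text.
print(repr(concat_strip_bom(tagless, tagged).decode("utf-8")))
```

The naive result keeps U+FEFF between the two pieces of text, which
is the behaviour a signature-unaware tool gives you.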

I don't agree with just about anything you say about automatic
metadata provision. I know that, in Germany, many, many small
libraries are being closed because there is not enough money
available to keep up with the digital race, and even the larger
ones *do* have trouble keeping up! I've mentioned bitsavers
already, but this is a drop in the bucket, almost rhetorical. In
other countries the situation is worse.

  Steven
Received on Fri Jul 13 2012 - 15:41:01 CDT
