Re: pre-HTML5 and the BOM

From: Philippe Verdy
Date: Sat, 14 Jul 2012 00:38:32 +0200

It would break only if the start of a file were the only place a BOM
could go. But if, as I propose, we allow BOMs to occur anywhere, each
one specifying which encoding to use to decode what follows it, even
shell scripts would work (you could place the BOM on a comment line,
after a hash symbol, with that line sitting below the initial hash-bang
line). In that case even the various UTFs would be mixable, extra BOMs
would not hurt, and we could live without the legacy use of an
unspecified encoding. That BOM would have to be recognized for any
standard UTF (UTF-8, UTF-16 and UTF-32, and optionally CESU-8 if it
helps; some platforms could even use their own compliant UTFs, if that
helps performance, for their internal handling within the boundaries
of that platform).
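
To make the idea concrete, here is a minimal Python sketch, not a
definitive implementation. It assumes a script loader that only checks
whether the first two bytes of the file are "#!", and a hypothetical
helper, sniff_bom, that looks for a BOM at an arbitrary byte offset
rather than only at offset 0. The names sniff_bom and has_hashbang are
made up for illustration; CESU-8 is left out because Python's codecs
module has no separate constant for it.

```python
import codecs

# Longer signatures first: the UTF-32 LE BOM (FF FE 00 00) begins with
# the UTF-16 LE BOM (FF FE), so testing UTF-16 first would misfire.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes, offset: int = 0):
    """Return (encoding, bom_length) if a BOM starts at offset, else None."""
    for bom, name in BOMS:
        if data.startswith(bom, offset):
            return name, len(bom)
    return None


def has_hashbang(data: bytes) -> bool:
    """Assumed loader check: the file's first two bytes must be '#!'."""
    return data[:2] == b"#!"


body = "echo 'héllo'\n".encode("utf-8")

# BOM at the very start of the file: the hash-bang is no longer first.
bom_first = codecs.BOM_UTF8 + b"#!/bin/sh\n" + body

# BOM on a comment line below the hash-bang line, as proposed.
bom_in_comment = b"#!/bin/sh\n# " + codecs.BOM_UTF8 + b"\n" + body

# Offset of the BOM: end of line 1, then the "# " comment prefix.
marker = bom_in_comment.index(b"\n") + len(b"\n# ")

print(has_hashbang(bom_first))            # False
print(has_hashbang(bom_in_comment))       # True
print(sniff_bom(bom_in_comment, marker))  # ('utf-8', 3)
```

The sketch only sniffs at an offset the caller already knows. Finding
BOMs anywhere in a stream is harder, since the scanner must already
know the current encoding's code-unit alignment to tell a BOM from
data bytes that happen to match.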

2012/7/13 David Starner:
> On Fri, Jul 13, 2012 at 1:29 PM, Jukka K. Korpela wrote:
>> 2012-07-13 22:37, David Starner wrote:
>>> Wikipedia says "The Unicode standard recommends against the BOM for
>>> UTF-8." and refers to page 30 of the Unicode Standard, version 6.0,
>>> that says "Use of a BOM is neither required nor recommended for
>>> UTF-8..." Calling it a myth seems bizarre.
>> “Not recommended” is distinct from “recommends against”.
> I disagree; the meaning of the two phrases overlaps in my idiolect, and
> while it would be somewhat laconic, I might use "not recommended" to
> mean "if you insist on doing that, please give us a chance to get the
> fire extinguisher first",
>> A
>> more appropriate formulation would be “Use of a BOM is not required for UTF-8,
>> but may be used as a signature that indicates, with practical certainty,
>> that data is UTF-8 encoded.”
> In the environment that UTF-8 was developed for, a BOM is a nuisance;
> a BOM will stop the shell from properly interpreting a hashbang, and
> other existing programs will lose the BOM, duplicate the BOM, and
> scatter BOMs throughout files. Given the number of text-like file
> formats (like old-school PNM) and number of scripts depending on
> existing behavior, these aren't going to be changed.
> As I said before, Unicode simplified but did not solve the fact that
> text from other operating systems requires some modification before
> working just right. But I don't think that Unicode should recommend
> unconditionally the UTF-8 BOM, because it is problematic in the field
> of use UTF-8 was created for and is still used for.
> --
> Kie ekzistas vivo, ekzistas espero.
Received on Fri Jul 13 2012 - 17:42:34 CDT
