Re: (Informational only: UTF-8 BOM and the real life) from Jukka K. Korpela on 2012-07-25 (Unicode Mail List Archive)

From: Jukka K. Korpela <jkorpela_at_cs.tut.fi>
Date: Thu, 26 Jul 2012 00:45:25 +0300

2012-07-26 0:19, Steven Atreju wrote:

> |ï»¿
>
> And that was an Unicode BOM that has been converted to UTF-8 and
> then been converted to UTF-8 once again.

Apparently the problem is that the data has been doubly encoded: it was
first encoded into UTF-8, then the bytes of the UTF-8 data were interpreted
as if they were windows-1252, and the resulting characters were UTF-8
encoded again. This is of course quite wrong, and not uncommon.
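
The mechanism can be reproduced in a few lines. What follows is only a
minimal sketch in Python, not anything taken from the software that
produced the message; the function name double_encode is made up for
illustration, and Python's "cp1252" codec is assumed to be an adequate
stand-in for windows-1252.

```python
def double_encode(text: str) -> bytes:
    """Simulate the faulty pipeline described above (illustrative only)."""
    # Step 1: correct UTF-8 encoding of the original text.
    utf8_bytes = text.encode("utf-8")
    # Step 2: the mistake, reading those bytes as if they were windows-1252.
    misread = utf8_bytes.decode("cp1252")
    # Step 3: UTF-8 encoding the wrongly decoded characters.
    return misread.encode("utf-8")


# U+FEFF is the BOM; its UTF-8 form is the byte sequence EF BB BF.
print(double_encode("\ufeff").decode("utf-8"))   # ï»¿
print(double_encode("für").decode("utf-8"))      # fÃ¼r
```

A reader that decodes the final bytes correctly as UTF-8 therefore sees
three characters where the BOM was, and two characters where "ü" was.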

> |vielen Dank fÃ¼r Ihre E-Mail.

So the letter “ü” was munged too, and presumably so was all other
non-ASCII data. This is therefore not an argument against using a BOM in
UTF-8. The BOM was a victim of incorrect processing, like every other
character outside ASCII. One might even argue that the BOM is useful
here, too, since it immediately signals that something is wrong:
“ï»¿” is an encoding error signature, so to say.
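
To illustrate how such a signature could be put to use, here is a hedged
sketch in Python. The names MOJIBAKE_BOM, looks_double_encoded and repair
are hypothetical, and the repair assumes that the only damage is a single
windows-1252 misreading of UTF-8 data; real data may be damaged in other
ways, in which case the calls below can raise UnicodeError.

```python
# The three characters a UTF-8 BOM turns into when its bytes (EF BB BF)
# are misread as windows-1252: U+00EF, U+00BB, U+00BF.
MOJIBAKE_BOM = "\u00ef\u00bb\u00bf"


def looks_double_encoded(text: str) -> bool:
    """Return True if the text starts with the misread-BOM signature."""
    return text.startswith(MOJIBAKE_BOM)


def repair(text: str) -> str:
    """Undo one windows-1252 misreading of UTF-8 data (assumption: that is
    the only damage). Raises UnicodeError if the text does not fit."""
    return text.encode("cp1252").decode("utf-8")


damaged = "\u00ef\u00bb\u00bfvielen Dank f\u00c3\u00bcr Ihre E-Mail."
if looks_double_encoded(damaged):
    fixed = repair(damaged)
    # Drop the restored BOM (U+FEFF) before showing the text.
    print(fixed.lstrip("\ufeff"))   # vielen Dank für Ihre E-Mail.
```

The check is deliberately narrow: it only recognizes data that started
with a BOM, which is exactly the case where the BOM does the signalling.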

Yucca
Received on Wed Jul 25 2012 - 16:48:11 CDT
