Re: unicode on Linux

From: Markus Scherer
Date: Wed Oct 29 2003 - 10:49:25 CST

Philippe Verdy wrote:
> the input:determine strategy will work fine for UTF-8 or SCSU, provided that
> the leading BOM is explicitly encoded. ...

By "determine" I do not mean only checking for a BOM. There are several ways to determine the
input charset, depending on the protocol, the document type, and so on. They include, but are not
limited to, a BOM, a protocol field, an in-document specification, and heuristics (guessing).
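As a rough illustration of such a cascade, here is a minimal Python sketch. The function name
determine_charset, its parameters, the order of the checks, the XML-style encoding="..." pattern,
and the ISO-8859-1 fallback are all assumptions made for this example, not something prescribed
by any standard or protocol.

```python
import codecs
import re

def determine_charset(data: bytes, protocol_charset=None, default="iso-8859-1"):
    """Hypothetical cascade; the order of the checks is an assumption."""
    # 1. Protocol field (e.g. a charset parameter supplied by the transport).
    if protocol_charset:
        return protocol_charset.lower()
    # 2. Unicode signature; only the UTF-8 one is checked in this sketch.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    # 3. In-document specification: an XML-style encoding="..." near the start.
    match = re.search(rb'encoding=["\']([A-Za-z0-9._-]+)["\']', data[:200])
    if match:
        return match.group(1).decode("ascii").lower()
    # 4. Heuristic: bytes that decode as strict UTF-8 are assumed to be UTF-8.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return default
```

A caller would pass whatever label the protocol supplied, if any, and fall through to the later
checks only when it is missing.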

About the BOM, or more precisely the Unicode signature byte sequences: Despite a theoretical
ambiguity, they work quite well for discovering a Unicode charset, but unprepared,
Unicode-unaware tools may choke on them.
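A signature check might be sketched as follows in Python. The helper name sniff_signature and the
returned charset labels are assumptions for illustration; the byte sequences come from the
constants in Python's standard codecs module. One concrete overlap the sketch has to handle is
that the UTF-32LE signature begins with the UTF-16LE one, so longer signatures are tested first.

```python
import codecs

# Longest signatures first: the UTF-32LE signature FF FE 00 00 begins with
# the UTF-16LE signature FF FE, so the order of the checks matters.
SIGNATURES = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_signature(data: bytes):
    """Return (charset, signature_length), or (None, 0) if no signature is found."""
    for signature, charset in SIGNATURES:
        if data.startswith(signature):
            return charset, len(signature)
    return None, 0
```

The returned length lets the caller skip the signature before decoding, for example
data[length:].decode(charset).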

> The idea that "if a text (without BOM) looks like valid UTF-8, then it is
> UTF-8; else it uses another legacy encoding" does not work in practice and
> also leads to too many false positives.

It may not work in all cases, but a heuristic that works in more than 95% or so of cases in
practice works quite well, as far as I am concerned.
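The heuristic under discussion can be sketched in a few lines of Python. The function name
guess_charset and the ISO-8859-1 fallback are assumptions for this example; a real tool would
pick the fallback from its locale or configuration.

```python
def guess_charset(data: bytes, fallback: str = "iso-8859-1") -> str:
    """Guess UTF-8 if the bytes are well-formed UTF-8, otherwise the fallback."""
    try:
        data.decode("utf-8")  # strict decoding rejects ill-formed sequences
    except UnicodeDecodeError:
        return fallback
    return "utf-8"
```

Pure ASCII input is reported as UTF-8, which is harmless because the bytes are identical in both
encodings. A false positive requires legacy text whose non-ASCII bytes happen to form
well-formed UTF-8 sequences, such as the ISO-8859-1 bytes C3 A9.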

>>- if you are absolutely certain that they suffice - use US-ASCII or ISO
>> 8859-1.
> OK for US-ASCII, but even ISO-8859-1 should no more be used without explicit
> labelling (with meta-data or other means) of its encoding: ...

If possible, *all* text should have its charset specified in some way.

> I just wonder why Unicode still maintains that a BOM _should_ not be used in
> UTF-8 texts.

I believe that "Unicode" does not say that. It is rather a concern among users of Unicode-unaware
tools, such as classic Unix-y command-line tools, which are slow to add good Unicode support. You
are right that the signatures work quite well with more modern tools.

