Re: unicode on Linux

From: Markus Scherer (markus.scherer@jtcsv.com)
Date: Wed Oct 29 2003 - 10:49:25 CST


Philippe Verdy wrote:
> the input:determine strategy will work fine for UTF-8 or SCSU, provided that
> the leading BOM is explicitly encoded. ...

By "determine" I do not mean only checking for a BOM. There are several ways to
determine the input charset, depending on the protocol, document type, etc., including but not
limited to a BOM, a protocol field, an in-document specification, and heuristics (guessing).
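As a rough illustration of such a "determine" step, here is a minimal Python sketch that tries each source of information in turn. The function name `determine_charset`, its parameters, the priority order, and the ISO-8859-1 fallback are all assumptions made for this example; a real protocol or document type may prescribe a different order and a stricter in-document syntax.

```python
import codecs
import re

def determine_charset(data, protocol_charset=None, default="ISO-8859-1"):
    """Return (charset, how_it_was_found) for a byte string.

    Hypothetical helper: the order of the checks is an assumption.
    """
    # 1. Unicode signature (BOM). UTF-32 first, because the UTF-32LE
    #    signature starts with the same two bytes as the UTF-16LE one.
    if data.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return "UTF-32", "signature"
    if data.startswith(codecs.BOM_UTF8):
        return "UTF-8", "signature"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "UTF-16", "signature"

    # 2. Protocol field, e.g. a charset parameter supplied by the transport.
    if protocol_charset:
        return protocol_charset, "protocol"

    # 3. In-document specification. This loose pattern only looks for
    #    encoding=... or charset=... near the start of an ASCII-compatible file.
    m = re.search(rb"""(?:encoding|charset)\s*=\s*["']?([A-Za-z0-9._-]+)""",
                  data[:1024])
    if m:
        return m.group(1).decode("ascii"), "in-document"

    # 4. Heuristic (guessing): does it decode as strict UTF-8?
    try:
        data.decode("utf-8")
        return "UTF-8", "heuristic"
    except UnicodeDecodeError:
        return default, "fallback"
```

For example, `determine_charset(b"hi", protocol_charset="KOI8-R")` reports the protocol field, while a byte string with no label of any kind falls through to the guess.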

About the BOM, or more precisely the Unicode signature byte sequences: Despite a theoretical
ambiguity, it works quite well for discovering a Unicode charset, but unprepared and Unicode-unaware
tools may choke on it.
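To make the signature check concrete, here is a small Python sketch that matches the leading bytes against the known signature byte sequences. The name `sniff_signature` and the returned labels are assumptions for this example. The BOM constants come from Python's standard `codecs` module; Python ships no SCSU codec, so "SCSU" here is only a label.

```python
import codecs

# UTF-32 entries come first: the UTF-32LE signature (FF FE 00 00) begins
# with the UTF-16LE signature (FF FE), so a UTF-16LE text whose first
# character is U+0000 cannot be told apart from UTF-32LE by signature alone.
SIGNATURES = [
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF8, "UTF-8"),        # EF BB BF
    (b"\x0e\xfe\xff", "SCSU"),         # SCSU signature
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
]

def sniff_signature(data):
    """Return (charset_label, signature_length), or (None, 0) if no match."""
    for signature, label in SIGNATURES:
        if data.startswith(signature):
            return label, len(signature)
    return None, 0
```

A caller would typically skip `signature_length` bytes before decoding, which is exactly the step that Unicode-unaware tools do not perform.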

> The idea that "if a text (without BOM) looks like valid UTF-8, then it is
> UTF-8; else it uses another legacy encoding" does not work in practice and
> also leads to too many false positives.

It may not work in all cases, but a method that works in >95% or so of cases in practice works
quite well, it seems to me.
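The heuristic under discussion can be sketched in a few lines of Python. The function name `guess_charset` and the ISO-8859-1 fallback are assumptions for this example; the strict decoders are the standard ones.

```python
def guess_charset(data, fallback="ISO-8859-1"):
    """Guess a charset for unlabelled bytes without a signature.

    Hypothetical helper: pure ASCII is reported as US-ASCII, anything
    else that is well-formed UTF-8 as UTF-8, and the rest as `fallback`.
    """
    for name in ("US-ASCII", "UTF-8"):
        try:
            data.decode(name)  # strict decoding: raises on ill-formed input
            return name
        except UnicodeDecodeError:
            continue
    return fallback
```

The false positives arise when legacy text happens to be well-formed UTF-8: the ISO-8859-1 bytes C3 A9 ("Ã©") are also the UTF-8 encoding of "é", so `guess_charset(b"\xc3\xa9")` answers UTF-8. A lone byte E9 ("é" in ISO-8859-1) is ill-formed UTF-8 and falls through to the fallback.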

>>- if you are absolutely certain that they suffice - use US-ASCII or ISO
>> 8859-1.
>
> OK for US-ASCII, but even ISO-8859-1 should no more be used without explicit
> labelling (with meta-data or other means) of its encoding: ...

If possible, *all* text should have its charset specified in some way.

> I just wonder why Unicode still maintains that a BOM _should_ not be used in
> UTF-8 texts.

I believe that "Unicode" does not say that. The concern comes from users of Unicode-unaware tools,
such as classic Unix-y command-line tools, which are slow to add good Unicode support. You are right that
the signatures work quite well with more modern tools.

markus


