Re: unicode on Linux

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Wed Oct 29 2003 - 05:40:19 CST


----- Original Message -----
From: "Markus Scherer" <markus.scherer@jtcsv.com>
To: "unicode" <unicode@unicode.org>
Sent: Tuesday, October 28, 2003 11:35 PM
Subject: Re: unicode on Linux

> You should use Unicode internally - UTF-16 when you use ICU or most other
> libraries and software.
>
> Externally, that is for protocols and files and other data exchange, you
> need to identify (input: determine; output: label) the encoding of the
> data and convert between it and Unicode.
> If you can choose the output encoding, then stay with one of the Unicode
> charsets (UTF-8 or SCSU etc.), or else

The input:determine strategy will work fine for UTF-8 or SCSU, provided
that the leading BOM is explicitly encoded. I know that this is not
recommended (at least for UTF-8), but I have several examples of files
encoded in UTF-8 without a BOM that fail to be identified correctly as
UTF-8. Without a BOM, the result of the automatic determination is still
quite random and depends on the content of the text.
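
For illustration, here is a minimal signature check in Python (the helper
name and list layout are mine; the byte sequences themselves are the
standard BOM signatures of each charset):

# A sketch of BOM-based signature detection (helper name is illustrative).
BOM_SIGNATURES = [
    (b'\xef\xbb\xbf', 'UTF-8'),
    (b'\xff\xfe\x00\x00', 'UTF-32LE'),   # test before UTF-16LE (shared prefix)
    (b'\x00\x00\xfe\xff', 'UTF-32BE'),
    (b'\xff\xfe', 'UTF-16LE'),
    (b'\xfe\xff', 'UTF-16BE'),
    (b'\x0e\xfe\xff', 'SCSU'),
]

def detect_by_bom(data: bytes):
    """Return the charset named by a leading BOM, or None when there is no BOM."""
    for signature, name in BOM_SIGNATURES:
        if data.startswith(signature):
            return name
    return None   # no BOM: we are back to guessing from the content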

The idea that "if a text (without BOM) looks like valid UTF-8, then it is
UTF-8; else it uses some other legacy encoding" does not work in practice
and leads to too many false positives.
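
A concrete example in Python (the sample string is mine): a two-character
ISO-8859-1 text made only of non-letter symbols happens to be well-formed
UTF-8, so the "looks valid, therefore UTF-8" rule misfires:

# A short ISO-8859-1 text made only of non-letter Latin-1 characters:
# the multiplication sign (0xD7) followed by the plus-minus sign (0xB1).
latin1_text = '\u00d7\u00b1'.encode('iso-8859-1')   # bytes D7 B1
# It is also a well-formed UTF-8 sequence, so the naive rule picks UTF-8:
print(latin1_text.decode('utf-8'))   # U+05F1 HEBREW LIGATURE YIDDISH VAV YOD, not the intended text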

> - if you are absolutely certain that they suffice - use US-ASCII or ISO
> 8859-1.

OK for US-ASCII, but even ISO-8859-1 should no longer be used without an
explicit label for its encoding (as meta-data or by other means): here too
we run into problems if the text happens to look like valid UTF-8
(although such cases are rarer).

However, the way UTF-8 places trailing bytes in the range 0x80 to 0xBF
after leading bytes in the range starting at 0xC0 is, statistically, a
good indicator in many texts for deciding whether they are ISO-8859-1 or
UTF-8: read as ISO-8859-1, such sequences would appear as an accented
letter (mostly uppercase for lead bytes 0xC0-0xDF, lowercase for 0xE0 and
above) immediately followed by one or more C1 controls or non-letter
symbols, something that is not impossible but very unlikely in real text.
The presence of the BOM likewise creates a sequence that is valid in
ISO-8859-1 but extremely unlikely in actual texts. Exceptions exist, but
they will mostly occur in very short texts containing extended Latin-1
characters that are not letters.
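
Both effects are easy to see in a couple of lines of Python (purely
illustrative):

# How those byte patterns look from the ISO-8859-1 side.
utf8_bytes = 'déjà vu'.encode('utf-8')        # b'd\xc3\xa9j\xc3\xa0 vu'
print(utf8_bytes.decode('iso-8859-1'))        # 'dÃ©jÃ\xa0 vu': capital letter + symbol pairs
print(b'\xef\xbb\xbf'.decode('iso-8859-1'))   # the UTF-8 BOM reads as 'ï»¿' in Latin-1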

That's why an algorithm that tries to guess (in the absence of explicit
labelling) whether a text is UTF-8 or ISO-8859-1 should always assume it
is UTF-8 if it validates under the strict UTF-8 encoding rules. Some
problems do exist, however, with the relaxed rules for UTF-8 as defined in
the earlier IESG RFC. Texts that are valid only under that old version of
the UTF-8 encoding still exist today (and may persist for some time in
relational databases that were fed with them and never rescanned for
re-encoding).
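
Under those assumptions, the guesser itself can be just a few lines (the
function name is mine; note that Python's strict 'utf-8' codec already
rejects the overlong forms that the old relaxed rules tolerated):

def guess_utf8_or_latin1(data: bytes) -> str:
    """Guess between UTF-8 and ISO-8859-1 when no label is available (sketch)."""
    if data.startswith(b'\xef\xbb\xbf'):
        return 'UTF-8'          # explicit BOM: no guessing needed
    try:
        data.decode('utf-8')    # strict rules: overlong forms, surrogates and
        return 'UTF-8'          # values above U+10FFFF are all rejected
    except UnicodeDecodeError:
        return 'ISO-8859-1'     # any byte sequence is formally valid Latin-1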

I just wonder why Unicode still maintains that a BOM _should_ not be used
in UTF-8 texts. I think the opposite: wherever the BOM cannot cause
problems, such as in complete text files that are transported natively
without any explicit encoding label, it should be used. If the plain text
is self-labelled (as in an XML or XHTML source file, but not in HTML 4
files, even those that use a <meta> tag), the leading BOM may be omitted.
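
For what it's worth, emitting and honouring the BOM is trivial in most
environments; in Python, for example, the 'utf-8-sig' codec handles both
directions (illustrative only, the file name is arbitrary):

# Write a plain-text file with a leading UTF-8 BOM, then read it back;
# 'utf-8-sig' adds the BOM on output and strips it on input.
with open('notes.txt', 'w', encoding='utf-8-sig') as f:
    f.write('naïve café\n')

with open('notes.txt', 'r', encoding='utf-8-sig') as f:
    print(f.read())   # BOM is gone from the decoded text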

The case where the BOM should not be used is when the text must be kept
very short (but in that case the environment in which such short strings
are used should provide a way to specify and transport the encoding label
as part of its basic protocol). This applies, for example, to individual
table fields in databases.


