Re: unicode on Linux

From: Markus Scherer (markus.scherer@jtcsv.com)
Date: Wed Oct 29 2003 - 10:49:25 CST

Next message: Rick McGowan: "Re: Hacek - Typing from a keyboard... Help!!!!"
Previous message: Markus Scherer: "Re: osmanya script transliteration"
In reply to: Philippe Verdy: "Re: unicode on Linux"
Next in thread: Shao, Yiying: "RE: unicode on Linux"

Philippe Verdy wrote:
> the input:determine strategy will work fine for UTF-8 or SCSU, provided that
> the leading BOM is explicitly encoded. ...

By "determine" I do not mean only checking for a BOM. There are several ways to determine the
input charset, depending on the protocol, the document type, etc., including but not limited to
a BOM, a protocol field, an in-document specification, and heuristics (guessing)...
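
As an illustration only, the methods listed above could be chained roughly as follows in Python. The
function name determine_charset, the order of the checks, the 1024-byte search window, the
declaration pattern, and the ISO 8859-1 default are all assumptions made for this sketch, not
something prescribed in this thread.

```python
import codecs
import re

# Hypothetical pattern: matches charset=... (HTML meta, MIME) or encoding="..." (XML).
_DECLARATION = re.compile(rb"""(?:charset|encoding)\s*=\s*["']?([A-Za-z0-9._:-]+)""")


def determine_charset(data, protocol_charset=None, default="iso-8859-1"):
    """Return (charset, how) for a byte string. The order of checks is an assumption."""
    # 1. Protocol field, e.g. the charset parameter of a Content-Type header.
    if protocol_charset:
        return protocol_charset.lower(), "protocol"
    # 2. Signature (only the UTF-8 one here, to keep the sketch short).
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8", "signature"
    # 3. In-document specification, searched in the first 1024 bytes.
    match = _DECLARATION.search(data[:1024])
    if match:
        return match.group(1).decode("ascii").lower(), "declaration"
    # 4. Heuristic: bytes that decode as strict UTF-8 are assumed to be UTF-8.
    try:
        data.decode("utf-8")
        return "utf-8", "heuristic"
    except UnicodeDecodeError:
        return default, "default"
```

Returning the method alongside the charset lets a caller decide how much to trust the answer; a
guess is weaker evidence than an explicit label.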

About the BOM, or more precisely the Unicode signature byte sequences: Despite a theoretical
ambiguity, they work quite well for discovering a Unicode charset, but unprepared, Unicode-unaware
tools may choke on them.
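
A minimal Python sketch of signature sniffing, for illustration. The name sniff_signature and the
returned charset labels are assumptions of this sketch. The table is ordered so that the UTF-32LE
signature (FF FE 00 00) is tried before the UTF-16LE one (FF FE) that it begins with, which is
one example of an ambiguity: the same four bytes could also be UTF-16LE text starting with U+0000.

```python
import codecs

# Longer signatures come before shorter ones that they start with.
_SIGNATURES = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),   # 00 00 FE FF
    (codecs.BOM_UTF32_LE, "utf-32-le"),   # FF FE 00 00
    (codecs.BOM_UTF8, "utf-8"),           # EF BB BF
    (b"\x0e\xfe\xff", "scsu"),            # SCSU signature; Python ships no SCSU codec
    (codecs.BOM_UTF16_BE, "utf-16-be"),   # FE FF
    (codecs.BOM_UTF16_LE, "utf-16-le"),   # FF FE
]


def sniff_signature(data):
    """Return (charset, signature_length), or (None, 0) if no signature is present."""
    for signature, charset in _SIGNATURES:
        if data.startswith(signature):
            return charset, len(signature)
    return None, 0
```

The returned length lets the caller skip the signature before handing the text to a tool that
would otherwise choke on it.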

> The idea that "if a text (without BOM) looks like valid UTF-8, then it is
> UTF-8; else it uses another legacy encoding" does not work in practice and
> also leads to too many false positives.

It may not work in all cases, but a heuristic that works in >95% or so of cases in practice
works quite well, as far as I am concerned.
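
The heuristic under discussion can be sketched in a few lines of Python. The names looks_like_utf8
and guess_charset and the ISO 8859-1 fallback are assumptions of this sketch; it relies on
Python's strict UTF-8 decoder, which rejects ill-formed sequences.

```python
def looks_like_utf8(data):
    """True if the bytes are well-formed UTF-8 (pure ASCII also qualifies)."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


def guess_charset(data, legacy="iso-8859-1"):
    """Guess a charset for unlabelled bytes; 'legacy' is an assumed fallback."""
    if data.isascii():
        return "us-ascii"   # no byte >= 0x80, so nothing to guess
    return "utf-8" if looks_like_utf8(data) else legacy
```

A false positive of the kind Philippe describes is still possible: the ISO 8859-1 bytes C3 A9
(the two characters "Ã©") are also well-formed UTF-8 for "é", so guess_charset reports them as
UTF-8. Such byte pairs are rare in real legacy text.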

>>- if you are absolutely certain that they suffice - use US-ASCII or ISO
>> 8859-1.
>
> OK for US-ASCII, but even ISO-8859-1 should no more be used without explicit
> labelling (with meta-data or other means) of its encoding: ...

If possible, *all* text should have its charset specified in some way.

> I just wonder why Unicode still maintains that a BOM _should_ not be used in
> UTF-8 texts.

I believe that "Unicode" does not say that. It is a concern among users of Unicode-unaware tools,
such as classic Unix-y command-line tools, which are slow to add good Unicode support. You are
right that the signatures work quite well with more modern tools.

markus

