Re: Non-ascii string processing?

From: 'Stephane Bortzmeyer'
Date: Mon Oct 06 2003 - 06:37:44 CST

On Mon, Oct 06, 2003 at 01:52:26PM +0200,
 Marco Cimarosti wrote
 a message of 51 lines which said:

> a word like "élite" is always counted as five characters, regardless
> that it might be encoded as six Unicode "characters".

I assume that everybody on this list knows that you count characters
only after a proper normalization... (like many operations on Unicode

> 3) That is a very silly count anyway. If you want to have an idea of the
> "size" of a document, lines or words are much more useful units.

Tell that to the editor (editors of paper publications still talk in
this unit: "3 000 characters, no more, for tomorrow morning").

> OK. But the length in "characters" of a string is not "character semantics":
> it's plain nonsense, IMHO.

I disagree.
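
To make the normalization point concrete, here is a minimal sketch in
Python (my choice of language for illustration; nobody in the thread
named one). The helper name count_characters is hypothetical. It
assumes that "character" means a code point after NFC normalization,
which is enough for the "élite" example but will still overcount
sequences that have no precomposed form.

```python
import unicodedata


def count_characters(text: str) -> int:
    """Count code points after NFC normalization (hypothetical helper)."""
    return len(unicodedata.normalize("NFC", text))


# The same word, encoded two ways.
precomposed = "\u00e9lite"   # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301lite"   # "e" followed by U+0301 COMBINING ACUTE ACCENT

# Raw code point counts differ between the two encodings.
print(len(precomposed), len(decomposed))

# After normalization, both give the count an editor would expect.
print(count_characters(precomposed), count_characters(decomposed))
```

For text where no precomposed form exists, counting grapheme clusters
rather than normalized code points would be the closer match to what
an editor means by "characters".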

