RE: Non-ascii string processing?

From: Marco Cimarosti (
Date: Mon Oct 06 2003 - 05:52:26 CST

Stephane Bortzmeyer wrote:
> On Mon, Oct 06, 2003 at 12:09:34PM +0200,
> Marco Cimarosti <> wrote
> a message of 14 lines which said:
> > What strlen() cannot do is counting the number of *characters* in a
> > string. But who cares? I can imagine very few situations where such
> > information would be useful.
> It is one thing to explain that strlen() has byte semantics and not
> character semantics. It is another to assume that character semantics
> are useless.

I never said that character semantics are useless: I said that it is almost
always useless to count the *number* of Unicode characters in a string.

One of the few cases in which such a count could be useful is to
pre-allocate a buffer for a UTF-8 to UTF-32 conversion. But there is no
need for a general-purpose API function for such a special need.
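
To make the buffer case concrete, here is a minimal sketch in Python rather
than C. The function names are hypothetical, the input is assumed to be
well-formed UTF-8, and the little-endian output order is an arbitrary choice
for the illustration. It counts code points by skipping UTF-8 continuation
bytes and uses that count only to size the UTF-32 buffer:

```python
def utf8_code_point_count(data: bytes) -> int:
    """Count code points in well-formed UTF-8 (hypothetical helper)."""
    # Continuation bytes look like 10xxxxxx; every other byte starts a code point.
    return sum(1 for b in data if (b & 0xC0) != 0x80)


def utf8_to_utf32(data: bytes) -> bytes:
    """Convert UTF-8 to little-endian UTF-32 using a pre-allocated buffer."""
    count = utf8_code_point_count(data)
    out = bytearray(4 * count)  # exactly 4 bytes per code point
    for i, ch in enumerate(data.decode("utf-8")):
        out[4 * i:4 * i + 4] = ord(ch).to_bytes(4, "little")
    return bytes(out)


data = "\u00e9lite".encode("utf-8")  # precomposed e-acute, 2 bytes in UTF-8
print(len(data), utf8_code_point_count(data))  # 6 bytes, 5 code points
```

The byte length (what strlen() would report) and the code point count differ
here, but the count is needed only once, by the converter itself.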

> Most text-processing software allows you to count the
> number of characters in a document, for instance.

Yes. And:

1) That is a very special need of a very special kind of application (a word
processor), so it doesn't justify a general purpose API function for that:
people don't normally write word processors every day.

2) That count cannot be done by counting Unicode "characters" (i.e.,
encoding units): you have to count the objects that the user perceives as
"typographical characters". E.g., control or formatting "characters" should
be ignored, sequences of two or more space "characters" should be counted as
one, and a word like "élite" is always counted as five characters,
even though it might be encoded as six Unicode "characters". In an Indic
or Korean text, each syllable counts as a single character, although it may
be encoded as a long sequence of Unicode "characters".
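
The "élite" case can be sketched as follows, again in Python with a
hypothetical function name. This is only an approximation: it normalizes to
NFC and skips combining marks, whereas a real count of user-perceived
characters needs full grapheme cluster segmentation (as described in UAX #29),
which the Python standard library does not provide. It also does not collapse
runs of spaces or drop formatting characters.

```python
import unicodedata


def approx_perceived_length(text: str) -> int:
    """Rough count of user-perceived characters (hypothetical helper)."""
    # NFC composes sequences such as e + U+0301 or Hangul jamo where possible.
    composed = unicodedata.normalize("NFC", text)
    # Skip any combining marks that have no precomposed form.
    return sum(1 for ch in composed if unicodedata.combining(ch) == 0)


decomposed = "e\u0301lite"  # "e" followed by COMBINING ACUTE ACCENT
print(len(decomposed), approx_perceived_length(decomposed))  # 6 5
```

A Korean syllable written as three conjoining jamo (U+1112, U+1161, U+11AB)
is three code points but comes out as one under this approximation. Many
Indic syllables would still be miscounted, because they contain code points
that are not combining marks.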

3) That is a very silly count anyway. If you want to have an idea of the
"size" of a document, lines or words are much more useful units.

> Any decent Unicode programming environment should give you two
> functions, one for byte semantics and one for character
> semantics. Both are useful.

OK. But the length in "characters" of a string is not "character semantics":
it's plain nonsense, IMHO.

_ Marco
