Re: Non-ascii string processing?

From: jon@spin.ie
Date: Mon Oct 06 2003 - 10:09:52 CST


> > a word like "Úlite" is always counted as five characters, regardless
> > that it might be encoded as six Unicode "characters".
>
> I assume that everybody on this list knows that you count characters
> only after a proper normalization... (like many operations on Unicode
> texts).

A word like "Úlite" will be counted as either five or six things, depending on just what the things are in a given context. Whether you call those things "characters" or not is another matter.

Normalisation might leave that string five or six Unicode characters long, depending on the normalisation form used. And although NFC makes characters and grapheme clusters coincide in this case, that does not hold for all uses of combining characters (many base-plus-mark combinations have no precomposed form), so counting the characters of NFC-normalised text is not a reliable way to count what a user would call characters.
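
To make the point concrete, here is a minimal sketch in Python using the standard unicodedata module. The helper count_base_chars is a hypothetical name of my own, and it only approximates grapheme clusters by skipping code points with a non-zero canonical combining class; it is not the full segmentation algorithm, just enough to show where the counts diverge.

```python
import unicodedata

word = "\u00dalite"  # "Úlite", written with the precomposed U+00DA

nfc = unicodedata.normalize("NFC", word)
nfd = unicodedata.normalize("NFD", word)

print(len(nfc))  # 5 code points: Ú l i t e
print(len(nfd))  # 6 code points: U, U+0301 (combining acute), l i t e

# "q" followed by a combining tilde has no precomposed form,
# so NFC cannot collapse it into a single code point.
q_tilde = unicodedata.normalize("NFC", "q\u0303")
print(len(q_tilde))  # 2, although a reader sees one "character"


def count_base_chars(text):
    """Hypothetical helper: rough stand-in for a grapheme-cluster count.

    Counts only code points whose canonical combining class is 0,
    which is an approximation, not proper cluster segmentation.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return sum(1 for ch in decomposed if unicodedata.combining(ch) == 0)


print(count_base_chars(nfc))      # 5
print(count_base_chars(nfd))      # 5
print(count_base_chars(q_tilde))  # 1
```

The code-point count changes with the normalisation form, while the approximate cluster count does not, and the q-with-tilde case shows NFC alone failing to give the user-visible count.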

However, a byte count is probably of even less use to an end user (except in so far as disk space and download times go, and there a rough estimate would serve their purposes). Both byte counts and Unicode-character counts have uses within the implementation of higher-level functionality, and as such both are required.
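
As a rough illustration of how far byte counts drift from anything a user would recognise, the following Python sketch encodes the same word in two encodings and two normalisation forms. The choice of UTF-8 and UTF-16 (little-endian, no byte-order mark) is my own assumption for the example; nothing above prescribes a particular encoding.

```python
import unicodedata

word_nfc = unicodedata.normalize("NFC", "\u00dalite")  # precomposed Ú
word_nfd = unicodedata.normalize("NFD", "\u00dalite")  # U + combining acute

# Byte counts depend on both the encoding and the normalisation form.
byte_counts = {
    ("NFC", "utf-8"): len(word_nfc.encode("utf-8")),
    ("NFD", "utf-8"): len(word_nfd.encode("utf-8")),
    ("NFC", "utf-16-le"): len(word_nfc.encode("utf-16-le")),
    ("NFD", "utf-16-le"): len(word_nfd.encode("utf-16-le")),
}

for (form, encoding), size in byte_counts.items():
    print(form, encoding, size)
```

Four different byte counts come out for one five-letter word, which is fine for sizing a buffer or estimating a download but tells the end user very little.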

> > 3) That is a very silly count anyway. If you want to have an idea of the
> > "size" of a document, lines or words are much more useful units.

To estimate the column-inches that a text will occupy, characters are much more useful than words, and far more useful than lines (which vary with column width, font, justification algorithm, etc.).


