Re: Non-ascii string processing?

From: jon@spin.ie
Date: Mon Oct 06 2003 - 10:09:52 CST

Next message: Marco Cimarosti: "RE: Non-ascii string processing?"
Previous message: John Delacour: "RE: Non-ascii string processing?"
Maybe in reply to: Theodore H. Smith: "Non-ascii string processing?"
Next in thread: Marco Cimarosti: "RE: Non-ascii string processing?"

> > a word like "élite" is always counted as five characters, regardless
> > that it might be encoded as six Unicode "characters".
>
> I assume that everybody on this list knows that you count characters
> only after a proper normalization... (like many operations on Unicode
> texts).

A word like "élite" will be counted as either five or six things, depending on just what the things are in a given context. Whether you call those things "characters" or not is another matter.

Normalisation might leave that string five or six Unicode characters long, depending on the normalisation form used. Although NFC makes characters and grapheme clusters coincide in this case, that does not hold for every use of combining characters, so counting the characters of NFC-normalised text is not a reliable way to count what a reader would perceive as characters.
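
To make the point concrete, here is a minimal sketch using Python's standard unicodedata module (my choice for illustration, nothing in this thread depends on it). It counts code points for "élite" under NFC and NFD, then does the same for "q" followed by COMBINING TILDE (U+0303), a sequence that has no precomposed form, so NFC leaves it as two code points even though a reader sees one character.

```python
import unicodedata

# "élite" written with the precomposed é (U+00E9)
word = "\u00e9lite"

nfc = unicodedata.normalize("NFC", word)  # é stays one code point
nfd = unicodedata.normalize("NFD", word)  # é becomes e + U+0301

nfc_len = len(nfc)  # 5 code points
nfd_len = len(nfd)  # 6 code points

# q + COMBINING TILDE has no precomposed equivalent, so NFC cannot
# merge the pair: one perceived character, two code points.
q_tilde = unicodedata.normalize("NFC", "q\u0303")
q_tilde_len = len(q_tilde)  # 2 code points

print(nfc_len, nfd_len, q_tilde_len)
```

Counting grapheme clusters properly needs the segmentation rules of UAX #29, which the Python standard library does not implement, so that part is left out here.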

However, a byte count is probably of even less use to an end user (except in so far as disk space and download times go, and there a rough estimate would serve their purposes). Both byte counts and Unicode-character counts have uses within the implementation of higher-level functionality, and as such both are required.
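
As a rough illustration of how far byte counts drift from any notion of "characters", the sketch below (again Python, with UTF-8 and UTF-16 picked as example encodings) measures the same word in both normalisation forms. The helper name byte_counts is hypothetical.

```python
import unicodedata

def byte_counts(text):
    """Return the size of text in bytes as (UTF-8, UTF-16 without BOM)."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))

word = "\u00e9lite"  # "élite" with precomposed é

nfc_bytes = byte_counts(unicodedata.normalize("NFC", word))
nfd_bytes = byte_counts(unicodedata.normalize("NFD", word))

print(nfc_bytes)  # (6, 10)
print(nfd_bytes)  # (7, 12)
```

One word therefore yields four different byte counts, none of which is five.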

> > 3) That is a very silly count anyway. If you want to have an idea of the
> > "size" of a document, lines or words are much more useful units.

To estimate the column-inches that will be used, characters are much more useful than words, and far more useful than lines (which will vary with column width, font, justification algorithm, etc.).


This archive was generated by hypermail 2.1.5 : Thu Jan 18 2007 - 15:54:24 CST