Re: Non-ascii string processing?

From: Edward H. Trager
Date: Mon Oct 06 2003 - 15:45:23 CST

On Monday 2003.10.06 21:36:13 +0200, Marco Cimarosti wrote:
> Edward H. Trager wrote:
> > > But I still don't see any use in knowing how many characters are in an UTF-8
> > > string, apart the use that I already mentioned: allocating a buffer for a
> > > UTF-8 to UTF-32 conversion.
> >
> > Well, I know a good use for it: a console or terminal-based
> > application which displays information using fixed-width
> > fonts in a tabular form, such as a subset of records from
> > a database table. To calculate how wide to display each
> > column, knowing the maximum number of characters in the
> > strings for each column is a useful starting place.
> Well, I am just about to start a time consuming task: fixing an application
> which was based on the assumption the number of characters in a string was
> good "starting place" to format tabular text in a fixed width font...
> You have already explained why this can't work when CJK or other scripts pop
> in.
> What you really need for such a thing is a function which computes the
> "width" of a string in terms of display units, rather than its length in
> term of characters.

Yes, I agree. I also need such a function. Do you, Marco, or anyone else, know which function(s)
provide this service? (In my case, something Open Source or GPLed would be ideal, but ICU
would be too heavy.) My application started out life in a sheltered ASCII-only
childhood, and now needs to move to the bigger UTF-8 world out there. Fortunately,
it is quite capable of succeeding in that world, but I haven't even started working
on the on-screen table formatting issue yet, for exactly this reason.

Actually, I believe that if I have to write something myself, making it work for the
Latin-with-combining-diacritics and CJK cases would not be too hard. After that, however,
it seems that one would have to work on a script-by-script basis to get it to really
work properly. If it were only a case of Arabic, that would be one thing, but when one
looks at the Indic and Indic-derived scripts ... well, there are a lot of Indic and Indic-derived
scripts! Not that it is hard, but it would certainly take time, and I haven't done an ounce
of research yet to find out whether somebody has done it already or not ...
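
A minimal sketch of the two "not too hard" cases (Latin with combining diacritics, and CJK),
assuming Python's standard unicodedata module and the usual terminal convention that East Asian
Wide and Fullwidth characters take two cells. The names display_width and pad_cell are
hypothetical, and nothing here attempts Arabic shaping or Indic conjuncts:

```python
import unicodedata


def display_width(text: str) -> int:
    """Approximate the number of fixed-width terminal cells needed for text.

    Assumption: combining marks and format characters take 0 cells,
    East Asian Wide/Fullwidth characters take 2, everything else takes 1.
    """
    width = 0
    for ch in text:
        # Combining diacritics (e.g. U+0301) stack on the previous character.
        if unicodedata.combining(ch) != 0:
            continue
        # Other nonspacing/enclosing marks and format controls are zero-width too.
        if unicodedata.category(ch) in ("Mn", "Me", "Cf"):
            continue
        # "W" (Wide) and "F" (Fullwidth) cover the common CJK cases.
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            width += 2
        else:
            width += 1
    return width


def pad_cell(text: str, cells: int) -> str:
    """Right-pad text with spaces so it occupies at least `cells` cells."""
    return text + " " * max(0, cells - display_width(text))
```

A column would then be sized as the maximum display_width over its strings, and each value
padded with pad_cell, rather than relying on the character count.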
