RE: Non-ascii string processing?

From: Marco Cimarosti (marco.cimarosti@essetre.it)
Date: Mon Oct 06 2003 - 04:50:09 CST


Theodore H. Smith wrote:
> Hi lists,

Hi, member.

> I'm wondering how people tend to do their non-ascii string processing.

I think no one has been doing ASCII string processing for decades. :-) But I
guess you meant non-SBCS ("single byte character set") string processing.

> [...]
> So, I'm wondering, in fact, is there ANY code that needs
> explicit UTF8 processing? Heres a few I've thought of.

In general, you need UTF-32 (or at least code to decode UTF-8 into code
points) whenever you need to access the single *characters* in a string.
This is needed for all kinds of lexical or typographic processing, e.g.:

- case matching or conversion ("â" vs. "Â");

- loose matching ("â" vs. "a");

- displaying the text.
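
For operations like those listed above, the UTF-8 bytes first have to be
turned into code points. Here is a minimal sketch of that step in C. The
names utf8_decode() and utf8_to_utf32() are hypothetical helpers made up for
this illustration, not functions from any standard library, and the sketch
only decodes; case mapping, loose matching and rendering would be built on
top of the resulting code points.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence starting at s (n bytes available) into *cp.
   Returns the number of bytes consumed, or 0 on malformed input.
   Hypothetical helper, not a standard library function. */
size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *cp)
{
    size_t len = 0, i;
    uint32_t c = 0, min = 0;

    assert(s != NULL && cp != NULL);
    if (n == 0) return 0;

    if (s[0] < 0x80)                { *cp = s[0]; return 1; }
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; c = s[0] & 0x1F; min = 0x80; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; c = s[0] & 0x0F; min = 0x800; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; c = s[0] & 0x07; min = 0x10000; }
    else return 0;                  /* stray trail byte or invalid lead byte */

    if (n < len) return 0;          /* sequence cut short */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;    /* not a trail byte */
        c = (c << 6) | (uint32_t)(s[i] & 0x3F);
    }
    /* Reject overlong forms, surrogates and values beyond U+10FFFF. */
    if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF)) return 0;

    *cp = c;
    return len;
}

/* Convert srclen bytes of UTF-8 into UTF-32 code points in dst.
   Returns the number of code points written, or (size_t)-1 if the input
   is malformed or dst (capacity dstcap) is too small. */
size_t utf8_to_utf32(const char *src, size_t srclen,
                     uint32_t *dst, size_t dstcap)
{
    const unsigned char *p = (const unsigned char *)src;
    size_t i = 0, count = 0;

    while (i < srclen) {
        uint32_t cp;
        size_t used = utf8_decode(p + i, srclen - i, &cp);
        if (used == 0 || count == dstcap)
            return (size_t)-1;
        dst[count++] = cp;
        i += used;
    }
    return count;
}
```

For instance, the three bytes 61 C3 A2 ("aâ") come out as the two code
points U+0061 and U+00E2, which can then be compared or case-mapped one
character at a time.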

> Can anyone tell me any more? Please feel free to go into great detail
> in your answers. The more detail the better.

There is at least one case in which you need UTF-8-aware code even if not
accessing single characters: it is when you *trim* a string at an arbitrary
byte position. E.g.:

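        char str1 [9] = "abc";
        const char * str2 = "αßγ";      /* 6 bytes in UTF-8 */

        /* 3rd argument: bytes still free in str1, less the terminator (5) */
        strncat(str1, str2, sizeof(str1) - strlen(str1) - 1);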

If strncat() is UTF-8 aware: str1 will be "abcαß" + null terminator (8
bytes). But if strncat() is *not* UTF-8 aware, str1 will contain an invalid
UTF-8 string: "abcαß" + an *illegal* lone lead byte (0xCE) + null terminator.
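
A UTF-8-aware append can be sketched as follows, assuming the source string
is valid UTF-8. The idea is to look at the first byte that would be dropped:
if it is a trail byte (10xxxxxx), the cut falls inside a sequence, so back
up to the lead byte. Both utf8_prefix_len() and utf8_append() are
hypothetical names invented for this example; nothing like them exists in
the standard C library.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Largest byte count <= max_bytes that ends on a character boundary of s.
   Assumes s is valid, NUL-terminated UTF-8. Hypothetical helper. */
size_t utf8_prefix_len(const char *s, size_t max_bytes)
{
    size_t len;

    assert(s != NULL);
    len = strlen(s);
    if (len <= max_bytes)
        return len;                 /* the whole string fits */

    len = max_bytes;
    /* s[len] is the first byte that would be dropped. While it is a trail
       byte, the cut is inside a multi-byte sequence: move back. */
    while (len > 0 && ((unsigned char)s[len] & 0xC0) == 0x80)
        len--;
    return len;
}

/* Append as much of src to dst as fits in dstsize bytes (terminator
   included) without splitting a UTF-8 sequence. Assumes dst already holds
   a NUL-terminated string. Returns dst. Hypothetical helper. */
char *utf8_append(char *dst, const char *src, size_t dstsize)
{
    size_t used, room, n;

    assert(dst != NULL && src != NULL);
    used = strlen(dst);
    if (used + 1 >= dstsize)
        return dst;                 /* no room for even one more byte */

    room = dstsize - used - 1;      /* keep one byte for the terminator */
    n = utf8_prefix_len(src, room);
    memcpy(dst + used, src, n);
    dst[used + n] = '\0';
    return dst;
}
```

With the buffer from the example, utf8_append(str1, str2, sizeof(str1))
has 5 bytes of room, backs up over the split third character, copies 4
bytes, and leaves str1 as "abcαß" plus the null terminator (8 bytes).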

_ Marco


