From: Marco Cimarosti (marco.cimarosti@essetre.it)
Date: Mon Oct 06 2003 - 04:50:09 CST
Theodore H. Smith wrote:
> Hi lists,
Hi, member.
> I'm wondering how people tend to do their non-ascii string processing.
I think no one has been doing ASCII string processing for decades. :-) But I
guess you meant non-SBCS ("single byte character set") string processing.
> [...]
> So, I'm wondering, in fact, is there ANY code that needs
> explicit UTF8 processing? Heres a few I've thought of.
In general, you need to convert to UTF-32 (or otherwise decode the UTF-8)
whenever you need to access the individual *characters* in a string. This is
needed for all kinds of lexical or
typographic processing, e.g.:
- case matching or conversion ("â" vs. "Â");
- loose matching ("â" vs. "a");
- displaying the text;
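To make the "access the individual characters" step concrete, here is a rough sketch of decoding UTF-8 into UTF-32 code points. The names utf8_decode_one() and utf8_to_utf32() are made up for this illustration, not taken from any library, and the decoder is deliberately simplified: it checks lead and continuation bytes but does not reject overlong forms or surrogates, which a production decoder should do.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one code point from the UTF-8 string s.
   Stores the code point in *cp and returns the number of bytes consumed,
   or 0 if the sequence is truncated or malformed. Simplified sketch: it
   does not reject overlong forms or surrogates. */
size_t utf8_decode_one(const unsigned char *s, uint32_t *cp)
{
    assert(s != NULL && cp != NULL);
    size_t len;
    if (s[0] < 0x80)                { *cp = s[0];        len = 1; }
    else if ((s[0] & 0xE0) == 0xC0) { *cp = s[0] & 0x1F; len = 2; }
    else if ((s[0] & 0xF0) == 0xE0) { *cp = s[0] & 0x0F; len = 3; }
    else if ((s[0] & 0xF8) == 0xF0) { *cp = s[0] & 0x07; len = 4; }
    else return 0;              /* stray continuation or invalid lead byte */
    for (size_t i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;   /* also catches an early '\0' */
        *cp = (*cp << 6) | (s[i] & 0x3F);
    }
    return len;
}

/* Hypothetical helper: decode a NUL-terminated UTF-8 string into out[],
   writing at most max code points. Returns the number of code points
   written; decoding stops at the first bad sequence. */
size_t utf8_to_utf32(const char *s, uint32_t *out, size_t max)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t n = 0;
    while (*p != '\0' && n < max) {
        size_t len = utf8_decode_one(p, &out[n]);
        if (len == 0) break;
        p += len;
        n++;
    }
    return n;
}
```

Once the text is in this form, "â" is the single value 0xE2 and "Â" is the single value 0xC2, so case or loose matching can work on whole characters instead of on the two bytes (C3 A2 and C3 82) that each of them takes in UTF-8.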
> Can anyone tell me any more? Please feel free to go into great detail
> in your answers. The more detail the better.
There is at least one case in which you need UTF-8-aware code even when you
are not accessing individual characters: when you *truncate* a string at an
arbitrary byte position. E.g.:
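char str1[9] = "abc";
const char *str2 = "αßγ";   /* 6 bytes in UTF-8: CE B1 C3 9F CE B3 */
/* strncat()'s third argument is the maximum number of bytes to append,
   not the size of the destination: 9 - 3 - 1 leaves room for 5 bytes. */
strncat(str1, str2, sizeof(str1) - strlen(str1) - 1);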
If strncat() is UTF-8 aware, str1 will be "abcαß" + null terminator (8
bytes). But if strncat() is *not* UTF-8 aware, str1 will contain an invalid
UTF-8 string: "abcαß" + an *illegal* byte (0xCE, the first byte of "γ"
without its second byte) + null terminator (9 bytes).
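For what it's worth, the standard strncat() is byte-oriented, so the "UTF-8 aware" behaviour has to come from your own code. Below is a minimal sketch of such an append, assuming both strings already hold valid UTF-8. The name utf8_strlcat() is hypothetical (it is not a standard or library function), and unlike strncat() it takes the total size of the destination buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not a standard function: append src to dst, where
   dst_size is the total size of the dst buffer in bytes. Copies as many
   whole UTF-8 sequences as fit and never splits one. Assumes both strings
   hold valid UTF-8. Returns the number of bytes appended. */
size_t utf8_strlcat(char *dst, const char *src, size_t dst_size)
{
    assert(dst != NULL && src != NULL);
    size_t used = strlen(dst);
    if (used + 1 >= dst_size) return 0;      /* no room left */
    size_t room = dst_size - used - 1;       /* keep one byte for '\0' */
    size_t n = strlen(src);
    if (n > room) {
        n = room;
        /* src[n] is the first byte that will NOT be copied. If it is a
           continuation byte (10xxxxxx), the cut falls inside a sequence,
           so back up to that sequence's lead byte. */
        while (n > 0 && ((unsigned char)src[n] & 0xC0) == 0x80)
            n--;
    }
    memcpy(dst + used, src, n);
    dst[used + n] = '\0';
    return n;
}
```

With the buffer from the example, utf8_strlcat(str1, str2, sizeof(str1)) has room for 5 bytes, sees that the cut would fall between the two bytes of "γ", backs up one byte and appends only "αß" (4 bytes), leaving "abcαß" + null terminator in str1.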
_ Marco