Re: Non-ascii string processing?

From: Peter Kirk (peterkirk@qaya.org)
Date: Mon Oct 06 2003 - 06:16:05 CST


On 06/10/2003 03:09, Marco Cimarosti wrote:

>Doug Ewell wrote:
>
>
>>Depends on what "processing" you are talking about. Just to cite the
>>most obvious case, passing a non-ASCII, UTF-8 string to byte-oriented
>>strlen() will fail dramatically.
>>
>>
>
>Why? The purpose of strlen() is counting the number of *bytes* needed to
>store a certain string, and this works just as fine for UTF-8 as it does for
>SBCS's or DBCS's.
>
>What strlen() cannot do is count the number of *characters* in a string.
>But who cares? I can imagine very few situations where such
>information would be useful.
>
>_ Marco
>
>
>
>
This depends on what kind of operations you want to perform on the
text. Of course, if you are concerned only with storing and transmitting
the text, you don't need to count characters rather than bytes, except
that, as you mention in another posting, you may need to avoid splitting
strings in the middle of a character. There is actually a very simple
rule for avoiding that: never split immediately before a byte of the
form 10xxxxxx, since such a byte continues a multi-byte UTF-8 sequence.
But if you want to render the text, the rendering system needs to split
the text into characters at some point. And if you want to do the kinds
of processing that I as a linguist am interested in, you absolutely need
to work with characters rather than bytes, and it can be very important
to know the number of characters in a string, although that count can
vary depending on how the text is normalised.
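
To make the 10xxxxxx rule concrete, here is a minimal sketch in C. It
assumes the input is valid, NUL-terminated UTF-8, and the function names
(utf8_is_continuation, utf8_safe_split, utf8_count_code_points) are
invented for illustration rather than taken from any library. Note that
the count it returns is the number of code points, which is not
necessarily the number of characters a reader would perceive, for
exactly the normalisation reasons mentioned above.

```c
#include <stddef.h>

/* Nonzero if b is a UTF-8 continuation byte (bit pattern 10xxxxxx). */
static int utf8_is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/*
 * Return the largest split position <= want that does not fall
 * immediately before a continuation byte. s holds len bytes.
 * Assumes valid UTF-8 (hypothetical helper, not a library function).
 */
static size_t utf8_safe_split(const char *s, size_t len, size_t want)
{
    size_t pos = want < len ? want : len;

    /* Back up while the byte at pos would continue a sequence. */
    while (pos > 0 && pos < len &&
           utf8_is_continuation((unsigned char)s[pos]))
        pos--;
    return pos;
}

/*
 * Count code points by counting every byte that is NOT a
 * continuation byte. Assumes valid, NUL-terminated UTF-8.
 */
static size_t utf8_count_code_points(const char *s)
{
    size_t n = 0;

    for (; *s != '\0'; s++)
        if (!utf8_is_continuation((unsigned char)*s))
            n++;
    return n;
}
```

For the six-byte string "na\xC3\xAFve" (the word "naive" spelled with
i-diaeresis), strlen() reports 6, utf8_count_code_points() reports 5,
and a requested split at byte 3 is moved back to byte 2 so that the
two-byte sequence stays intact.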

-- 
Peter Kirk
peter@qaya.org (personal)
peterkirk@qaya.org (work)
http://www.qaya.org/

