Re: Non-ascii string processing?

From: Peter Kirk (peterkirk@qaya.org)
Date: Tue Oct 07 2003 - 07:22:52 CST


On 07/10/2003 05:29, Marco Cimarosti wrote:

>Peter Kirk wrote:
>
>
>>For i% = 1 to Len(utf8string$)
>> c$ = Mid(utf8string$, i%, 1)
>> Process c$
>>Next i%
>>
>>Such a loop would be more efficient in UTF-32 of course, but there is
>>still a real need for working with character counts.
>>
>>
>
>If the string type and functions of this Basic dialect are not Unicode-aware,
>then:
>
>- Len(s$) returns the number of *bytes* in the string;
>
>- Mid(s$, i%, 1) returns a single *byte*;
>
>- Your Process() subroutine won't work...
>
>If the string type and functions are Unicode aware (as, e.g., in Visual
>Basic or VBScript), then I'd expect that the actual internal representation
>is hidden from the programmer, hence it makes no sense to talk about a
>"UTF-8 string".
>
>_ Marco
>
You are correct, of course. I was assuming a Unicode-aware dialect of
Basic. But my variable names are no more guaranteed to be meaningful and
appropriate than are Unicode character names ;-) ; they are only
required to be distinct.
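
To make the distinction concrete, here is a minimal sketch in Python rather
than Basic. It rests on an assumption: Python's str stands in for the string
type of a Unicode-aware dialect, and bytes stands in for the string type of a
dialect that is not Unicode-aware. The sample word and the variable names are
only illustrative.

```python
# Assumption: str models a Unicode-aware Basic string, and bytes models
# a Basic string type that is not Unicode-aware.
text = "na\u00efve"                  # 5 characters; U+00EF is i with diaeresis
utf8_bytes = text.encode("utf-8")    # 6 bytes: U+00EF takes two bytes in UTF-8

# Unicode-aware case: Len counts characters, Mid returns a whole character.
chars = [text[i] for i in range(len(text))]

# Byte-oriented case: Len counts bytes, Mid returns a single byte.
units = [utf8_bytes[i:i + 1] for i in range(len(utf8_bytes))]

# The third item differs: a complete character in one case, and only the
# lead byte of a two-byte sequence in the other.
third_char = chars[2]
third_unit = units[2]
```

In the byte-oriented case the loop hands Process one byte at a time, so it
never sees the two-byte character as a unit.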

I could imagine a dialect of Basic which had separate string handling
functions for UTF-8 bytes and for characters. This is how the
Unicode-aware version of the SIL Consistent Changes stream editor works,
see http://www.sil.org/computing/catalog/show_software.asp?id=4.
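
As a rough illustration of what such a pair of function families might look
like, here is a sketch in Python. The names byte_len, byte_mid, char_len and
char_mid are hypothetical, the input is assumed to be well-formed UTF-8, and
this is not a description of how Consistent Changes itself is implemented.

```python
def is_lead_byte(b: int) -> bool:
    """True unless b is a UTF-8 continuation byte (bit pattern 10xxxxxx)."""
    return (b & 0xC0) != 0x80

def byte_len(data: bytes) -> int:
    """Length in bytes."""
    return len(data)

def byte_mid(data: bytes, start: int, count: int = 1) -> bytes:
    """Return count bytes starting at 1-based byte position start."""
    return data[start - 1:start - 1 + count]

def char_len(data: bytes) -> int:
    """Length in characters: every character has exactly one lead byte."""
    return sum(1 for b in data if is_lead_byte(b))

def char_mid(data: bytes, start: int, count: int = 1) -> bytes:
    """Return count characters starting at 1-based character position start.

    Assumes data is well-formed UTF-8; the result is still UTF-8 bytes.
    """
    # Byte offsets at which each character begins, plus an end sentinel.
    starts = [i for i, b in enumerate(data) if is_lead_byte(b)]
    starts.append(len(data))
    first = start - 1
    if first < 0 or first >= len(starts) - 1:
        return b""
    last = min(first + count, len(starts) - 1)
    return data[starts[first]:starts[last]]

# The loop from earlier in the thread, written with the character functions.
sample = "na\u00efve".encode("utf-8")
pieces = [char_mid(sample, i, 1) for i in range(1, char_len(sample) + 1)]
```

Each call to char_mid here rescans the bytes to find character boundaries,
which is why the same loop would be cheaper over UTF-32, where every character
occupies a fixed four bytes.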

-- 
Peter Kirk
peter@qaya.org (personal)
peterkirk@qaya.org (work)
http://www.qaya.org/

