Re: Non-ascii string processing?

From: Doug Ewell (dewell@adelphia.net)
Date: Sun Oct 05 2003 - 17:10:48 CST


Theodore H. Smith <delete at elfdata dot com> wrote:

>> If you really aren't processing anything but the ASCII characters
>> within your strings, like "<" and ">" in your example, you can
>> probably get away with keeping your existing byte-oriented code.
>> At least you won't get false matches on the ASCII characters (this
>> was a primary design goal of UTF-8).
>
> Yes, and in fact, UTF8 doesn't generate any false matches when
> searching for a valid UTF8 string, within another valid UTF8 string.
>
> In fact, if there is UTF8 between the < and >, the processing works
> just fine.

Depends on what "processing" you are talking about. Just to cite the
most obvious case, passing a non-ASCII UTF-8 string to byte-oriented
strlen() gives you the number of bytes, not the number of characters.
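
A rough illustration of the mismatch, sketched in Python rather than C.
The assumption here is that len() on a bytes object stands in for what
a byte-oriented strlen() would report; the variable names are mine.

```python
# Assumption: len() on bytes plays the role of C's byte-oriented strlen().
text = "<\u00e5>"                 # U+003C, U+00E5, U+003E
utf8 = text.encode("utf-8")       # the bytes 3C C3 A5 3E

byte_count = len(utf8)            # what strlen() would report
char_count = len(text)            # number of code points

print(byte_count, char_count)     # prints "4 3"
```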

>> However, if your goal is to simplify processing of arbitrary UTF-8
>> text, including non-ASCII characters, I haven't found a better way
>> than to read in the UTF-8, convert it on the fly to UTF-32, and THEN
>> do your processing on the fixed-width UTF-32. That way you don't
>> have to do one thing for Basic Latin characters and something else
>> for the rest.
>
> Well, I can do most processing just fine, as I said. I only have a
> problem with lexical string processing (A = å), or spell checking. And
> in fact, lexical string processing is already so complex, it probably
> won't make much difference with UTF32 or UTF8, because of conjoining
> characters and that.

You mean, it's so complex to keep track of canonical equivalences, we
might as well just treat it all as a sequence of isolated bytes?
Doesn't sound like Unicode text processing to me.
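
To make the canonical-equivalence point concrete, here is a minimal
sketch using Python's standard unicodedata module. The two spellings of
"a with ring" are canonically equivalent, yet a byte-for-byte comparison
of their UTF-8 forms says they differ. The variable names are only for
illustration.

```python
import unicodedata

precomposed = "\u00e5"        # LATIN SMALL LETTER A WITH RING ABOVE
decomposed = "a\u030a"        # "a" followed by COMBINING RING ABOVE

# As raw UTF-8 bytes the two spellings look unrelated.
bytes_equal = precomposed.encode("utf-8") == decomposed.encode("utf-8")

# After normalizing both to NFC they compare equal.
nfc_equal = (unicodedata.normalize("NFC", precomposed)
             == unicodedata.normalize("NFC", decomposed))

print(bytes_equal, nfc_equal)  # prints "False True"
```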

> Unfortunately, I'm more concerned about the speed of converting the
> UTF8 to UTF32, and back. This is because usually, I can process my
> UTF8 with byte functions.

Check your assumptions about speed again. Converting between UTF-8 and
Unicode scalar values really isn't a computationally expensive
operation. It's best to do some profiling before assuming UTF-8
conversion will slow you down much.
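
For a sense of how little work the conversion involves, here is a
minimal decoder sketch. It assumes the input is already well-formed
UTF-8 and does no error checking at all, so treat it as an illustration
of the bit-shuffling, not as a production converter. The function name
decode_utf8 is hypothetical.

```python
def decode_utf8(data):
    """Decode well-formed UTF-8 bytes into a list of Unicode scalar values.

    Assumption: the input is valid UTF-8; nothing here rejects bad input.
    """
    scalars = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:          # 1 byte:  0xxxxxxx
            value, extra = lead, 0
        elif lead < 0xE0:        # 2 bytes: 110xxxxx
            value, extra = lead & 0x1F, 1
        elif lead < 0xF0:        # 3 bytes: 1110xxxx
            value, extra = lead & 0x0F, 2
        else:                    # 4 bytes: 11110xxx
            value, extra = lead & 0x07, 3
        for b in data[i + 1:i + 1 + extra]:
            value = (value << 6) | (b & 0x3F)   # append 6 payload bits
        scalars.append(value)
        i += 1 + extra
    return scalars

print([hex(v) for v in decode_utf8(b"<\xc3\xa5>")])  # ['0x3c', '0xe5', '0x3e']
```

Each character costs a few comparisons, masks and shifts. To measure it
on your own data, something like Python's standard timeit module (or
the profiler for whatever language you are really using) would do.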

> Maybe someone whose native language isn't English and who spends a lot
> of time writing string processing code could help me with suggestions
> for tasks that need character modes? (like lexical processing a=å, and
> spell checking).

You are using the rather loose term "lexical processing" to refer to
setting up equivalence classes between characters (e.g. between U+0061
and U+00E5). This is language-dependent, and complex enough on its own,
but trying to do it while you continue to treat U+00E5 as the sequence
<0xC3, 0xA5> is much harder and much slower than if you had just
converted the UTF-8 to code points in the first place.
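
Once you are working with code points, one possible equivalence class
("treat a letter and its accented forms alike") takes a few lines. This
is only a sketch under a loud assumption: dropping every combining mark
is the right folding for your language, which it often is not (Swedish,
for one, treats å as a separate letter). The name base_letter_key is
hypothetical.

```python
import unicodedata

def base_letter_key(text):
    """Hypothetical folding: decompose to NFD, then drop combining marks.

    Assumption: ignoring all combining marks suits the target language.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(base_letter_key("\u00e5") == base_letter_key("a"))  # prints "True"
```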

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/


