Re: Non-ascii string processing?

From: Doug Ewell
Date: Sat Oct 04 2003 - 15:07:17 CST

Theodore H. Smith <delete at elfdata dot com> wrote:

> I'm wondering how people tend to do their non-ascii string processing.
> I'm wondering, if anyone really needs anything other than byte
> oriented code? I'm using UTF8 as my character format, and UTF8 is
> variable width, of course. I offer the option of processing UTF8, with
> byte functions, however.
> EG:
> Start = MyString.InStr( "<" )
> End = MyString.InStr( Start + 1, ">" )
> things like this, it really doesn't matter if your data is UTF8, you
> can still process it like bytes! Leading to faster speed, and simpler
> code.

If you really aren't processing anything but the ASCII characters within
your strings, like "<" and ">" in your example, you can probably get
away with keeping your existing byte-oriented code. At least you won't
get false matches on the ASCII characters (this was a primary design
goal of UTF-8).
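
To make that concrete, here is a minimal sketch in Python (not the
language of the quoted example) of a byte-oriented search for "<" and
">" in UTF-8 data. The function name find_tag and the sample string are
made up for illustration. It works because every byte of a multi-byte
UTF-8 sequence is 0x80 or higher, so it can never be mistaken for an
ASCII byte.

```python
def find_tag(data, start=0):
    """Return (open, close) byte offsets of the first <...> pair, or None.

    Hypothetical helper: 'data' is assumed to be well-formed UTF-8 bytes.
    """
    open_pos = data.find(b"<", start)
    if open_pos == -1:
        return None
    close_pos = data.find(b">", open_pos + 1)
    if close_pos == -1:
        return None
    return open_pos, close_pos


# Sample text with non-ASCII letters (i-diaeresis, e-acute) around the tag.
sample = "na\u00efve <\u00e9mph>text</\u00e9mph>".encode("utf-8")

span = find_tag(sample)
# Slice on byte offsets, then decode only the piece we care about.
tag_name = sample[span[0] + 1:span[1]].decode("utf-8")
```

Note that the offsets are byte offsets, not character counts, so they
are only meaningful when applied back to the same byte string.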

However, if your goal is to simplify processing of arbitrary UTF-8 text,
including non-ASCII characters, I haven't found a better way than to
read in the UTF-8, convert it on the fly to UTF-32, and THEN do your
processing on the fixed-width UTF-32. That way you don't have to do one
thing for Basic Latin characters and something else for the rest.
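
As a rough sketch of that approach, assuming Python and a made-up helper
name (utf8_to_utf32_units), the conversion below turns UTF-8 bytes into
a list of 32-bit code units. After that, the Nth character is simply
element N, whether it is ASCII or not. Python's own str type already
indexes by code point, so the explicit UTF-32 step here is only meant to
mirror what one would do in a lower-level language.

```python
import struct


def utf8_to_utf32_units(data):
    """Decode UTF-8 bytes into a list of UTF-32 code units (integers).

    Hypothetical helper for illustration; raises UnicodeDecodeError
    if 'data' is not well-formed UTF-8.
    """
    raw = data.decode("utf-8").encode("utf-32-le")  # 4 bytes per character
    count = len(raw) // 4
    return list(struct.unpack("<%dI" % count, raw))


# 'a', e-acute, euro sign, and a non-BMP character: 1, 2, 3 and 4 bytes
# respectively in UTF-8.
utf8_bytes = "a\u00e9\u20ac\U0001F600".encode("utf-8")
units = utf8_to_utf32_units(utf8_bytes)

# Fixed-width access: every character is one element, regardless of
# how many bytes it took in the UTF-8 input.
fourth_char = units[3]
```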

You will probably hear from some very prominent Unicode people that
converting to UTF-16 is better, because "most" characters are in the
BMP, for which UTF-16 uses half as much memory. But this approach
doesn't really solve the variable-width problem -- it merely moves it,
from "ASCII vs. non-ASCII" to "BMP vs. non-BMP." Unless you are keeping
large amounts of text in memory, or are working with a small device such
as a handheld, the extra size of UTF-32 compared to UTF-16 is unlikely
to be a big problem, and you have the advantage of dealing with a
fixed-width representation for the entire Unicode code space.
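
The difference is easy to see by counting code units. The sketch below
is in Python, with a made-up helper name (code_units), and compares one
BMP character and one non-BMP character in both encodings. UTF-16 needs
a surrogate pair for the second one, while UTF-32 always uses exactly
one unit at twice the storage cost for BMP text.

```python
import struct


def code_units(text, encoding, unit_size):
    """Number of code units 'text' occupies in the given encoding.

    Hypothetical helper; 'unit_size' is the code unit width in bytes
    (2 for UTF-16, 4 for UTF-32).
    """
    return len(text.encode(encoding)) // unit_size


bmp_char = "\u00e9"          # LATIN SMALL LETTER E WITH ACUTE (in the BMP)
non_bmp_char = "\U0001D11E"  # MUSICAL SYMBOL G CLEF (outside the BMP)

utf16_counts = (code_units(bmp_char, "utf-16-le", 2),
                code_units(non_bmp_char, "utf-16-le", 2))
utf32_counts = (code_units(bmp_char, "utf-32-le", 4),
                code_units(non_bmp_char, "utf-32-le", 4))

# The two UTF-16 code units for the non-BMP character are a surrogate pair.
surrogates = struct.unpack("<2H", non_bmp_char.encode("utf-16-le"))
```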

All of this assumes that you don't have multi-character processing
issues, like combining characters and normalization, or culturally
appropriate sorting, in which case your character processing WILL be
more complex than ASCII no matter which character encoding scheme (CES)
you use.
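
A small illustration of the combining-character issue, sketched in
Python with its standard unicodedata module (the helper name same_text
is made up): the same visible letter can be one code point or two, so a
plain comparison fails even in a fixed-width encoding until both sides
are normalized.

```python
import unicodedata


def same_text(a, b):
    """Compare two strings after normalizing both to NFC.

    Hypothetical helper; NFC is one reasonable choice of normalization
    form, not the only one.
    """
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


composed = "\u00e9"     # e-acute as a single precomposed code point
decomposed = "e\u0301"  # 'e' followed by COMBINING ACUTE ACCENT

naive_equal = (composed == decomposed)            # code-point comparison
normalized_equal = same_text(composed, decomposed)
```

Here naive_equal is False and normalized_equal is True, even though both
strings would be stored with fixed-width units in UTF-32.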

-Doug Ewell
 Fullerton, California
