From: Theodore H. Smith (firstname.lastname@example.org)
Date: Sun Oct 05 2003 - 16:19:46 CST
Here are some things I think.
> If you really aren't processing anything but the ASCII characters in
> your strings, like "<" and ">" in your example, you can probably get
> away with keeping your existing byte-oriented code. At least you won't
> get false matches on the ASCII characters (this was a primary design
> goal of UTF-8).
Yes, and in fact UTF8 doesn't generate any false matches when
searching for a valid UTF8 string within another valid UTF8 string.
Even if there is UTF8 between the < and >, the processing works fine.
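A minimal sketch of that point (my illustration, not from the original post): every byte of a multi-byte UTF-8 sequence has its high bit set, so a byte-oriented search for an ASCII delimiter like "<" can never match in the middle of a multi-byte character.

```python
# Byte-oriented delimiter search on UTF-8 text; no decoding needed.
# Multi-byte sequences are all >= 0x80, so b"<" (0x3C) cannot false-match.
text = "<tag>héllo ünïcode</tag>".encode("utf-8")

start = text.index(b"<")        # plain byte search
end = text.index(b">")
print(text[start:end + 1])      # b'<tag>'
```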
> However, if your goal is to simplify processing of arbitrary UTF-8
> including non-ASCII characters, I haven't found a better way than to
> read in the UTF-8, convert it on the fly to UTF-32, and THEN do your
> processing on the fixed-width UTF-32. That way you don't have to do
> one thing for Basic Latin characters and something else for the rest.
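The decode-then-process approach described above can be sketched like this (my example; note that a Python str already behaves as a fixed-width sequence of code points, i.e. UTF-32 semantics):

```python
# UTF-8 on the wire is variable-width; after decoding, every character
# is one indexable unit, ASCII and non-ASCII alike.
raw = "prix: 10€".encode("utf-8")    # 11 bytes: '€' takes 3 bytes in UTF-8

chars = raw.decode("utf-8")          # fixed-width view: 9 code points
print(len(raw), len(chars))          # 11 9
print(chars[8], hex(ord(chars[8])))  # '€' at index 8, U+20AC
```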
Well, I can do most processing just fine, as I said. I only have a
problem with lexical string processing (A = å) and spell checking. And
in fact, lexical string processing is already so complex that it
probably won't make much difference whether I use UTF32 or UTF8,
because of combining characters and the like.
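To illustrate why combining characters make lexical matching hard in any encoding (my sketch): 'å' can be stored either as one precomposed code point or as 'a' plus a combining ring, so a raw comparison fails until the strings are normalized.

```python
import unicodedata

precomposed = "\u00e5"   # U+00E5 LATIN SMALL LETTER A WITH RING ABOVE
combining = "a\u030a"    # 'a' + U+030A COMBINING RING ABOVE

assert precomposed != combining                        # raw compare fails
assert unicodedata.normalize("NFC", combining) == precomposed

# Accent-insensitive matching (a = å): decompose, keep the base letter.
assert unicodedata.normalize("NFD", precomposed)[0] == "a"
```

The same normalization step is needed whether the underlying storage is UTF8 or UTF32, which is the point made above.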
> You will probably hear from some very prominent Unicode people that
> converting to UTF-16 is better, because "most" characters are in the
> BMP, for which UTF-16 uses half as much memory. But this approach
> doesn't really solve the variable-width problem -- it merely moves it,
> from "ASCII vs. non-ASCII" to "BMP vs. non-BMP." Unless you are
> keeping large amounts of text in memory, or are working with a small
> device such as a handheld, the extra size of UTF-32 compared to
> UTF-16 is unlikely
> to be a big problem, and you have the advantage of dealing with a
> fixed-width representation for the entire Unicode code space.
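A quick demonstration of the quoted point (my example): a non-BMP character still occupies two UTF-16 code units (a surrogate pair), so UTF-16 remains variable-width, while UTF-32 stays at one unit per code point.

```python
# U+1D11E MUSICAL SYMBOL G CLEF lies outside the BMP.
ch = "\U0001d11e"

utf16_units = len(ch.encode("utf-16-le")) // 2   # 2-byte code units
utf32_units = len(ch.encode("utf-32-le")) // 4   # 4-byte code units
print(utf16_units, utf32_units)                  # 2 1
```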
Unfortunately, I'm more concerned about the speed of converting the
UTF8 to UTF32, and back. This is because I can usually process my UTF8
with byte functions.
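To make the conversion cost concrete, here is a minimal hand-rolled UTF-8 decoder (a sketch with no error handling, written by me for illustration): every non-ASCII character requires branching and per-continuation-byte work, which is exactly the overhead that byte functions avoid.

```python
def utf8_to_codepoints(data: bytes) -> list:
    """Decode UTF-8 bytes to a list of code points (UTF-32 values).
    Simplified sketch: assumes valid input, no error handling."""
    out, i = [], 0
    while i < len(data):
        b = data[i]
        if b < 0x80:                  # 1-byte sequence (ASCII)
            cp, n = b, 1
        elif b < 0xE0:                # 2-byte sequence
            cp, n = b & 0x1F, 2
        elif b < 0xF0:                # 3-byte sequence
            cp, n = b & 0x0F, 3
        else:                         # 4-byte sequence
            cp, n = b & 0x07, 4
        for j in range(1, n):         # fold in continuation bytes
            cp = (cp << 6) | (data[i + j] & 0x3F)
        out.append(cp)
        i += n
    return out

print([hex(c) for c in utf8_to_codepoints("Aå€".encode("utf-8"))])
# ['0x41', '0xe5', '0x20ac']
```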
> All of this assumes that you don't have multi-character processing
> issues, like combining characters and normalization, or culturally
> appropriate sorting, in which case your character processing WILL be
> more complex than ASCII no matter which CES you use.
Yes. Actually, I still haven't seen any reason not to use
byte-oriented-only functions for UTF8. Thanks for trying, though!
Maybe someone whose native language isn't English, and who spends a lot
of time writing string processing code, could help me with suggestions
for tasks that need character modes (like lexical processing a = å, and
spell checking)?
This archive was generated by hypermail 2.1.5 : Thu Jan 18 2007 - 15:54:24 CST