Re: Non-ascii string processing?

From: jon@spin.ie
Date: Mon Oct 06 2003 - 08:31:09 CST


> > If you really aren't processing anything but the ASCII characters
> > within
> > your strings, like "<" and ">" in your example,
> you can probably get
> > away with keeping your existing byte-oriented code. At least you won't
> > get false matches on the ASCII characters (this was a primary design
> > goal of UTF-8).
>
> Yes, and in fact, UTF8 doesn't generate any false matches when
> searching for a valid UTF8 string, within another valid UTF8 string.

However, it will generate false misses when searching for a valid UTF-8 string within an invalid UTF-8 string. In important cases this can lead to severe security issues: for example, if the search is used to filter disallowed sequences (say "<script" in an HTML filter or "../" in a URI filter) and the UTF-8 is later decoded by a tolerant converter, then the disallowed sequences can be sneaked past the filter by sending invalid UTF-8. There have certainly been cases of this in the past (IIS, for example, could be fooled into accessing files outside of the webroot).
Hence your search function must either include a check for invalid UTF-8 or be used only in a situation where you know this won't cause problems (either because invalid UTF-8 will raise an error elsewhere, or because no security problem can arise from such data). In particular, if it is part of a library that might be used elsewhere there could be problems, as users of the library might assume you are doing more checking than you are and neglect to check themselves.
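
To make this concrete, here is a minimal validity-check sketch of my own (not anything from the original code, and the name is_valid_utf8 and the exact checks are my choices): it rejects the kinds of malformed input discussed above - stray continuation bytes, truncated sequences, over-long encodings, surrogates and out-of-range values.

#include <stddef.h>

/* Sketch only: returns 1 if the buffer holds well-formed UTF-8, 0 otherwise. */
int is_valid_utf8(const unsigned char *s, size_t len)
{
        size_t i = 0;
        while (i < len) {
                unsigned char b = s[i];
                size_t need, k;
                unsigned long cp;
                if (b < 0x80) { i++; continue; }                  /* ASCII byte */
                else if ((b & 0xE0) == 0xC0) { need = 1; cp = b & 0x1F; }
                else if ((b & 0xF0) == 0xE0) { need = 2; cp = b & 0x0F; }
                else if ((b & 0xF8) == 0xF0) { need = 3; cp = b & 0x07; }
                else return 0;                                    /* invalid lead byte */
                if (i + need >= len) return 0;                    /* truncated sequence */
                for (k = 1; k <= need; k++) {
                        if ((s[i + k] & 0xC0) != 0x80) return 0;  /* not a continuation byte */
                        cp = (cp << 6) | (s[i + k] & 0x3F);
                }
                /* reject over-long encodings, surrogates and out-of-range values */
                if ((need == 1 && cp < 0x80) ||
                    (need == 2 && cp < 0x800) ||
                    (need == 3 && cp < 0x10000) ||
                    (cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
                        return 0;
                i += need + 1;
        }
        return 1;
}

A filter would run such a check (or reject/normalise the input) before doing the byte-oriented search, so that a tolerant downstream converter cannot resurrect a disallowed sequence.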

> Unfortunately, I'm more concerned about the speed of converting the
> UTF8 to UTF32, and back. This is because usually, I can process my UTF8
> with byte functions.

This is a "swings and roundabouts" situation. Granted dealing with a large array or transmitting a stream of 8-bit units will generally be faster than dealing with a similarly sized stream of 32-bit units (they will be similarly sized if they mainly have ASCII data - and even the worse-case scenario for UTF-8 won't be larger than the equivalent UTF-32 for valid Unicode characters). At the same time though dealing with a single 32-bit unit is generally faster than dealing with a single 8-bit unit on most modern machines; the 8-bit unit will generally be converted to and from 32-bit or larger units anyway - so if you have an average of 1.2 (say it's mainly from the ASCII) octets per character in UTF-8 you are really dealing with 1.2 times as many 32-bit units as if you used UTF-32. If you are coming closer to an average of 4 octets per character in UTF-8 then you are qadrupling the number of 32-bit units to process, as well as possible conversion overhead.

The effect of this on processing efficiency is going to depend on just what you are doing with the characters, and what optimisations can be applied (whether by the programmer or the compiler). For some operations UTF-8 can be considerably less efficient than UTF-32.
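
One illustration of that (my own example, not from the original post): indexing the n-th character is a constant-time array lookup in UTF-32, but in UTF-8 it means scanning from the start of the string and skipping continuation bytes.

#include <stddef.h>

/* Sketch: return a pointer to the n-th character (0-based) of a
   NUL-terminated, assumed-valid UTF-8 string, or NULL if the string
   is too short. O(n) in the byte length, where the UTF-32 equivalent
   is simply &str32[n]. */
const char *utf8_index(const char *str, size_t n)
{
        while (*str) {
                if ((*str & 0xC0) != 0x80) {  /* lead byte: start of a character */
                        if (n == 0)
                                return str;
                        --n;
                }
                ++str;
        }
        return NULL;
}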

It also depends on how much the properties you are dealing with are "hidden" by UTF-8. On the one hand the character-based strlen mentioned in this thread is easy to write for UTF-8:

#include <stddef.h>

size_t charlen(const char* str){ /* assumes valid UTF-8 */
        size_t ret = 0;
        while (*str)
                /* count every byte that is not a continuation byte (10xxxxxx),
                   i.e. one byte per encoded character */
                if ((*str++ & 0xC0) != 0x80)
                        ++ret;
        return ret;
}

How this compares with the UTF-32 equivalent will vary, and note that it still has the validity issues mentioned above. Generally, though, UTF-8 doesn't have many problems with this sort of operation. On the other hand, while it is certainly possible to use UTF-8 directly for the property lookups needed by most functionality that treats Unicode as more than just a bunch of 21-bit numbers encoded in various ways, it is easier and more efficient (often including the memory footprint of the program) to do much of that work in UTF-16 or UTF-32.
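
A typical way to do that (again my own sketch, and assuming valid input as charlen above does) is to decode each UTF-8 sequence to a 32-bit code point as you walk the string and feed that single value into whatever property table you use:

#include <stddef.h>
#include <stdint.h>

/* Sketch: decode one UTF-8 sequence (assumed valid) starting at str,
   store the code point in *cp, and return the number of bytes consumed.
   The decoded uint32_t is what you would hand to a property lookup. */
size_t utf8_decode(const char *str, uint32_t *cp)
{
        const unsigned char *s = (const unsigned char *)str;
        if (s[0] < 0x80) {                       /* 1 byte: ASCII */
                *cp = s[0];
                return 1;
        } else if ((s[0] & 0xE0) == 0xC0) {      /* 2 bytes */
                *cp = ((uint32_t)(s[0] & 0x1F) << 6) | (s[1] & 0x3F);
                return 2;
        } else if ((s[0] & 0xF0) == 0xE0) {      /* 3 bytes */
                *cp = ((uint32_t)(s[0] & 0x0F) << 12) |
                      ((uint32_t)(s[1] & 0x3F) << 6) | (s[2] & 0x3F);
                return 3;
        } else {                                 /* 4 bytes */
                *cp = ((uint32_t)(s[0] & 0x07) << 18) |
                      ((uint32_t)(s[1] & 0x3F) << 12) |
                      ((uint32_t)(s[2] & 0x3F) << 6) | (s[3] & 0x3F);
                return 4;
        }
}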


