Re: UTF-8 and UTF-16 issues

From: john (john@nisus.com)
Date: Mon Jun 19 2000 - 23:21:22 EDT


>> "OLeary, Sean (NJ)" wrote:
>> UTF-16 is the 16-bit encoding of Unicode that includes the use of
>> surrogates. This is essentially a fixed width encoding.

> certainly not. utf-16, of course, is variable-width: 1 or 2 16-bit units per
> character. certainly the iuc discussion did not present a fixed-width encoding
> as "utf-16", but possibly as "ucs-2".
> you can make the point, and this could have been said there, too, that for
> many characters you know they will use exactly one 16-bit unit, and
> you don't need to process surrogates for that. this is not to say the encoding
> is fixed-width; it is the same as how you deal with ascii characters in
> utf-8, without declaring utf-8 to be fixed-width.
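
To make the "1 or 2 16-bit units" point concrete, here is a minimal sketch in Python. The function names utf16_units and encode_utf16 are made up for illustration and are not from any library mentioned in this thread; the arithmetic is the standard surrogate-pair rule.

```python
def utf16_units(code_point):
    """Return how many 16-bit code units UTF-16 needs for one code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode code space")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points do not encode characters")
    # Everything below U+10000 fits in a single unit; the rest needs a pair.
    return 1 if code_point < 0x10000 else 2


def encode_utf16(code_point):
    """Encode one code point as a list of 16-bit code units (illustrative helper)."""
    if utf16_units(code_point) == 1:
        return [code_point]
    v = code_point - 0x10000            # 20 bits left to distribute
    lead = 0xD800 | (v >> 10)           # top 10 bits go in the lead surrogate
    trail = 0xDC00 | (v & 0x3FF)        # low 10 bits go in the trail surrogate
    return [lead, trail]
```

So a letter such as U+0041 takes one unit, while a supplementary character such as U+1D11E takes two, which is why the encoding as a whole cannot be called fixed-width.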

>> UTF-8
>> Cons:
>> * Most characters need to be expanded into a UTF-16 form prior to table
>> lookups for character properties or codepage mappings.

> rather, i would expect an "expansion" into a 32-bit value, not into surrogate
> pairs. this is more practical (and needs to be done for utf-16, too).
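
As a rough illustration of that "expansion" step, the sketch below turns both UTF-16 code units and UTF-8 bytes into the same plain integer code points, which is what a property table would be indexed by. The names decode_utf16 and decode_utf8 are hypothetical helpers for this example; the UTF-8 side simply leans on Python's built-in decoder, and unpaired surrogates are passed through unchanged as a simplifying assumption.

```python
def decode_utf16(units):
    """Return the code points encoded by a sequence of 16-bit code units."""
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        is_lead = 0xD800 <= u <= 0xDBFF
        has_trail = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_lead and has_trail:
            # Combine the two 10-bit halves and add back the 0x10000 offset.
            out.append(0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00))
            i += 2
        else:
            # BMP character, or an unpaired surrogate passed through as-is
            # (an assumption of this sketch, not a recommendation).
            out.append(u)
            i += 1
    return out


def decode_utf8(data):
    """Return the code points encoded by UTF-8 bytes, via the built-in decoder."""
    return [ord(ch) for ch in data.decode("utf-8")]
```

Either way the lookup code sees one integer per character, regardless of how many units or bytes it occupied in the source encoding.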

So, then, is UTF-32 fixed-width, or must we aim for a UTF-128
or some such, to end this kind of kludge?

How do ATSUI & TEC deal with these variable-width characters
and then how can one create custom styles?


