Re: C # character model

From: Antoine Leca (Antoine.Leca@renault.fr)
Date: Wed Jun 28 2000 - 05:16:30 EDT


Markus Scherer wrote:
>
> John O'Conner wrote:
> > It appears that this new product is not adopting UTF-32...and is
> > sticking with UTF-16 (or more appropriately UCS-2?).

Not very surprising, given MS's commitment to 16-bit Unicode.

> > APIs use and return single 16-bit values.

Ah, that may be a problem (what is the ToUpper return value of ß?)
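
To make the concern concrete, here is a minimal sketch in Python (not the
C# API under discussion; Python is used only because its str.upper()
applies the Unicode full case mappings). The uppercase form of ß comes back
as a two-character string, which a function returning a single 16-bit value
cannot express.

```python
# Sketch: the full uppercase mapping of U+00DF is two characters long.
sharp_s = "\u00df"         # ß, LATIN SMALL LETTER SHARP S
upper = sharp_s.upper()    # full case mapping, returns a string

print(upper, len(upper))   # prints: SS 2
```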

> > This certainly doesn't make surrogate-pair values easy to use.
> > What influence, if any, does this have on the adoption of
> > UTF-32 or even UTF-16 using surrogate pairs?

About as much as Java has, I believe...

 
> many other apis and libraries from ms - like uniscribe - support utf-16,

Agreed.

> though. ie 5.1 (exact number?) displays a surrogate pair as one single
> box instead of as two, for example.

Sorry, I believe you're mixing things up; that is not an IE feature, it
depends on the platform.

I just ran the test with IE 5.01 on a Windows 98 box, and there were distinctly
two empty boxes for each (>= U+10000) character! On the other hand,
Windows 2000 is known to have (embryonic, but more than sufficient for now)
support for surrogate pairs.
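
For reference, the two boxes correspond to the two 16-bit code units that
UTF-16 uses for any character at or above U+10000. A minimal sketch of the
standard split, in Python (the helper name to_surrogates is mine, purely
for illustration):

```python
def to_surrogates(cp):
    """Split a supplementary code point into a UTF-16 (high, low) pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    v = cp - 0x10000               # 20-bit value
    high = 0xD800 + (v >> 10)      # top 10 bits -> high surrogate
    low = 0xDC00 + (v & 0x3FF)     # bottom 10 bits -> low surrogate
    return high, low

high, low = to_surrogates(0x10000)
print(hex(high), hex(low))         # prints: 0xd800 0xdc00
```

A renderer that knows nothing about surrogates sees two separate units and
so draws two boxes; one that does pairs them up and draws a single glyph
(or a single box).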

 
<snip>
> utf-32 is interesting only when fixed-width processing is absolutely necessary.

It is also interesting on some (non-Intel) platforms, where performance
is better with 32-bit units than with 16-bit ones.

> the design of the c stdlib assumes that wchar_t strings are fixed-width,

Yes (although I believe use of UTF-16 rather than UCS-2 might be conformant).

> therefore they are migrating to utf-32 regardless of wasting space.

Huh? What leads to that conclusion?
One can perfectly well build a conforming C stdlib with an 8-bit wchar_t.
And of course, nothing prevents the use of a non-Unicode 16-bit wchar_t (in
particular for East Asian encodings), as wchar_t was set up precisely for
this use in the first place.
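
Since the width of wchar_t is implementation-defined, it is easy to check
what a given platform chose. A minimal sketch, assuming Python with its
standard ctypes module is available on the machine (the variable name
wchar_bits is mine):

```python
import ctypes

# Width of the platform's C wchar_t, as seen through ctypes.
wchar_bits = ctypes.sizeof(ctypes.c_wchar) * 8

print("wchar_t is", wchar_bits, "bits wide on this platform")
```

The result differs between platforms, which is the point: nothing in the
C standard ties wchar_t to 32 bits, or to Unicode at all.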

Antoine


