RE: UTF-16 inside UTF-8

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Tue Dec 02 2003 - 22:22:36 EST

Next message: Patrick Andries: "Re: MS Windows and Unicode 4.0 ?"

Previous message: Philippe Verdy: "RE: MS Windows and Unicode 4.0 ?"
In reply to: Frank Yung-Fong Tang: "RE: UTF-16 inside UTF-8"
Next in thread: jon@hackcraft.net: "RE: UTF-16 inside UTF-8"
Reply: jon@hackcraft.net: "RE: UTF-16 inside UTF-8"

Frank Yung-Fong Tang writes:
> Philippe Verdy wrote:
>
> > Frank Yung-Fong Tang writes:
> > > But how about the UTF-16 vs UCS4 battle?
> >
> > Forget it: nearly nobody uses UCS-4 except very internally for string
> > processing at the character level. For whole strings, nearly everybody
> > uses UTF-16 as it performs better with less memory costs, and because
> > UCS-4 is not needed.
>
> I don't think that is a correct statement. I would like to use UTF-16.
> But it is clear that is not all the case.
>
> 1. Some people in this list preferred UCS4. (Raise your hand if you do)
> 2. wchar_t in Linux's glibc is UCS4. (and that is "nearly nobody")
> 3. because of 2, therefore, gconv on linux is using UCS4
> 4. FontConfig uses UCS4 in the API it provides for Xft (see
> FcFreeTypeCharIndex in fcfreetype.h)
> 5. Xft internally uses UCS4 (look at xftdraw.c, xftrender.c). Some of
> Xft's APIs use UCS4 (not all): XftTextExtents32, XftDrawString32,
> XftTextRender32, XftTextRender32BE, XftTextRender32LE, XftDrawCharSpec,
> XftCharSpecRender, XftDrawCharFontSpec, XftCharFontSpecRender
> 6. gunichar in linux is ucs4
> 7. Because of 6, pango use UCS4 in the unicode api

We're not speaking about the same thing: I was not discussing the
representation of individual characters (yes, it's simple to make
wchar_t 32 bits wide and store UCS-4 in it), but the encoding of
large amounts of text for general string processing. That's where
UTF-16 is better.
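
To make the memory argument concrete, here is a rough sketch in Python
that compares the storage needed for the same text in both encodings.
The function name encoded_sizes and the sample strings are my own, purely
for illustration; byte order marks are left out by using the explicit
little-endian codecs.

```python
def encoded_sizes(text):
    """Return (utf16_bytes, utf32_bytes) for text, without a BOM.

    Hypothetical helper for illustration only.
    """
    utf16 = len(text.encode("utf-16-le"))  # 2 bytes per BMP char, 4 per supplementary
    utf32 = len(text.encode("utf-32-le"))  # always 4 bytes per code point
    return utf16, utf32


if __name__ == "__main__":
    for sample in ("plain ASCII text", "caf\u00e9 \u6f22\u5b57", "\U0001D11E"):
        u16, u32 = encoded_sizes(sample)
        print(repr(sample), "UTF-16:", u16, "bytes, UTF-32:", u32, "bytes")
```

For text made only of BMP characters, the UTF-16 form is half the size
of the UCS-4 form; the two only become equal for text made entirely of
supplementary characters.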

All the examples you give above are directly related to individual
characters (in fact mostly glyphs).

gconv is a special case: it does not need to store a lot of text,
and internally it handles only a few characters at a time.

So you can have a wchar_t datatype in C/C++ that stores UCS-4, but
your strings will most often not be arrays of wchar_t. They will be
arrays of 16-bit code units, which get decoded to 32-bit wchar_t
values by very simple run-time scanners.
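
As an illustration of how simple such a scanner can be, here is a
minimal sketch in Python (not taken from any of the libraries named
above). The name scan_utf16 is hypothetical, and replacing unpaired
surrogates with U+FFFD is my own assumption; a real implementation
might prefer to report an error instead.

```python
def scan_utf16(units):
    """Yield 32-bit code points from an iterable of 16-bit code units.

    Hypothetical helper: unpaired surrogates are replaced by U+FFFD,
    which is an assumption of this sketch, not a requirement.
    """
    pending = None  # high surrogate waiting for its low partner
    for u in units:
        if pending is not None:
            if 0xDC00 <= u <= 0xDFFF:
                # Combine the surrogate pair into one supplementary code point.
                yield 0x10000 + ((pending - 0xD800) << 10) + (u - 0xDC00)
                pending = None
                continue
            yield 0xFFFD  # high surrogate not followed by a low one
            pending = None
        if 0xD800 <= u <= 0xDBFF:
            pending = u  # remember it and wait for the next unit
        elif 0xDC00 <= u <= 0xDFFF:
            yield 0xFFFD  # low surrogate with no preceding high one
        else:
            yield u  # BMP code unit maps directly to its code point
    if pending is not None:
        yield 0xFFFD  # input ended in the middle of a pair


if __name__ == "__main__":
    # 'A', U+1D11E as a surrogate pair, then U+00E9
    print([hex(cp) for cp in scan_utf16([0x0041, 0xD834, 0xDD1E, 0x00E9])])
```

The scanner keeps only one pending code unit of state, so the string
itself can stay in its compact 16-bit form while the code that needs
individual characters sees 32-bit values.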

APIs that really use 32-bit chars to represent strings are quite
rare and in fact not needed, as UTF-16 strings will perform better.



This archive was generated by hypermail 2.1.5 : Tue Dec 02 2003 - 23:26:35 EST