RE: UTF-16 inside UTF-8

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Tue Dec 02 2003 - 22:22:36 EST


    Frank Yung-Fong Tang writes:
    > Philippe Verdy wrote:
    >
    > > Frank Yung-Fong Tang writes:
    > > > But how about the UTF-16 vs UCS4 battle?
    > >
    > > Forget it: nearly nobody uses UCS-4 except very internally for string
    > > processing at the character level. For whole strings, nearly everybody
    > > uses UTF-16 as it performs better with less memory costs, and because
    > > UCS-4 is not needed.
    >
    > I don't think that is a correct statement. I would like to use UTF-16,
    > but it is clear that this is not always the case.
    >
    > 1. Some people on this list prefer UCS4. (Raise your hand if you do.)
    > 2. wchar_t in Linux's glib is UCS4. (And that is "nearly nobody"?)
    > 3. Because of 2, gconv on Linux uses UCS4.
    > 4. FontConfig uses UCS4 in the API it provides for Xft (see
    > FcFreeTypeCharIndex in fcfreetype.h).
    > 5. Xft uses UCS4 internally (look at xftdraw.c, xftrender.c). Some of
    > Xft's APIs use UCS4 (not all): XftTextExtents32, XftDrawString32,
    > XftTextRender32, XftTextRender32BE, XftTextRender32LE, XftDrawCharSpec,
    > XftCharSpecRender, XftDrawCharFontSpec, XftCharFontSpecRender.
    > 6. gunichar on Linux is UCS4.
    > 7. Because of 6, Pango uses UCS4 in its Unicode API.

    We're not speaking about the same thing: I was not discussing the
    representation of individual characters (yes, it's simple to make
    wchar_t 32-bit with UCS4), but the encoding of large amounts of
    text for general text processing. That's where UTF-16 is better.
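
    To make the storage side of that argument concrete, here is a minimal
    sketch in C that compares the bytes needed for the same code points in
    UCS-4 and in UTF-16. The function names ucs4_bytes and utf16_bytes are
    made up for this illustration and come from no library mentioned in
    this thread. It only counts bytes; it says nothing about speed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: bytes needed to store n code points as UCS-4.
 * Every code point takes one 32-bit unit, whatever its value. */
size_t ucs4_bytes(const uint32_t *cps, size_t n)
{
    (void)cps;                      /* the values do not matter for UCS-4 */
    return n * sizeof(uint32_t);
}

/* Hypothetical helper: bytes needed to store the same code points as
 * UTF-16. BMP code points (below U+10000) take one 16-bit unit;
 * supplementary code points take a surrogate pair, i.e. two units. */
size_t utf16_bytes(const uint32_t *cps, size_t n)
{
    assert(cps != NULL || n == 0);

    size_t units = 0;
    for (size_t i = 0; i < n; i++)
        units += (cps[i] >= 0x10000) ? 2 : 1;
    return units * sizeof(uint16_t);
}
```

    For text that stays inside the BMP, utf16_bytes returns half of what
    ucs4_bytes returns; the two only meet when every code point is
    supplementary.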

    All the examples you give above are directly related to individual
    characters (in fact mostly glyphs).

    gconv is a special case: it does not need to store a lot of text,
    and internally it handles very few characters at a time.

    So you can have a wchar_t datatype in C/C++ that stores UCS-4, but
    your strings will most often not be arrays of wchar_t but arrays of
    16-bit code units, which get decoded to 32-bit wchar_t values by
    very simple run-time scanners.
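
    As a rough sketch of what such a scanner can look like in C (the names
    utf16_next and utf16_count are invented for this example, and replacing
    an unpaired surrogate with U+FFFD is my assumption, not something any
    API named above prescribes):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decodes one code point from the UTF-16 code units
 * src[*pos .. len-1], advances *pos past the units it consumed, and
 * returns the 32-bit (UCS-4) value. An unpaired surrogate consumes one
 * unit and is returned as U+FFFD (an assumption of this sketch). */
uint32_t utf16_next(const uint16_t *src, size_t len, size_t *pos)
{
    assert(src != NULL && pos != NULL && *pos < len);

    uint32_t unit = src[(*pos)++];

    if (unit >= 0xD800 && unit <= 0xDBFF) {           /* high surrogate */
        if (*pos < len && src[*pos] >= 0xDC00 && src[*pos] <= 0xDFFF) {
            uint32_t low = src[(*pos)++];
            return 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00);
        }
        return 0xFFFD;                                /* missing low half */
    }
    if (unit >= 0xDC00 && unit <= 0xDFFF)
        return 0xFFFD;                                /* stray low surrogate */

    return unit;                                      /* BMP code point */
}

/* Hypothetical helper: counts code points in a UTF-16 buffer by running
 * the scanner over it from start to end. */
size_t utf16_count(const uint16_t *src, size_t len)
{
    size_t pos = 0, n = 0;
    while (pos < len) {
        (void)utf16_next(src, len, &pos);
        n++;
    }
    return n;
}
```

    The string itself stays in 16-bit units; only the one value being
    examined is widened to 32 bits, which is the split between string
    storage and character type described above.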

    APIs that really use 32-bit chars to represent strings are quite
    rare and in fact not needed, as UTF-16 strings will perform better.





