RE: UTF-16 inside UTF-8

From: Frank Yung-Fong Tang (ytang0648@aol.com)
Date: Tue Dec 02 2003 - 20:43:08 EST


    Philippe Verdy wrote:

    > Frank Yung-Fong Tang writes:
    > > But how about the UTF-16 vs UCS4 battle?
    >
    > Forget it: nearly nobody uses UCS-4 except very internally for string
    > processing at the character level. For whole strings, nearly everybody
    > uses UTF-16 as it performs better with less memory costs, and because
    > UCS-4 is not needed.

    I don't think that statement is correct. I would like to use UTF-16
    myself, but clearly not everyone does.

    1. Some people on this list prefer UCS4. (Raise your hand if you do.)
    2. wchar_t in Linux's glibc is UCS4. (Is that "nearly nobody"?)
    3. Because of 2, gconv on Linux uses UCS4.
    4. FontConfig uses UCS4 in the API it provides for Xft (see
    FcFreeTypeCharIndex in fcfreetype.h).
    5. Xft uses UCS4 internally (look at xftdraw.c, xftrender.c). Some of
    Xft's APIs (not all) use UCS4: XftTextExtents32, XftDrawString32,
    XftTextRender32, XftTextRender32BE, XftTextRender32LE, XftDrawCharSpec,
    XftCharSpecRender, XftDrawCharFontSpec, XftCharFontSpecRender.
    6. gunichar on Linux is UCS4.
    7. Because of 6, Pango uses UCS4 in its Unicode API.
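
    To illustrate what feeding one of these 32-bit APIs from UTF-16 data
    involves, here is a minimal sketch of a UTF-16 to UCS4 conversion. The
    name utf16_to_ucs4 is hypothetical (it is not part of Xft, FontConfig
    or glibc), and replacing unpaired surrogates with U+FFFD is an
    assumption of this sketch, not something any of those libraries is
    known to do.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: convert in_len UTF-16 code units to UCS4 code
 * points, writing at most out_cap entries to out. Unpaired surrogates
 * are replaced with U+FFFD (an assumption of this sketch).
 * Returns the number of code points written. */
size_t utf16_to_ucs4(const uint16_t *in, size_t in_len,
                     uint32_t *out, size_t out_cap)
{
    size_t i = 0, n = 0;

    while (i < in_len && n < out_cap) {
        uint32_t c = in[i++];

        if (c >= 0xD800 && c <= 0xDBFF) {
            /* High surrogate: must be followed by a low surrogate. */
            if (i < in_len && in[i] >= 0xDC00 && in[i] <= 0xDFFF) {
                c = 0x10000 + ((c - 0xD800) << 10) + (in[i] - 0xDC00);
                i++;
            } else {
                c = 0xFFFD;
            }
        } else if (c >= 0xDC00 && c <= 0xDFFF) {
            c = 0xFFFD; /* stray low surrogate */
        }
        out[n++] = c;
    }
    return n;
}
```

    The resulting array of 32-bit values is the kind of input a 32-bit
    entry point expects; the caller still has to size the output buffer
    (in_len entries is always enough).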

    >
    > Handling surrogates found in strings is quite simple and in fact it is
    > even simpler to detect and manage than handling MBCS-encoded strings for
    > Asian 8-bit applications, and today MBCS 8-bit processing is performed by
    > transforming it first into equivalent internal 16-bit code positions, or
    > sometimes by transcoding it to Unicode with UTF-16.
    >
    > So I do think that applications that could handle East-Asian DBCS
    > 8-bit text (EUC-*, ISO2022-*, JIS) can very easily be modified to work
    > internally with UTF-16 (notably because interoperability of Unicode
    > code points with these DBCS charsets is excellent as the transcoding
    > is not ambiguous, bijective, does not need code reordering, and just
    > consists in a simple mapping table implemented now in all OSes
    > localized for Asian markets).
    >
    > East-Asian developers have learned since long how to cope with
    > DBCS-encoded strings. Now with UTF-16, handling surrogates found in
    > string is even simpler, as UTF-16 allows bidirectional and random
    > access to any positions in strings, which means additional
    > performance and less tricky algorithms for text processing...

    Agreed. Handling surrogates is simpler than handling multibyte
    encodings.
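
    A rough sketch of why: in UTF-16 a single code unit tells you whether
    it is a high surrogate, a low surrogate, or neither, so an arbitrary
    index can be snapped to a code point boundary, and a string can be
    walked backward, without rescanning from the start. The function names
    below are hypothetical, and passing lone surrogates through unchanged
    is an assumption of this sketch.

```c
#include <stddef.h>
#include <stdint.h>

static int is_high(uint16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
static int is_low(uint16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

/* Snap an arbitrary index i to the start of the code point containing
 * it: if i points at the low half of a valid pair, step back by one. */
size_t utf16_align(const uint16_t *s, size_t len, size_t i)
{
    if (i > 0 && i < len && is_low(s[i]) && is_high(s[i - 1]))
        return i - 1;
    return i;
}

/* Read the code point that ends just before *pos and move *pos back to
 * its first code unit. Requires *pos > 0. A lone surrogate is returned
 * as-is (assumption of this sketch). */
uint32_t utf16_prev(const uint16_t *s, size_t *pos)
{
    uint32_t c = s[--*pos];

    if (is_low((uint16_t)c) && *pos > 0 && is_high(s[*pos - 1])) {
        uint32_t hi = s[--*pos];
        c = 0x10000 + ((hi - 0xD800) << 10) + (c - 0xDC00);
    }
    return c;
}
```

    A stateful multibyte encoding such as ISO-2022-JP offers no equivalent
    local test, since the meaning of a byte depends on an escape sequence
    that may be arbitrarily far back in the string.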

    Now the question is: if handling surrogates is simple, why not
    address that later, and give higher priority to other i18n issues that
    are harder to address and more critical if left unimplemented (such as
    handling non-shortest forms, which may lead to security problems)?
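
    For a concrete case of the non-shortest-form problem: the two bytes
    C0 AF decode to '/' (U+002F) in a lenient UTF-8 decoder, so a filter
    that only looks for the byte 2F can be bypassed. Below is a minimal
    sketch of a decoder that rejects such sequences. The name
    utf8_decode_strict is hypothetical, and the sketch is an illustration
    of the check, not a complete or audited validator.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one UTF-8 sequence from s (len bytes
 * available). Returns the number of bytes consumed and stores the code
 * point in *cp, or returns 0 if the sequence is ill-formed, including
 * non-shortest forms, encoded surrogates and values above U+10FFFF. */
size_t utf8_decode_strict(const unsigned char *s, size_t len, uint32_t *cp)
{
    /* Smallest code point that legitimately needs n bytes. */
    static const uint32_t min_for_len[5] = {0, 0, 0x80, 0x800, 0x10000};
    size_t n, i;
    uint32_t c;

    if (len == 0)
        return 0;
    if (s[0] < 0x80) {
        *cp = s[0];
        return 1;
    } else if ((s[0] & 0xE0) == 0xC0) {
        n = 2; c = s[0] & 0x1F;
    } else if ((s[0] & 0xF0) == 0xE0) {
        n = 3; c = s[0] & 0x0F;
    } else if ((s[0] & 0xF8) == 0xF0) {
        n = 4; c = s[0] & 0x07;
    } else {
        return 0; /* stray continuation byte or invalid lead byte */
    }

    if (len < n)
        return 0; /* truncated sequence */
    for (i = 1; i < n; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0; /* not a continuation byte */
        c = (c << 6) | (s[i] & 0x3F);
    }

    if (c < min_for_len[n])
        return 0; /* non-shortest form, e.g. C0 AF for '/' */
    if (c >= 0xD800 && c <= 0xDFFF)
        return 0; /* surrogates are not valid in UTF-8 */
    if (c > 0x10FFFF)
        return 0;
    *cp = c;
    return n;
}
```

    The key line is the comparison against min_for_len: a value that fits
    in fewer bytes than were used to encode it is rejected instead of
    being silently accepted.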

    -- 
    Frank Yung-Fong Tang
    Šýštém Årçhîtéçt, Iñtërnâtiônàl Dèvélôpmeñt, AOL Intèrâçtívë Sërviçes
    AIM:yungfongta   mailto:ytang0648@aol.com Tel:650-937-2913
    Yahoo! Msg: frankyungfongtan
    

