RE: UTF-16 inside UTF-8

From: Frank Yung-Fong Tang (ytang0648@aol.com)
Date: Tue Dec 02 2003 - 20:43:08 EST

Next message: Christopher John Fynn: "Re: MS Windows and Unicode 4.0 ?"

Previous message: Patrick Andries: "Re: MS Windows and Unicode 4.0 ?"
In reply to: Philippe Verdy: "RE: UTF-16 inside UTF-8"
Next in thread: Philippe Verdy: "RE: UTF-16 inside UTF-8"
Reply: Philippe Verdy: "RE: UTF-16 inside UTF-8"

Philippe Verdy wrote:

> Frank Yung-Fong Tang writes:
> > But how about the UTF-16 vs UCS4 battle?
>
> Forget it: nearly nobody uses UCS-4 except very internally for string
> processing at the character level. For whole strings, nearly everybody
> uses UTF-16 as it performs better with less memory costs, and because
> UCS-4 is not needed.

I don't think that statement is correct. I would like to use UTF-16
myself, but it is clearly not the case that nearly everybody does.

1. Some people on this list prefer UCS4. (Raise your hand if you do.)
2. wchar_t in Linux's glibc is UCS4. (Is that "nearly nobody"?)
3. Because of 2, gconv on Linux uses UCS4.
4. FontConfig uses UCS4 in the API it provides for Xft (see
FcFreeTypeCharIndex in fcfreetype.h).
5. Xft uses UCS4 internally (look at xftdraw.c, xftrender.c). Some of
Xft's APIs (not all) use UCS4: XftTextExtents32, XftDrawString32,
XftTextRender32, XftTextRender32BE, XftTextRender32LE, XftDrawCharSpec,
XftCharSpecRender, XftDrawCharFontSpec, XftCharFontSpecRender.
6. gunichar (GLib) on Linux is UCS4.
7. Because of 6, Pango uses UCS4 in its Unicode API.
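
To make the cost of that boundary concrete: code that keeps UTF-16
internally has to widen its strings before calling any of the 32-bit
entry points listed above. A minimal sketch in Python, assuming the input
is a list of 16-bit code units; the name utf16_to_ucs4 and the choice to
replace unpaired surrogates with U+FFFD are my own assumptions, not
anything taken from glibc, Xft or Pango:

```python
REPLACEMENT = 0xFFFD  # assumption: unpaired surrogates become U+FFFD


def utf16_to_ucs4(units):
    """Widen a sequence of UTF-16 code units to UCS-4 code points."""
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # High surrogate: it must be followed by a low surrogate.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                low = units[i + 1]
                out.append(0x10000 + ((u - 0xD800) << 10) + (low - 0xDC00))
                i += 2
                continue
            out.append(REPLACEMENT)
        elif 0xDC00 <= u <= 0xDFFF:
            # Low surrogate with no high surrogate in front of it.
            out.append(REPLACEMENT)
        else:
            out.append(u)
        i += 1
    return out
```

For example, the units 0x0041, 0xD834, 0xDD1E widen to the two code
points 0x41 and 0x1D11E.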

>
> Handling surrogates found in surrogates is quite simple and in fact it is
> even simpler to detect and manage than handling MBCS-encoded strings for
> Asian 8-bit applications, and today MBCS 8-bit processing is performed by
> transforming it first into equivalent internal 16-bit code positions, or
> sometimes by transcoding it to Unicode with UTF-16.
>
> So I do think that applications that could handle East-Asian DBCS
> 8-bit text (EUC-*, ISO2022-*, JIS) can very easily be modified to work
> internally with UTF-16 (notably because interoperability of Unicode
> code points with these DBCS charsets is excellent as the transcoding
> is not ambiguous, bijective, does not need code reordering, and just
> consists in a simple mapping table implemented now in all OSes
> localized for Asian markets).
>
> East-Asian developers have learned since long how to cope with
> DBCS-encoded strings. Now with UTF-16, handling surrogates found in
> string is even simpler, as UTF-16 allows bidirectional and random
> access to any positions in strings, which means additional performance
> and less tricky algorithms for text processing...

Agreed. It is simpler to handle surrogates than to handle multibyte
encodings.
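
The random-access point can be sketched in a few lines. In UTF-16 the
high and low surrogate ranges do not overlap with each other or with
anything else, so from an arbitrary index you can find the start of the
character by looking back at most one code unit. In a typical DBCS
encoding a trail byte can have the same value as a lead byte, so you may
have to rescan from a known boundary. The function name char_start below
is hypothetical, and the input is assumed to be a list of 16-bit code
units:

```python
def is_high_surrogate(u):
    return 0xD800 <= u <= 0xDBFF


def is_low_surrogate(u):
    return 0xDC00 <= u <= 0xDFFF


def char_start(units, i):
    """Return the index where the character containing units[i] begins."""
    if i > 0 and is_low_surrogate(units[i]) and is_high_surrogate(units[i - 1]):
        # We landed on the second half of a pair: step back one unit.
        return i - 1
    # BMP character, first half of a pair, or an unpaired surrogate.
    return i
```

With the units 0x41, 0xD834, 0xDD1E, 0x42, indexes 1 and 2 both resolve
to 1, while indexes 0 and 3 resolve to themselves.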

Now the question is: if surrogates are simple to handle, why don't we
address them later, and give higher priority to other i18n issues that
are harder to address and more critical if left unimplemented (such as
rejecting non-shortest forms, which may otherwise lead to security
problems)?
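
For anyone who has not run into the non-shortest form problem: the bytes
C0 AF decode, in a lenient UTF-8 decoder, to "/" (U+002F), so a filter
that looks for the byte 2F before decoding can be bypassed. A rough
sketch of a strict check in Python follows; decode_one and
MIN_FOR_LENGTH are names I made up for illustration, and a real decoder
would need more care around error recovery:

```python
# Smallest code point that legitimately needs a sequence of each length.
MIN_FOR_LENGTH = {1: 0x00, 2: 0x80, 3: 0x800, 4: 0x10000}


def decode_one(data, pos=0):
    """Decode one UTF-8 sequence at pos; return (code_point, next_pos)."""
    lead = data[pos]
    if lead < 0x80:
        length, cp = 1, lead
    elif 0xC0 <= lead < 0xE0:
        length, cp = 2, lead & 0x1F
    elif 0xE0 <= lead < 0xF0:
        length, cp = 3, lead & 0x0F
    elif 0xF0 <= lead < 0xF8:
        length, cp = 4, lead & 0x07
    else:
        raise ValueError("invalid lead byte")
    if pos + length > len(data):
        raise ValueError("truncated sequence")
    for b in data[pos + 1:pos + length]:
        if b & 0xC0 != 0x80:
            raise ValueError("invalid continuation byte")
        cp = (cp << 6) | (b & 0x3F)
    if cp < MIN_FOR_LENGTH[length]:
        # The same code point fits in fewer bytes: reject it.
        raise ValueError("non-shortest form")
    if 0xD800 <= cp <= 0xDFFF or cp > 0x10FFFF:
        # Surrogates encoded as UTF-8 (UTF-16 inside UTF-8) land here.
        raise ValueError("not a Unicode scalar value")
    return cp, pos + length
```

Here decode_one(b"/") returns (0x2F, 1), while decode_one(b"\xC0\xAF")
raises ValueError("non-shortest form") instead of quietly producing the
same "/".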

-- 
--
Frank Yung-Fong Tang
Šýštém Årçhîtéçt, Iñtërnâtiônàl Dèvélôpmeñt, AOL Intèrâçtívë Sërviçes
AIM:yungfongta   mailto:ytang0648@aol.com Tel:650-937-2913
Yahoo! Msg: frankyungfongtan


This archive was generated by hypermail 2.1.5 : Tue Dec 02 2003 - 21:39:11 EST