RE: UTF-16 inside UTF-8

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Tue Dec 02 2003 - 19:58:46 EST


    Frank Yung-Fong Tang writes:
    > But how about the UTF-16 vs UCS4 battle?

    Forget it: almost nobody uses UCS-4, except internally for string
    processing at the character level. For whole strings, nearly everybody uses
    UTF-16, because it performs better with lower memory cost and because UCS-4
    is not needed.

    Handling surrogates found in strings is quite simple. In fact, they are
    even simpler to detect and manage than MBCS-encoded strings in Asian 8-bit
    applications, and today MBCS 8-bit processing is performed by first
    transforming the text into equivalent internal 16-bit code positions, or
    sometimes by transcoding it to Unicode with UTF-16.
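
    To illustrate why detection is simple: high surrogates (U+D800..U+DBFF)
    and low surrogates (U+DC00..U+DFFF) occupy disjoint ranges, so a single
    16-bit code unit identifies its own role without any context. Here is a
    minimal sketch in Python; the function names are mine, chosen only for
    illustration, and are not part of any library.

```python
def is_high_surrogate(unit):
    """True if the 16-bit code unit is a leading (high) surrogate."""
    return 0xD800 <= unit <= 0xDBFF


def is_low_surrogate(unit):
    """True if the 16-bit code unit is a trailing (low) surrogate."""
    return 0xDC00 <= unit <= 0xDFFF


def utf16_units(text):
    """Return the UTF-16 code units of a Python string as a list of ints."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]


def combine_surrogates(high, low):
    """Combine a valid high/low surrogate pair into one code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# 'a' followed by U+1D11E (MUSICAL SYMBOL G CLEF), which needs a pair.
units = utf16_units("a\U0001D11E")
```

    By contrast, in many DBCS encodings a byte value can be either a trail
    byte or a single-byte character, so classifying it requires scanning
    from a known boundary.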

    So I do think that applications that can handle East-Asian DBCS 8-bit text
    (EUC-*, ISO2022-*, JIS) can very easily be modified to work internally with
    UTF-16, notably because interoperability of Unicode code points with these
    DBCS charsets is excellent: the transcoding is unambiguous and bijective,
    does not need code reordering, and consists of just a simple mapping table
    that is now implemented in all OSes localized for Asian markets.
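
    As a rough illustration of such table-driven transcoding, the sketch
    below uses Python's built-in "euc_jp" codec to go from EUC-JP bytes to
    UTF-16 and back. The two helper names are hypothetical, and the round
    trip is only demonstrated for this one sample string, not for a whole
    charset.

```python
def dbcs_to_utf16(data, charset="euc_jp"):
    """Transcode DBCS-encoded bytes to UTF-16 (little-endian, no BOM)."""
    return data.decode(charset).encode("utf-16-le")


def utf16_to_dbcs(data, charset="euc_jp"):
    """Transcode UTF-16 (little-endian, no BOM) bytes back to the DBCS charset."""
    return data.decode("utf-16-le").encode(charset)


# Three kanji: two bytes each in EUC-JP, one 16-bit code unit each in UTF-16.
original = "日本語".encode("euc_jp")
as_utf16 = dbcs_to_utf16(original)
round_trip = utf16_to_dbcs(as_utf16)
```

    For this sample, round_trip equals original, and both encodings take
    six bytes.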

    East-Asian developers have long since learned how to cope with DBCS-encoded
    strings. Now with UTF-16, handling surrogates found in strings is even
    simpler, as UTF-16 allows bidirectional and random access to any position
    in a string, which means better performance and less tricky algorithms
    for text processing...
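
    The random-access and backward-scanning points can be sketched as
    follows, again in Python over a list of 16-bit code units. The helper
    names are mine and purely illustrative, and unpaired surrogates are
    simply passed through unchanged, which a real application might prefer
    to treat as an error.

```python
def code_point_start(units, index):
    """Snap an arbitrary index back to the first unit of its code point.

    At most one step backward is ever needed: if the unit at `index` is a
    low surrogate preceded by a high surrogate, the code point starts at
    index - 1; otherwise it starts at `index` itself.
    """
    if (index > 0
            and 0xDC00 <= units[index] <= 0xDFFF
            and 0xD800 <= units[index - 1] <= 0xDBFF):
        return index - 1
    return index


def iter_code_points_backward(units):
    """Yield code points from the end of the sequence to the start."""
    i = len(units) - 1
    while i >= 0:
        unit = units[i]
        if (0xDC00 <= unit <= 0xDFFF and i > 0
                and 0xD800 <= units[i - 1] <= 0xDBFF):
            # Valid pair: combine the high and low surrogate.
            high = units[i - 1]
            yield 0x10000 + ((high - 0xD800) << 10) + (unit - 0xDC00)
            i -= 2
        else:
            # BMP code unit, or an unpaired surrogate passed through as-is.
            yield unit
            i -= 1


# 'a', U+1D11E as a surrogate pair, then 'b'.
sample = [0x0061, 0xD834, 0xDD1E, 0x0062]
start = code_point_start(sample, 2)
reversed_points = list(iter_code_points_backward(sample))
```

    Landing on index 2 (the low surrogate) only requires looking one unit
    back, with no rescan from the start of the string.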






    This archive was generated by hypermail 2.1.5 : Tue Dec 02 2003 - 20:51:28 EST