Re: Newbie questions: 1) Surrogates in WinXP? 2) Unicode in PostScript?

From: Markus Scherer (markus.scherer@jtcsv.com)
Date: Fri Apr 09 2004 - 13:00:30 EDT


    Rick Cameron wrote:
    > IMHO, that's a bit misleading. The String class itself does not appear to be
    > aware of SMP characters. It clearly uses UTF-16, and the length it reports
    > is the number of code units, not the number of characters or graphemes in
    > the string.

    There is no contradiction between using UTF-16 (or UTF-8) and handling all of Unicode. The same is
    true for ICU, where we kept 16-bit strings so as not to break our users but started handling
    supplementary characters in 2000. Everyone else goes down that path once supplementary characters
    become important to them and their users. Java 1.5 is on its way there.
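
To make the distinction concrete, here is a minimal sketch of a UTF-16 string that reports its length in code units while still exposing code points. It assumes the code point methods that Java 1.5 adds to String (codePointCount, codePointAt); the class name, the helper names, and the sample string are made up for illustration.

```java
public class CodeUnitsVsCodePoints {
    // 'a', then U+10400 (a supplementary character, stored as the
    // surrogate pair D801 DC00), then 'b'.
    static final String SAMPLE = "a\uD801\uDC00b";

    // Length as String reports it: the number of 16-bit code units.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points; a surrogate pair counts once.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        System.out.println(codeUnits(SAMPLE));   // 4
        System.out.println(codePoints(SAMPLE));  // 3
        // Reading at code unit index 1 yields the whole supplementary
        // code point, not just the lead surrogate.
        System.out.println(Integer.toHexString(SAMPLE.codePointAt(1)));  // 10400
    }
}
```

The storage and the indexes stay 16-bit, and only the operations that care about supplementary characters look at pairs.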

    > Does anyone know of a String class, in C++, Java or .NET, that hides the
    > actual encoding used, and provides a public API based on code points?

    That is possible, but it would be inefficient. If you have a mismatch between your API semantics
    and your implementation on such a fundamental level, then you incur a performance penalty. Either
    you increase your operation time because you have to translate indexes on the fly, or you increase
    your memory use because you have to keep and maintain a map. You can optimize it, of course,
    especially for UTF-16 where surrogate pairs will be extremely rare, but there is very little to
    gain from indexing by anything other than 16-bit code units.
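
The two costs described above can be sketched roughly as follows. This is an illustration under stated assumptions, not how any particular library does it: the class and method names are hypothetical, and it relies on String.codePointAt, String.codePointCount and Character.charCount from Java 1.5.

```java
public class CodePointIndexing {
    // Option 1: translate on the fly. Converting a code point index to a
    // code unit index walks the string from the start, so every indexed
    // access costs time proportional to the index.
    static int toUnitIndex(String s, int cpIndex) {
        int unit = 0;
        for (int i = 0; i < cpIndex; i++) {
            // charCount is 2 for a supplementary code point, otherwise 1.
            unit += Character.charCount(s.codePointAt(unit));
        }
        return unit;
    }

    // Option 2: keep a map. One int per code point gives constant-time
    // lookup, at the price of extra memory and of rebuilding the table
    // whenever the text changes.
    static int[] buildOffsetTable(String s) {
        int[] offsets = new int[s.codePointCount(0, s.length())];
        int unit = 0;
        for (int cp = 0; cp < offsets.length; cp++) {
            offsets[cp] = unit;
            unit += Character.charCount(s.codePointAt(unit));
        }
        return offsets;
    }

    public static void main(String[] args) {
        // 'a', U+10400 as the surrogate pair D801 DC00, 'b'
        String s = "a\uD801\uDC00b";
        System.out.println(toUnitIndex(s, 2));         // 3
        int[] table = buildOffsetTable(s);
        System.out.println(table[2]);                  // 3
        System.out.println((char) s.codePointAt(table[2]));  // b
    }
}
```

In both variants the extra work exists only to hide the surrogate pair from the caller, which is the mismatch between API and implementation that the paragraph above warns about.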

    See also http://www.unicode.org/notes/tn12/

    Best regards,
    markus

    -- 
    Opinions expressed here may not reflect my company's positions unless otherwise noted.
    

