RE: Newbie questions: 1) Surrogates in WinXP? 2) Unicode in PostScript?

From: Carl W. Brown (cbrown@xnetinc.com)
Date: Sat Apr 10 2004 - 10:31:03 EDT


    Markus,

    > Rick Cameron wrote:
    >> IMHO, that's a bit misleading. The String class
    >> itself does not appear to be
    >> aware of SMP characters. It clearly uses
    >> UTF-16, and the length it reports
    >> is the number of code units, not the number
    >> of characters or graphemes in the string.
    >
    > There is no contradiction between using UTF-16 (or UTF-8) and
    > handling all of Unicode. The same is
    > true for ICU where we kept 16-bit strings to not break our users
    > but started handling supplementary
    > characters in 2000. Everyone else is going down that path when
    > supplementary characters become
    > important to them and their users. Java 1.5 is on its way there.

    I have found that in most cases the most important thing to know about
    length is the number of code units, because you are usually concerned
    about buffer sizes. If you really need to know the number of real
    characters, it is easy to test for. But what does the number of characters
    actually mean to your application when characters vary in size, are
    combining, etc.? How would you use the number of characters in a program
    that supports Japanese, Thai and Arabic?
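
    As a rough illustration of how cheap that test is, here is a minimal
    sketch in C that counts code points in a null-terminated UTF-16 string.
    The name u16_count_code_points is made up for this example and is not
    part of ICU or any standard library. It counts code points only, not
    graphemes, so combining sequences still count as several characters.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers for this sketch; not taken from any library. */
#define IS_LEAD(c)  ((c) >= 0xD800 && (c) <= 0xDBFF)
#define IS_TRAIL(c) ((c) >= 0xDC00 && (c) <= 0xDFFF)

/* Count code points in a null-terminated UTF-16 string.
 * A lead surrogate followed by a trail surrogate counts as one;
 * an unpaired surrogate is counted as one code point on its own. */
size_t u16_count_code_points(const uint16_t *s)
{
    size_t count = 0;
    while (*s != 0) {
        /* s[1] is safe to read: s[0] is not the terminator. */
        if (IS_LEAD(s[0]) && IS_TRAIL(s[1]))
            s += 2;
        else
            s += 1;
        count++;
    }
    return count;
}
```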

    I did, however, find that variants of strncpy, strncat, etc. that copy only
    whole characters are a useful tool. The function returns the number of
    code units actually copied instead of a string pointer, which you already
    have. It is especially nice for C programs that it adds a null at the end
    if you end up with unused code units. For example, if I copy 200 code units
    to a buffer, it will copy only 199 if the 200th is a leading surrogate, and
    it will then insert a null in the buffer in place of the leading surrogate,
    so that even if your application sticks a null in the 201st code unit you
    will still have a valid substring.
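
    A minimal sketch of such a strncpy variant for UTF-16 might look like the
    following. The name u16_copy_whole and its exact signature are assumptions
    made for this example, not an existing ICU or C library function. Like
    strncpy, it writes a null only when fewer than n code units are used.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper for this sketch; not taken from any library. */
#define IS_LEAD(c) ((c) >= 0xD800 && (c) <= 0xDBFF)

/* Copy at most n code units from the null-terminated string src to dst
 * without leaving a lead surrogate as the last unit of a full buffer.
 * Returns the number of code units copied (excluding any null). */
size_t u16_copy_whole(uint16_t *dst, const uint16_t *src, size_t n)
{
    size_t i = 0;
    while (i < n && src[i] != 0) {
        dst[i] = src[i];
        i++;
    }
    /* Buffer is full and ends in a lead surrogate: its trail unit
     * cannot fit, so drop the lead as well. */
    if (i == n && i > 0 && IS_LEAD(dst[i - 1]))
        i--;
    /* Null-terminate whenever there is an unused code unit. */
    if (i < n)
        dst[i] = 0;
    return i;
}
```

    With n = 200 and a lead surrogate in the 200th position, this returns 199
    and stores a null where the lead surrogate would have gone.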

    This is even more important with character sets like UTF-8 if you do not
    want to split characters across buffers.
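
    The same idea for UTF-8 can be sketched as below. Again, u8_copy_whole is
    a name invented for this example, and the code assumes a well-formed,
    null-terminated source string. It relies on the fact that UTF-8
    continuation bytes always have the bit pattern 10xxxxxx.

```c
#include <stddef.h>

/* Copy at most n bytes from the null-terminated UTF-8 string src to dst
 * without splitting a multi-byte sequence across the buffer boundary.
 * Returns the number of bytes copied (excluding any null).
 * Hypothetical helper for illustration; assumes well-formed input. */
size_t u8_copy_whole(char *dst, const char *src, size_t n)
{
    const unsigned char *s = (const unsigned char *)src;
    size_t i = 0;
    while (i < n && s[i] != 0) {
        dst[i] = src[i];
        i++;
    }
    /* If the next unread byte is a continuation byte, the copy stopped
     * in the middle of a sequence: back up to its lead byte and drop
     * the partial sequence. */
    if (i == n && (s[i] & 0xC0) == 0x80) {
        while (i > 0 && (s[i] & 0xC0) == 0x80)
            i--;
    }
    /* Null-terminate whenever there is an unused byte. */
    if (i < n)
        dst[i] = 0;
    return i;
}
```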

    > > Does anyone know of a String class, in C++, Java or .NET, that hides the
    > > actual encoding used, and provides a public API based on code points?
    >
    > Possible, but that would be inefficient. If you have
    > a mismatch between your API semantics and your
    > implementation on such a fundamental level, then you
    > incur a performance penalty. Either you
    > increase your operation time because you have to
    > translate indexes on the fly, or you increase your
    > memory use because you have to keep and maintain a map.
    > You can optimize it, of course, especially
    > for UTF-16 where surrogate pairs will be extremely
    > rare, but there is very little point in not
    > indexing 16-bit code units.

    I think, and ICU has demonstrated, that there are very efficient
    algorithms that provide code point order comparison for UTF-16 data. This
    solves the problem of processing in UTF-16 while using a UTF-8 database, or
    the other way around.
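
    One well-known way to do this is to compare code units directly and only
    "fix up" the first pair that differs, so that surrogates sort above the
    rest of the BMP, as they do in UTF-8 binary order. The sketch below
    illustrates that technique; it is not ICU's actual code, the function
    names are invented for this example, and it assumes well-formed,
    null-terminated input.

```c
#include <stdint.h>

/* Remap a code unit so that surrogates (D800..DFFF) sort above
 * E000..FFFF; units below D800 are unchanged. */
static uint16_t u16_fixup(uint16_t c)
{
    if (c >= 0xE000)
        return (uint16_t)(c - 0x0800);   /* E000..FFFF -> D800..F7FF */
    if (c >= 0xD800)
        return (uint16_t)(c + 0x2000);   /* D800..DFFF -> F800..FFFF */
    return c;
}

/* Compare two null-terminated UTF-16 strings in code point order.
 * Returns a negative, zero, or positive value like strcmp.
 * Hypothetical helper for illustration only. */
int u16_compare_code_point_order(const uint16_t *a, const uint16_t *b)
{
    /* Skip the common prefix one code unit at a time. */
    while (*a == *b) {
        if (*a == 0)
            return 0;
        a++;
        b++;
    }
    /* Only the first differing units need the fix-up. */
    uint16_t ca = u16_fixup(*a);
    uint16_t cb = u16_fixup(*b);
    return (ca > cb) - (ca < cb);
}
```

    For example, U+FF61 (FF61) sorts after U+10000 (D800 DC00) in plain
    code unit order, but before it with this comparison, which matches the
    order a UTF-8 database would produce.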

    Carl


