RE: Newbie questions: 1) Surrogates in WinXP? 2) Unicode in PostScript?

From: Carl W. Brown (cbrown@xnetinc.com)
Date: Sat Apr 10 2004 - 10:31:03 EDT

In reply to: Markus Scherer: "Re: Newbie questions: 1) Surrogates in WinXP? 2) Unicode in PostScript?"

Markus,

> Rick Cameron wrote:
>> IMHO, that's a bit misleading. The String class
>> itself does not appear to be
>> aware of SMP characters. It clearly uses
>> UTF-16, and the length it reports
>> is the number of code units, not the number
>> of characters or graphemes in the string.
>
> There is no contradiction between using UTF-16 (or UTF-8) and
> handling all of Unicode. The same is
> true for ICU where we kept 16-bit strings to not break our users
> but started handling supplementary
> characters in 2000. Everyone else is going down that path when
> supplementary characters become
> important to them and their users. Java 1.5 is on its way there.

I have found that in most cases the most important thing to know about length
is the number of code units, because you are usually concerned about
buffer sizes. If you really need to know the number of real characters, it
is easy to test. But what does the number of characters actually mean to
your application when characters vary in size, are combining, etc.? How
would you use the number of characters in a program that supports Japanese,
Thai and Arabic?
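
To illustrate the "easy to test" part, here is a minimal sketch in C of
counting code points in a UTF-16 buffer. The function name
utf16_count_code_points is made up for this example and is not from ICU or
any other library. It counts a well-formed surrogate pair once and treats an
unpaired surrogate as a single code point, which is an assumption about how
you want malformed input handled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: number of code points in `units` UTF-16 code units.
 * A lead surrogate followed by a trail surrogate counts once; an unpaired
 * surrogate is counted as one code point (an assumption for this sketch). */
size_t utf16_count_code_points(const uint16_t *s, size_t units)
{
    size_t count = 0;
    size_t i = 0;

    assert(s != NULL || units == 0);

    while (i < units) {
        if (s[i] >= 0xD800 && s[i] <= 0xDBFF &&
            i + 1 < units &&
            s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
            i += 2;  /* well-formed surrogate pair: two units, one code point */
        } else {
            i += 1;  /* BMP code unit or unpaired surrogate */
        }
        count++;
    }
    return count;
}
```

Note that the result still says nothing about graphemes: a base letter plus
combining marks counts as several code points.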

I did, however, find that variants of strncpy, strncat, etc. that only copy
whole characters are a useful tool. Such a function returns the number of
code units actually copied instead of a string pointer, which you already
have. It is especially nice for C programs if it adds a null at the end when
you end up with unused code units. For example, if I copy 200 code units to a
buffer, it will only copy 199 if the 200th is a leading surrogate, and it
will insert a null in the buffer instead of the leading surrogate, so that if
your application sticks a null in the 201st code unit you will still have a
valid substring.
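
A rough sketch of such a strncpy variant for UTF-16 might look like the
following. The name utf16_copy_whole and its exact behaviour are assumptions
for illustration only. It copies at most n code units from a null-terminated
source, backs up by one unit if the copy would end on a leading surrogate
whose trailing surrogate did not fit, fills the unused units with nulls, and
returns the number of code units copied.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static int is_lead(uint16_t u)  { return u >= 0xD800 && u <= 0xDBFF; }
static int is_trail(uint16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

/* Hypothetical strncpy-like helper: dst must have room for n code units,
 * src must be null-terminated. Returns the number of code units copied. */
size_t utf16_copy_whole(uint16_t *dst, const uint16_t *src, size_t n)
{
    size_t copied = 0;
    size_t i;

    assert(dst != NULL && src != NULL);

    while (copied < n && src[copied] != 0) {
        dst[copied] = src[copied];
        copied++;
    }

    /* We stopped at the limit and the last unit is the first half of a
     * pair whose second half is still in src: drop the leading surrogate. */
    if (copied == n && copied > 0 &&
        is_lead(dst[copied - 1]) && is_trail(src[copied])) {
        copied--;
    }

    /* Null out the unused code units, as strncpy does. */
    for (i = copied; i < n; i++) {
        dst[i] = 0;
    }
    return copied;
}
```

Like strncpy, this does not write a terminator when all n units are used, so
the caller would still keep a buffer of n + 1 units and set the last one to
null.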

This is even more important with multi-byte encodings like UTF-8 if you do
not want to split characters across buffers.
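
For UTF-8 the same idea can be sketched by backing up over continuation
bytes (those of the form 10xxxxxx). The helper name utf8_whole_prefix is
hypothetical, and the sketch assumes the input is well-formed UTF-8. It
returns how many bytes of the input fit in a buffer of max bytes without
cutting a character in half.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: largest k <= max such that s[0..k) does not end in
 * the middle of a UTF-8 sequence. Assumes s[0..len) is well-formed UTF-8. */
size_t utf8_whole_prefix(const unsigned char *s, size_t len, size_t max)
{
    size_t k;

    assert(s != NULL || len == 0);

    if (max >= len) {
        return len;              /* everything fits */
    }

    k = max;
    /* s[k] is the first byte that would NOT be copied. If it is a
     * continuation byte, the cut falls inside a character, so back up
     * until the cut sits in front of a lead byte or an ASCII byte. */
    while (k > 0 && (s[k] & 0xC0) == 0x80) {
        k--;
    }
    return k;
}
```

The bytes from s[k] onward would then go at the start of the next buffer.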

> > Does anyone know of a String class, in C++, Java or .NET, that hides the
> > actual encoding used, and provides a public API based on code points?
>
> Possible, but that would be inefficient. If you have
> a mismatch between your API semantics and your
> implementation on such a fundamental level, then you
> incur a performance penalty. Either you
> increase your operation time because you have to
> translate indexes on the fly, or you increase your
> memory use because you have to keep and maintain a map.
> You can optimize it, of course, especially
> for UTF-16 where surrogate pairs will be extremely
> rare, but there is very little point in not
> indexing 16-bit code units.

I think, and ICU has demonstrated, that there are very efficient algorithms
that compare UTF-16 data in code point order. This solves the problem of
processing in UTF-16 while using a UTF-8 database, or the other way
around.
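
The reason this matters is that a binary compare of UTF-8 bytes sorts in code
point order, while a binary compare of UTF-16 code units sorts surrogates
(D800..DFFF) below E000..FFFF. One well-known way to fix that up is sketched
below. This is not ICU's source code, and the names fixup and
utf16_cmp_code_point_order are made up for the example. In ICU itself I
believe the corresponding function is u_strcmpCodePointOrder. The sketch
assumes null-terminated, well-formed UTF-16.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Remap a code unit so that surrogates sort above E000..FFFF:
 *   E000..FFFF -> D800..F7FF,  D800..DFFF -> F800..FFFF. */
static uint16_t fixup(uint16_t u)
{
    if (u >= 0xE000) return (uint16_t)(u - 0x0800);
    if (u >= 0xD800) return (uint16_t)(u + 0x2000);
    return u;
}

/* Hypothetical helper: compares two null-terminated UTF-16 strings in code
 * point order. Returns -1, 0 or 1. */
int utf16_cmp_code_point_order(const uint16_t *a, const uint16_t *b)
{
    unsigned x, y;

    assert(a != NULL && b != NULL);

    /* Skip the common prefix; only the first differing unit matters. */
    while (*a != 0 && *a == *b) {
        a++;
        b++;
    }
    x = fixup(*a);
    y = fixup(*b);
    return (x > y) - (x < y);
}
```

The extra cost over a plain code unit compare is one adjustment at the first
difference, so there is no need to decode either string.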

Carl
