Re: Newbie questions: 1) Surrogates in WinXP? 2) Unicode in PostScript?

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Fri Apr 09 2004 - 17:33:23 EDT


    From: "Markus Scherer" <markus.scherer@jtcsv.com>
    > Rick Cameron wrote:
    > > Does anyone know of a String class, in C++, Java or .NET, that hides the
    > > actual encoding used, and provides a public API based on code points?
    >
    > Possible, but that would be inefficient. If you have a mismatch between your
    > API semantics and your implementation on such a fundamental level, then you
    > incur a performance penalty. Either you increase your operation time because
    > you have to translate indexes on the fly, or you increase your memory use
    > because you have to keep and maintain a map. You can optimize it, of course,
    > especially for UTF-16 where surrogate pairs will be extremely rare, but there
    > is very little point in not indexing 16-bit code units.

    Actually the increase in operation time with UTF-16 is extremely small compared
    to the cost of adding and maintaining a new set of APIs based on UTF-32, of
    adapting existing algorithms that already work well with UTF-16 to larger code
    units, or of the additional memory use incurred in almost all modern languages
    where such an API would be used.

    Handling UTF-16 surrogates, which can very easily be recognized and parsed in
    both directions, is not a time-critical operation.
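
    For illustration, here is a minimal sketch of such surrogate handling in Java
    (the class and method names are mine, not taken from any existing library),
    working over any UTF-16 CharSequence:

        // Sketch only: hand-rolled surrogate recognition over a UTF-16 CharSequence.
        final class Utf16 {
            static boolean isHighSurrogate(char c) { return c >= 0xD800 && c <= 0xDBFF; }
            static boolean isLowSurrogate(char c)  { return c >= 0xDC00 && c <= 0xDFFF; }

            // Forward scan: decode the code point starting at code unit index i.
            static int codePointAt(CharSequence s, int i) {
                char c = s.charAt(i);
                if (isHighSurrogate(c) && i + 1 < s.length()
                        && isLowSurrogate(s.charAt(i + 1))) {
                    return 0x10000 + ((c - 0xD800) << 10) + (s.charAt(i + 1) - 0xDC00);
                }
                return c; // BMP code point, or an unpaired surrogate left as-is
            }

            // Backward scan: decode the code point that ends just before index i.
            static int codePointBefore(CharSequence s, int i) {
                char d = s.charAt(i - 1);
                if (isLowSurrogate(d) && i >= 2 && isHighSurrogate(s.charAt(i - 2))) {
                    return 0x10000 + ((s.charAt(i - 2) - 0xD800) << 10) + (d - 0xDC00);
                }
                return d;
            }
        }

    A few range checks and one shift per non-BMP character is all the extra work
    that surrogates add to a scan in either direction.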

    The only case where it could cause problems is when working with substrings, but
    such behavior can be specified precisely in the updated UTF-16 API.
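
    As a sketch of what such a specification could look like (the helper name is
    hypothetical), a substring operation can simply clamp any code unit index to the
    nearest code point boundary, assuming a well-formed UTF-16 backing store:

        // If index i falls on a low surrogate, it is inside a pair:
        // back up one code unit so a substring never splits the pair.
        static int toCodePointBoundary(CharSequence s, int i) {
            if (i > 0 && i < s.length()) {
                char c = s.charAt(i);
                if (c >= 0xDC00 && c <= 0xDFFF) {
                    return i - 1;
                }
            }
            return i;
        }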

    Additionally, anyone who imagines that it is always safe to create substrings
    based only on UTF-32 code units or code points forgets that such an operation is
    no safer, because the more user-friendly level is the default grapheme cluster.
    More generally, an API based on handling individual "characters" (in fact code
    points) has its own caveats, notably because of the effects of normalization
    (reordering or recombining of combining characters). A well-internationalized
    application or API should preferably work with whole strings, not with
    individual "characters" (or code points or UTF-32 code units).

    So the real difficulty is not in the choice of internal representation between
    UTF-16 and UTF-32, but in performing safe computation on grapheme clusters. Even
    if a String is internally stored and managed as UTF-16 code units, it is
    straightforward to work on it without transcoding into temporary buffers, by
    using iterators. An iterator that parses a UTF-16 encoded string and returns
    UTF-32 code units or code points can safely be built, and more complex iterators
    such as grapheme cluster iterators can be layered on top of it. Libraries that
    normalize UTF-16 strings without ever creating a UTF-32 string instance already
    exist, and their performance differs only marginally from a normalizer working
    on UTF-32 string instances, because the real complexity does not lie at the
    internal encoding level.
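
    As an illustration of the kind of iterator described above (the class is a
    sketch, not a type from an existing library), here is a code point iterator over
    a UTF-16 backing store, on top of which a grapheme cluster iterator could then
    be layered:

        // Iterates code points over a UTF-16 CharSequence without ever
        // materializing a UTF-32 copy of the string.
        final class CodePointIterator {
            private final CharSequence s;
            private int pos = 0;

            CodePointIterator(CharSequence s) { this.s = s; }

            boolean hasNext() { return pos < s.length(); }

            int next() {
                char c = s.charAt(pos++);
                if (c >= 0xD800 && c <= 0xDBFF && pos < s.length()) {
                    char d = s.charAt(pos);
                    if (d >= 0xDC00 && d <= 0xDFFF) {
                        pos++;
                        return 0x10000 + ((c - 0xD800) << 10) + (d - 0xDC00);
                    }
                }
                return c; // BMP character or unpaired surrogate
            }
        }

    A grapheme cluster iterator would then consume these code points and group them
    according to the default boundary rules, still without any UTF-32 storage.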

    Now imagine a database defined to store a length-limited VARCHAR(32) field,
    where the SQL CHAR type is bound at database creation time to a UTF-32 code
    unit. Nothing forbids the database from using UTF-16 internally in its tables
    for storage, because counting the number of equivalent UTF-32 code units in the
    same string is extremely easy (a field constraint check has very little work to
    do to count characters, not significantly more than counting characters in a
    null-terminated C/C++ string...).
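
    For instance (a sketch only, assuming the stored value is well-formed UTF-16),
    the count of equivalent UTF-32 code units is just the number of UTF-16 code
    units minus the number of surrogate pairs:

        // Every low surrogate closes a pair, so it does not start a new code point.
        static int utf32Length(CharSequence s) {
            int n = 0;
            for (int i = 0; i < s.length(); i++) {
                char c = s.charAt(i);
                if (c < 0xDC00 || c > 0xDFFF) {
                    n++; // count everything except low surrogates
                }
            }
            return n;
        }

        // The VARCHAR(32) constraint check is then simply: utf32Length(value) <= 32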

    JavaScript/ECMAScript and Java both define their "char" to be a UTF-16 code
    unit, not necessarily a full code point. This is not a limitation, once you
    realize that the strlen() function or length() method is not specified to return
    the number of code points but the number of code units needed to allocate
    backing arrays... A good String library would also add a function to compute the
    number of code points in the same string, with performance comparable to
    strlen() or length().

    If scanning the string each time is not acceptable, a String class could just as
    well maintain a cached count of the code points it stores. Most software today
    already calls strlen() extremely often without noticeable impact. In Java, where
    the String.length() method does not scan the backing store but returns the
    cached length field of a sized array instance, one can just as well add a member
    field caching this code point length...
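
    As a sketch of that idea (the wrapper class is hypothetical, not an existing
    API), the code point count needs to be computed at most once, because the
    wrapped value is immutable:

        final class CountedString {
            private final String value;
            private int codePointLength = -1; // -1 means "not computed yet"

            CountedString(String value) { this.value = value; }

            int length() { return value.length(); } // UTF-16 code units, O(1)

            int codePointLength() {
                if (codePointLength < 0) {
                    int n = 0;
                    for (int i = 0; i < value.length(); i++) {
                        char c = value.charAt(i);
                        if (c < 0xDC00 || c > 0xDFFF) n++; // skip low surrogates
                    }
                    codePointLength = n;
                }
                return codePointLength;
            }
        }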

    In fact the actual internal storage of a String may still be UTF-32 or UTF-8 or
    SCSU or CESU-8 or any other Unicode-compatible encoding, even if the API is
    exposed in terms of UTF-16 code units. The actual size of code units in memory
    should be hidden at the String API level, allowing various internal
    representations that best fit the string's source, usage, or target. It is even
    possible to mix various internal encodings across String instances, by caching
    member fields that flag the convention used to encode and decode each instance
    (this could also include the normalization state of a string, or perhaps other
    state such as previously checked letter case).
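
    A rough sketch of that kind of object (purely illustrative; none of these names
    come from an existing library):

        // The encoded bytes and the flags describing them are private details;
        // the public API would still speak UTF-16 code units.
        final class FlexibleString {
            enum Storage   { UTF16, UTF8, SCSU, CESU8 }  // internal encoding used
            enum NormState { UNKNOWN, NFC, NFD }         // cached normalization state

            private final byte[] data;                   // immutable encoded content
            private final Storage storage;               // how 'data' is encoded
            private NormState norm = NormState.UNKNOWN;  // filled in lazily

            FlexibleString(byte[] data, Storage storage) {
                this.data = data;
                this.storage = storage;
            }

            // length() and charAt() would decode from 'data' on demand, so two
            // instances with different Storage values expose the same API.
        }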

    Consider a String as an object with required immutable content (the string
    value) and various properties that can be either computed on demand or cached as
    additional members, and you can then minimize the number of transcoding, folding
    or transform operations needed to make a string conform to an interface, by
    delaying all these operations until they are really needed. In most cases, the
    Character level is not very useful except temporarily, while computing these
    operations...


