From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Fri Apr 09 2004 - 17:33:23 EDT
From: "Markus Scherer" <markus.scherer@jtcsv.com>
> Rick Cameron wrote:
> > Does anyone know of a String class, in C++, Java or .NET, that hides the
> > actual encoding used, and provides a public API based on code points?
>
> Possible, but that would be inefficient. If you have a mismatch between your
> API semantics and your implementation on such a fundamental level, then you
> incur a performance penalty. Either you increase your operation time because
> you have to translate indexes on the fly, or you increase your memory use
> because you have to keep and maintain a map. You can optimize it, of course,
> especially for UTF-16 where surrogate pairs will be extremely rare, but there
> is very little point in not indexing 16-bit code units.
Actually, the increase in operation time with UTF-16 is minimal compared with
the cost of adding and maintaining a new set of APIs based on UTF-32, of
adapting existing algorithms that already work well with UTF-16 to larger code
units, or of the additional memory needed in almost all modern languages where
the API will be used.
Handling UTF-16 surrogates, which can very easily be recognized and parsed in
both directions, is not a time-critical operation.
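
To illustrate how little work this is, here is a minimal sketch in Java. The
class Utf16Step and its methods are hypothetical names invented for this
example; only the java.lang.Character surrogate tests are real library calls.
An unpaired surrogate is simply treated as a one-unit step, which is an
assumption of this sketch rather than a rule stated above.

```java
// Hypothetical helper (not part of any library): moves through a UTF-16
// sequence one code point at a time, forward or backward.
final class Utf16Step {
    // Returns the code point boundary that follows index i (0 <= i < length).
    static int next(CharSequence s, int i) {
        if (Character.isHighSurrogate(s.charAt(i))
                && i + 1 < s.length()
                && Character.isLowSurrogate(s.charAt(i + 1))) {
            return i + 2; // well-formed surrogate pair: two code units
        }
        return i + 1;     // BMP code unit, or an unpaired surrogate
    }

    // Returns the code point boundary that precedes index i (0 < i <= length).
    static int previous(CharSequence s, int i) {
        if (i >= 2
                && Character.isLowSurrogate(s.charAt(i - 1))
                && Character.isHighSurrogate(s.charAt(i - 2))) {
            return i - 2; // step back over a whole surrogate pair
        }
        return i - 1;
    }
}
```

Each step inspects at most two code units, in either direction, so no state
and no lookup table are needed.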
The only case where it could cause problems is when working with substrings, but
such behavior can be specified precisely in the updated UTF-16 API.
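
As one possible way to specify that behavior, the sketch below widens a
substring range so that it never cuts a surrogate pair in two. The class
SafeSubstring and the "widen outward" policy are assumptions made for this
example; an API could just as well choose to throw an exception or to narrow
the range instead.

```java
// Hypothetical policy: offsets are UTF-16 code unit indexes, and an offset
// that falls inside a surrogate pair is moved outward to keep the pair whole.
final class SafeSubstring {
    // True when index i lies between the two halves of a surrogate pair.
    static boolean splitsPair(CharSequence s, int i) {
        return i > 0 && i < s.length()
                && Character.isHighSurrogate(s.charAt(i - 1))
                && Character.isLowSurrogate(s.charAt(i));
    }

    static String substring(String s, int begin, int end) {
        if (begin >= end) {
            // Empty or invalid range: keep the standard String behavior.
            return s.substring(begin, end);
        }
        if (splitsPair(s, begin)) begin--; // include the leading high surrogate
        if (splitsPair(s, end)) end++;     // include the trailing low surrogate
        return s.substring(begin, end);
    }
}
```

With s = "a\uD83D\uDE00b", the call SafeSubstring.substring(s, 0, 2) returns
the "a" followed by the complete pair instead of an isolated high surrogate.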
Additionally, anyone who imagines that it is always safe to create substrings
based only on UTF-32 code units or code points forgets that such an operation is
no safer, because the more user-friendly level is the default grapheme
cluster. More generally, an API based on handling individual "characters" (in
fact code points) has its own caveats, notably because of the effects of
normalization (reordering or recombining of combining characters). A good
i18n-enabled application or API had better work with strings, not with
individual "characters" (or code points or UTF-32 code units).
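
The following sketch shows the three levels side by side. It assumes a Java
runtime that provides String.codePointCount, java.text.BreakIterator and
java.text.Normalizer; the class TextUnits is a hypothetical name. Note that
BreakIterator only approximates default grapheme clusters, and how closely it
follows them depends on the Java version.

```java
import java.text.BreakIterator;
import java.text.Normalizer;

// Hypothetical helper comparing code units, code points and
// user-perceived characters for the same string.
final class TextUnits {
    static int codeUnits(String s) {
        return s.length();
    }

    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Counts boundaries reported by the character BreakIterator, which keeps
    // a base letter together with its combining marks.
    static int graphemes(String s) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(s);
        int n = 0;
        while (it.next() != BreakIterator.DONE) {
            n++;
        }
        return n;
    }

    // Canonical composition (NFC) may change the number of code points.
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }
}
```

For "e\u0301" (an "e" followed by COMBINING ACUTE ACCENT) there are two code
points but a single grapheme cluster, and NFC recombines them into the single
code point U+00E9. Cutting after the first code point would be "safe" at the
code point level and still wrong for the user.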
So the real difficulty is not in the choice of internal representation between
UTF-16 and UTF-32, but in performing safe computing on grapheme clusters. Even
if a String is internally stored and managed with UTF-16 code units, it is easy
to work on it with iterators, without transcoding it into temporary buffers. An
iterator that parses a UTF-16 encoded string and returns UTF-32 code units or
code points can safely be created to work with more complex iterators such as
grapheme cluster iterators. Libraries that can normalize UTF-16 strings without
creating any instance of a UTF-32 string also already exist. The performance
difference from a normalizer working on UTF-32 string instances is extremely
small, as the real complexity is not at the level of the internal encoding.
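
Such an iterator can be sketched in a few lines of Java. CodePointIterator is
a hypothetical class written for this example; it relies only on
Character.codePointAt and Character.charCount, and it reads the UTF-16 storage
in place without building any UTF-32 buffer.

```java
import java.util.NoSuchElementException;
import java.util.PrimitiveIterator;

// Hypothetical iterator: walks UTF-16 code units and yields code points.
final class CodePointIterator implements PrimitiveIterator.OfInt {
    private final CharSequence text;
    private int pos; // index of the next unread UTF-16 code unit

    CodePointIterator(CharSequence text) {
        this.text = text;
    }

    @Override
    public boolean hasNext() {
        return pos < text.length();
    }

    @Override
    public int nextInt() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        // Combines a surrogate pair into one code point when present;
        // an unpaired surrogate is returned as is.
        int cp = Character.codePointAt(text, pos);
        pos += Character.charCount(cp); // advance by 1 or 2 code units
        return cp;
    }
}
```

A grapheme cluster iterator or a normalizer could consume this stream of code
points directly, whatever the width of the underlying code units.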
Now imagine a database which is defined to store a length-limited VARCHAR(32)
field, where the SQL CHAR type is bound at database creation time to be a UTF-32
code unit. Nothing prevents the database from using UTF-16 internally in its
tables for storage, as counting the number of equivalent UTF-32 code units for
the same string is extremely easy (so a field constraint check has very little
to do to count characters, not significantly more than counting characters in a
null-terminated C/C++ string...).
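
A sketch of such a constraint check, under the assumption that the stored
value is available as an array of UTF-16 code units. The class VarcharCheck
and its method names are hypothetical and do not correspond to any real
database engine.

```java
// Hypothetical constraint check for a VARCHAR(n) column whose length is
// counted in code points while the storage uses UTF-16 code units.
final class VarcharCheck {
    // Single pass over the storage: each code unit counts as one code point,
    // except the low surrogate that completes a pair, which is skipped.
    static int codePointLength(char[] utf16) {
        int count = 0;
        for (int i = 0; i < utf16.length; i++) {
            count++;
            if (Character.isHighSurrogate(utf16[i])
                    && i + 1 < utf16.length
                    && Character.isLowSurrogate(utf16[i + 1])) {
                i++; // second half of the pair belongs to the same code point
            }
        }
        return count;
    }

    static boolean fits(char[] utf16, int maxCodePoints) {
        return codePointLength(utf16) <= maxCodePoints;
    }
}
```

A value made of 32 supplementary characters occupies 64 code units in storage
but still satisfies a VARCHAR(32) limit counted this way.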
JavaScript/ECMAScript and Java both define their "char" to be a UTF-16 code
unit, not necessarily a full code point. This is not a limitation once you
realize that the strlen() function or length() method is not specified to return
the number of code points, but the number of code units needed to allocate
backing-store arrays... A good String library would also add a function to
compute the number of code points in the same string, and its performance would
be equivalent to that of a strlen() that scans the string.
If needed, when the string should not be scanned, a String class could just as
well maintain a cache of the number of code points it stores. But today, most
software already performs strlen() calls extremely often, without impact. In
Java, where the String.length() method does not scan the backing store but uses
the cached length field of a sized array instance, one can just as well add a
member field to cache the length in code points...
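
A minimal sketch of that idea, using a hypothetical wrapper class
CountedString (java.lang.String itself has no such field). The count is
computed on first request with String.codePointCount and then reused.

```java
// Hypothetical wrapper: an immutable string value plus a cached
// code point count, filled in lazily.
final class CountedString {
    private final String value;
    private int codePointCount = -1; // -1 means "not computed yet"

    CountedString(String value) {
        this.value = value;
    }

    // Number of UTF-16 code units, as with String.length().
    int length() {
        return value.length();
    }

    // Number of code points; scans the string once, then uses the cache.
    // Safe for an immutable value: concurrent callers would at worst
    // compute the same number twice.
    int codePointLength() {
        if (codePointCount < 0) {
            codePointCount = value.codePointCount(0, value.length());
        }
        return codePointCount;
    }
}
```

The cost is one extra int per instance, and nothing is paid for strings whose
code point count is never requested.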
In fact, the actual internal storage of a String may still be UTF-32 or UTF-8 or
SCSU or CESU-8 or any other Unicode-compatible encoding, even if the API is
exposed in terms of UTF-16 code units. The actual size of code units in memory
should be hidden at the String API level, allowing various internal
representations that best fit the String source/usage/target. In fact, it is
also possible to mix various internal encodings across String instances, by
caching some member fields to store or flag the convention used for
encoding/decoding each one (this would also include the normalization state of
a string, or maybe other states such as previously checked letter case).
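
As an illustration only, the sketch below keeps either one byte or two bytes
per code unit internally, chosen per instance, while the public API stays in
UTF-16 code units. FlexString and isCompact are hypothetical names, and the
choice of an 8-bit Latin-1 form is an assumption of this example; UTF-8, SCSU
or another encoding would follow the same pattern with a different decoder.

```java
// Hypothetical string class: the internal encoding is a per-instance
// detail, and callers only ever see UTF-16 code units.
final class FlexString implements CharSequence {
    private final byte[] latin1; // used when every code unit is <= U+00FF
    private final char[] utf16;  // used otherwise

    FlexString(String s) {
        boolean narrow = true;
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0xFF) {
                narrow = false;
                break;
            }
        }
        if (narrow) {
            latin1 = new byte[s.length()];
            for (int i = 0; i < s.length(); i++) {
                latin1[i] = (byte) s.charAt(i);
            }
            utf16 = null;
        } else {
            latin1 = null;
            utf16 = s.toCharArray();
        }
    }

    // Flag telling which internal convention this instance uses.
    boolean isCompact() {
        return latin1 != null;
    }

    @Override
    public int length() {
        return latin1 != null ? latin1.length : utf16.length;
    }

    @Override
    public char charAt(int i) {
        return latin1 != null ? (char) (latin1[i] & 0xFF) : utf16[i];
    }

    @Override
    public CharSequence subSequence(int start, int end) {
        return new FlexString(toString().substring(start, end));
    }

    @Override
    public String toString() {
        char[] out = new char[length()];
        for (int i = 0; i < out.length; i++) {
            out[i] = charAt(i);
        }
        return new String(out);
    }
}
```

Code written against CharSequence cannot tell the two storage forms apart,
which is the point: the in-memory size of a code unit stays hidden.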
Consider a String as an object with a required immutable content (the string
value) and various properties which can be either computed or cached and added
as additional members. You can then minimize the number of transcoding, folding
or transform operations needed on a string to conform to an interface, by
delaying all these operations until they are really needed. In most cases, the
Character level is not very useful, except temporarily for computing these
operations...
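
A sketch of this "immutable value plus lazily cached properties" design,
taking the normalization state as the cached property. LazyText is a
hypothetical class; java.text.Normalizer is the only library API used, and it
is assumed to be available in the Java runtime.

```java
import java.text.Normalizer;

// Hypothetical string object: the value never changes, and derived
// properties are computed only when first requested, then kept.
final class LazyText {
    private final String value;
    private Boolean nfcChecked; // null until the NFC state has been checked
    private String nfcForm;     // null until the NFC form has been requested

    LazyText(String value) {
        this.value = value;
    }

    String value() {
        return value;
    }

    // Checks once whether the value is already in NFC.
    boolean isNfc() {
        if (nfcChecked == null) {
            nfcChecked = Normalizer.isNormalized(value, Normalizer.Form.NFC);
        }
        return nfcChecked;
    }

    // Normalizes at most once, and not at all if the value is already NFC.
    String toNfc() {
        if (nfcForm == null) {
            nfcForm = isNfc()
                    ? value
                    : Normalizer.normalize(value, Normalizer.Form.NFC);
        }
        return nfcForm;
    }
}
```

A string that is only stored and passed along never pays for normalization,
and a string compared many times pays for it once.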