From: Markus Scherer (firstname.lastname@example.org)
Date: Fri Apr 09 2004 - 13:00:30 EDT
Rick Cameron wrote:
> IMHO, that's a bit misleading. The String class itself does not appear to be
> aware of SMP characters. It clearly uses UTF-16, and the length it reports
> is the number of code units, not the number of characters or graphemes in
> the string.
There is no contradiction between using UTF-16 (or UTF-8) and handling all of Unicode. The same is
true for ICU, where we kept 16-bit strings so as not to break our users' code but started handling
supplementary characters in 2000. Everyone else goes down that path once supplementary characters
become important to them and their users. Java 1.5 is on its way there.
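To illustrate the point, here is a minimal sketch of how a 16-bit string API can still handle
supplementary characters. It assumes the code point methods on String and Character that were added
in Java 1.5; the class name Utf16Demo and its helper methods are made up for this example.

```java
public class Utf16Demo {
    // "a", then U+1D504 (MATHEMATICAL FRAKTUR CAPITAL A, a supplementary
    // character stored as the surrogate pair D835 DD04), then "b".
    static final String SAMPLE = "a\uD835\uDD04b";

    // Length as the String class reports it: 16-bit code units.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of code points; a surrogate pair counts once.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Walk the string by code point while still indexing by code unit.
    static int[] toCodePoints(String s) {
        int[] out = new int[s.codePointCount(0, s.length())];
        int n = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);       // reads one or two code units
            out[n++] = cp;
            i += Character.charCount(cp);    // advance by 1 or 2 code units
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(codeUnits(SAMPLE));   // prints 4
        System.out.println(codePoints(SAMPLE));  // prints 3
    }
}
```

The indexes stay code unit indexes throughout; only the operations that care about whole characters
look at surrogate pairs.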
> Does anyone know of a String class, in C++, Java or .NET, that hides the
> actual encoding used, and provides a public API based on code points?
Possible, but that would be inefficient. If your API semantics and your implementation are
mismatched on such a fundamental level, you incur a performance penalty. Either operations take
longer because indexes have to be translated on the fly, or memory use increases because you have
to keep and maintain a map from code point indexes to code unit offsets. You can optimize it, of
course, especially for UTF-16, where surrogate pairs will be extremely rare, but there is very
little point in not indexing 16-bit code units.
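A rough sketch of the first option (translating indexes on the fly) might look like the following.
CodePointString is a hypothetical wrapper invented for this example, not an existing class; it
assumes the Java 1.5 methods String.offsetByCodePoints, String.codePointCount and
String.codePointAt.

```java
// Hypothetical wrapper: the public API is indexed by code point,
// the storage underneath is an ordinary UTF-16 String.
public final class CodePointString {
    private final String utf16;

    public CodePointString(String utf16) {
        this.utf16 = utf16;
    }

    // Length in code points. Requires a scan of the whole string,
    // where String.length() is a constant-time lookup.
    public int length() {
        return utf16.codePointCount(0, utf16.length());
    }

    // Code point at a code point index. The index is translated to a
    // code unit offset by scanning from the start on every call, so
    // random access is linear in the index instead of constant time.
    public int codePointAt(int codePointIndex) {
        int unitIndex = utf16.offsetByCodePoints(0, codePointIndex);
        return utf16.codePointAt(unitIndex);
    }
}
```

Used as new CodePointString("a\uD835\uDD04b"), length() gives 3 and codePointAt(1) gives 0x1D504,
but each call pays for a scan. Caching the translation in an int[] of offsets would make access
fast again at the cost of the extra memory and of keeping the map up to date.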
See also http://www.unicode.org/notes/tn12/
-- Opinions expressed here may not reflect my company's positions unless otherwise noted.