Re: surrogate at java's property file

From: Markus Scherer (markus.scherer@jtcsv.com)
Date: Mon Oct 08 2001 - 19:40:03 EDT


For Java, the support for supplementary characters is actually better than one might think.

It is true that the char type and the Character class only support 16-bit code units. However, storing UTF-16 strings in String objects and char[] arrays, and passing code points as ints in non-JDK APIs, works just fine.
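To make the "code points as ints" approach concrete, here is a minimal sketch using only UTF-16 surrogate arithmetic. The class name Utf16Sketch and its methods are hypothetical helpers made up for illustration, not part of the JDK or ICU4J.

```java
// Hypothetical helper; class and method names are illustrative assumptions.
final class Utf16Sketch {
    static boolean isLead(char c)  { return c >= 0xD800 && c <= 0xDBFF; }
    static boolean isTrail(char c) { return c >= 0xDC00 && c <= 0xDFFF; }

    // Returns the code point that starts at index i of s as an int.
    static int codePointAt(String s, int i) {
        char lead = s.charAt(i);
        if (isLead(lead) && i + 1 < s.length()) {
            char trail = s.charAt(i + 1);
            if (isTrail(trail)) {
                // Combine the two 10-bit halves and add the supplementary offset.
                return ((lead - 0xD800) << 10) + (trail - 0xDC00) + 0x10000;
            }
        }
        return lead; // BMP character or unpaired surrogate
    }

    // Appends code point cp to buf as one or two UTF-16 code units.
    static void append(StringBuffer buf, int cp) {
        if (cp < 0x10000) {
            buf.append((char) cp);
        } else {
            int v = cp - 0x10000;
            buf.append((char) (0xD800 + (v >> 10)));   // lead surrogate
            buf.append((char) (0xDC00 + (v & 0x3FF))); // trail surrogate
        }
    }
}
```

The String itself stays plain UTF-16; only the API boundary widens from char to int.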

The JDK layout engine (which is shared with ICU4C) can display UTF-16 text that includes supplementary characters.

Some JDK converters, where necessary, convert supplementary characters. There will be a GB 18030 converter, for example. Note that the IBM JDK has fixpacks going back at least to 1.3 if not 1.2.2 to add GB 18030 support.
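As a rough illustration of what such a converter does with a supplementary character, the sketch below round-trips a string through GB 18030. It assumes the runtime provides a charset registered under the name "GB18030", which is not guaranteed on every JDK, and the class name Gb18030Sketch is made up for this example.

```java
import java.nio.charset.Charset;

// Hypothetical helper; assumes a charset named "GB18030" may be installed.
final class Gb18030Sketch {
    // Encodes s to GB 18030 bytes and decodes it back.
    // Returns null when the runtime has no GB 18030 converter.
    static String roundTrip(String s) {
        if (!Charset.isSupported("GB18030")) {
            return null;
        }
        Charset gb = Charset.forName("GB18030");
        byte[] bytes = s.getBytes(gb);
        return new String(bytes, gb);
    }
}
```

If the converter handles supplementary characters, a surrogate pair such as "\uD840\uDC00" (U+20000) should come back unchanged.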

Also, if you get ICU4J, you can use its UCharacter class, which uses int types for code points. ICU4J 2.0 will soon come out of the box with Unicode 3.1 properties data; watch http://oss.software.ibm.com/icu4j/ for it. (You can already build such a properties file with ICU4C's genprops tool.)

Changing the string storage in Java fundamentally from UTF-16 to UTF-32 is impossible with the legacy of Java and JNI code out there. All indexing, length counting, use of char as an integer type, JNI string access (GetStringChars() and the like), etc. would be broken, and interfacing with major operating systems, browsers, and other software would suddenly be more complicated and require UTF re-transformation. Bad idea.
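A small sketch of why existing code depends on UTF-16 code units: length() and charAt() count 16-bit units, so any stored index or length would change meaning under UTF-32 storage. The class name CodeUnitCountSketch and its method are hypothetical, for illustration only.

```java
// Hypothetical helper; names are illustrative assumptions.
final class CodeUnitCountSketch {
    // Counts code points in s, treating each well-formed surrogate pair as one.
    static int countCodePoints(String s) {
        int n = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            boolean pair = c >= 0xD800 && c <= 0xDBFF
                    && i + 1 < s.length()
                    && s.charAt(i + 1) >= 0xDC00
                    && s.charAt(i + 1) <= 0xDFFF;
            if (pair) {
                i++; // skip the trail surrogate
            }
            n++;
        }
        return n;
    }
}
```

For "a\uD800\uDC00b", length() reports 4 code units while the string holds 3 code points, and the 'b' sits at index 3. Code that relies on those numbers would see different values if the storage unit changed.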

Best regards,
markus


