RE: explicit 20 bit Unicode range limit (was: UTF-20 etc.)

From: Paul Dempsey (Exchange) (paulde@exchange.microsoft.com)
Date: Tue Jan 26 1999 - 16:10:05 EST


-----Original Message-----
From: schererm@us.ibm.com [mailto:schererm@us.ibm.com]
Sent: Tuesday, January 26, 1999 11:08 AM
To: Unicode List
Subject: explicit 20 bit Unicode range limit (was: UTF-20 etc.)
...
For example, Java .properties and .java files that use a traditional source
encoding make use of an escape sequence \uxxxx to represent Unicode
characters, using 4 hexadecimal digits. Non-BMP characters have to be
written as surrogate pairs, although the source encoding is not UTF-16.
(U-000e 0061 -> \udb40\udc61)
Anyone who writes a .properties or .java file with a non-BMP character has
to know or be able to calculate the UTF-16 form.
It seems desirable to have a new escape sequence format that allows any
character to be written as a scalar value. If this is done, the format may take
advantage of the recommended range of characters to use a fixed length of
fewer than the 8 hex digits that would be needed to cover the full UCS-4
range. (e.g. U-000e 0061 -> \q0e0061 or \qe0061 etc.)
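
The calculation a .properties or .java author currently has to do by hand
can be sketched roughly as follows. This is only an illustration in Python;
the function name to_java_escape is made up for this example and is not part
of any Java or Unicode API.

```python
def to_java_escape(code_point: int) -> str:
    # Hypothetical helper: returns the backslash-u escape(s) that a Java
    # source file needs today for the given Unicode scalar value.
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the range reachable with surrogate pairs")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points are not scalar values")
    if code_point < 0x10000:
        # BMP character: a single 4-digit escape is enough.
        return "\\u%04x" % code_point
    # Non-BMP character: split the 20-bit offset into two 10-bit halves.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # leading (high) surrogate
    low = 0xDC00 + (offset & 0x3FF)   # trailing (low) surrogate
    return "\\u%04x\\u%04x" % (high, low)


print(to_java_escape(0xE0061))  # \udb40\udc61
```

For U-000e 0061 the offset is 0xd0061, which gives the pair db40 dc61 shown
in the example above.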

...

Your goal is a natural source-level representation of characters that
require surrogate pairs in UTF-16. Software can easily interpret a
5-digit escape code and generate a surrogate pair if the underlying
representation is UTF-16. You don't need a 20-bit encoding to achieve
this. The internal representation could just as well be UTF-8, UCS-4, or
anything else, and it shouldn't matter. Your problem here is that Java
specifies that escape sequences give the UTF-16 encoding of a character.
What you'd like is for Java escapes to specify the UCS-4 code point,
and generate the appropriate representation in the underlying encoding.
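
As a rough sketch of that idea (in Python, not Java): the escape below is
the hypothetical \q form from the quoted message, assumed here to be exactly
five hex digits, and the names SCALAR_ESCAPE and decode_scalar_escapes are
invented for this example. The escape names a code point; the encoding is
chosen only when the text is stored.

```python
import re

# Assumption: backslash, 'q', then exactly five hex digits (hypothetical syntax).
SCALAR_ESCAPE = re.compile(r"\\q([0-9A-Fa-f]{5})")


def decode_scalar_escapes(text: str) -> str:
    # Replace each escape with the character whose code point it names.
    def replace(match):
        code_point = int(match.group(1), 16)
        if 0xD800 <= code_point <= 0xDFFF:
            raise ValueError("surrogate code points are not characters")
        return chr(code_point)

    return SCALAR_ESCAPE.sub(replace, text)


ch = decode_scalar_escapes("\\qe0061")

# One code point, three underlying representations:
print(ch.encode("utf-16-be").hex())  # db40dc61 (surrogate pair)
print(ch.encode("utf-8").hex())      # f3a08181 would be wrong; see next line
print(ch.encode("utf-32-be").hex())  # 000e0061 (UCS-4 style)
```

The UTF-8 line prints f3a081a1. The author of the source file only ever
writes e0061; the surrogate pair db40 dc61 appears only if the software
happens to store the text as UTF-16.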

It might be appropriate for the Unicode standard to recommend that
software interpret escape codes and hex sequences as UCS-4 code points
and free the user from knowing the details of the encoding.

--- Paul


