Re: Does Java 1.5 support Unicode math alphanumerics as variable names?

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Mon Jan 26 2004 - 07:01:41 EST

Next message: Peter Kirk: "Collation for Greek letter koppa"

Previous message: Arcane Jill: "RE: Does Java 1.5 support Unicode math alphanumerics as variable names?"
In reply to: Arcane Jill: "RE: Does Java 1.5 support Unicode math alphanumerics as variable names?"

From: Arcane Jill

> I would be very surprised if it did, since Java chars are still only
> sixteen bits wide,

Yes but they include surrogates as valid values for char, so UTF-16 could be
used to represent variable names. The main problem is not with variables,
member variables and methods, but with class and package names which need to
be mappable into filenames to be stored in a local filesystem (at compile
time) or in a zip directory entry (for packaged applications). Not all
characters are usable as valid filenames, due to the way filesystems may
rearrange or normalize or transcode these names.

> and the new math alphanumerics are not in BMP. Still, I'd be very happy to
> be proved wrong on this one.

For now the JLS
(http://java.sun.com/docs/books/jls/second_edition/html/lexical.doc.html)
defines the language's lexical translation as supporting only Unicode 2.1.
Everything else must use "Unicode escapes", notably characters outside the
BMP, which need to be represented as "\uD8xx\uDCxx" and can then only be
used within String constants.
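
As an illustration of that escape form, here is a small sketch that derives
the surrogate pair for one math alphanumeric, U+1D400 (MATHEMATICAL BOLD
CAPITAL A), and compares it with a String constant written with Unicode
escapes. The class and method names are invented for this example; the
arithmetic is the standard UTF-16 encoding of a code point above U+FFFF.

```java
public class SurrogateEscapeDemo {
    // High (leading) surrogate for a code point above U+FFFF.
    public static char highSurrogate(int codePoint) {
        return (char) (0xD800 + ((codePoint - 0x10000) >> 10));
    }

    // Low (trailing) surrogate for the same code point.
    public static char lowSurrogate(int codePoint) {
        return (char) (0xDC00 + ((codePoint - 0x10000) & 0x3FF));
    }

    public static void main(String[] args) {
        int boldA = 0x1D400; // MATHEMATICAL BOLD CAPITAL A
        // The same character written as two Unicode escapes in a String constant.
        String literal = "\uD835\uDC00";
        System.out.println(literal.length());
        System.out.println(literal.charAt(0) == highSurrogate(boldA));
        System.out.println(literal.charAt(1) == lowSurrogate(boldA));
    }
}
```

The String holds two char values for that single character, which is why
it fits in a String constant but not in a single char.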

In section 3.8 -- Identifiers --, you'll find this:

[quote]
An identifier is an unlimited-length sequence of Java letters and Java
digits, the first of which must be a Java letter. An identifier cannot have
the same spelling (Unicode character sequence) as a keyword (§3.9), boolean
literal (§3.10.3), or the null literal (§3.10.7).

Letters and digits may be drawn from the entire Unicode character set, which
supports most writing scripts in use in the world today, including the large
sets for Chinese, Japanese, and Korean. This allows programmers to use
identifiers in their programs that are written in their native languages.

A "Java letter" is a character for which the method
Character.isJavaIdentifierStart returns true. A "Java letter-or-digit" is a
character for which the method Character.isJavaIdentifierPart returns true.
[/quote]
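
The rule quoted above can be sketched as a char-by-char check. This is only
an illustration built on the two Character methods named in the JLS text;
the class and method names are invented here, and the check deliberately
ignores the keyword and literal exclusions.

```java
public class IdentifierCheck {
    // Returns true if every char of s is acceptable at its position,
    // per Character.isJavaIdentifierStart / isJavaIdentifierPart.
    // Keywords and the boolean/null literals are NOT excluded here.
    public static boolean looksLikeIdentifier(String s) {
        if (s.length() == 0 || !Character.isJavaIdentifierStart(s.charAt(0))) {
            return false;
        }
        for (int i = 1; i < s.length(); i++) {
            if (!Character.isJavaIdentifierPart(s.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(looksLikeIdentifier("count1"));
        System.out.println(looksLikeIdentifier("1count"));
        System.out.println(looksLikeIdentifier("\u03B1\u03B2")); // Greek alpha, beta
    }
}
```

A name starting with a digit is rejected, while BMP letters from other
scripts (Greek here) are accepted.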

As the valid characters usable in identifiers need to be mappable to a
single char value (which only holds a UCS-2 code unit) so that
Character.isJavaIdentifierStart() or Character.isJavaIdentifierPart() can
return true, including characters outside the BMP in identifiers would
require that surrogates be included among the char values for which these
tests return true.

So let's see the Character class documentation:

[quote]
isJavaIdentifierPart
public static boolean isJavaIdentifierPart(char ch)
  Determines if the specified character may be part of a Java identifier as
other than the first character.
  A character may be part of a Java identifier if any of the following are
true:
  * it is a letter (matches the general categories UPPERCASE_LETTER,
LOWERCASE_LETTER, TITLECASE_LETTER, MODIFIER_LETTER, OTHER_LETTER)
  * it is a currency symbol (such as '$')
  * it is a connecting punctuation character (such as '_')
  * it is a digit
  * it is a numeric letter (such as a Roman numeral character)
  * it is a combining mark
  * it is a non-spacing mark
  * isIdentifierIgnorable returns true for the character
Parameters:
  ch - the character to be tested.
Returns:
  true if the character may be part of a Java identifier; false otherwise.
Since:
  1.1
See Also:
  isIdentifierIgnorable(char), isJavaIdentifierStart(char),
isLetterOrDigit(char), isUnicodeIdentifierPart(char)
[/quote]
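
To make the list above concrete, the following sketch tests a char against
the listed general categories using Character.getType(). It is a
hand-written approximation for illustration (the helper name is
hypothetical), not the JDK's actual implementation, and it is only meant to
be tried on a few sample characters.

```java
public class ListedCategories {
    // Hypothetical helper: does ch fall under one of the conditions
    // listed in the isJavaIdentifierPart documentation?
    public static boolean matchesListedCategory(char ch) {
        switch (Character.getType(ch)) {
            case Character.UPPERCASE_LETTER:
            case Character.LOWERCASE_LETTER:
            case Character.TITLECASE_LETTER:
            case Character.MODIFIER_LETTER:
            case Character.OTHER_LETTER:
            case Character.CURRENCY_SYMBOL:
            case Character.CONNECTOR_PUNCTUATION:
            case Character.DECIMAL_DIGIT_NUMBER:
            case Character.LETTER_NUMBER:
            case Character.COMBINING_SPACING_MARK:
            case Character.NON_SPACING_MARK:
                return true;
            default:
                // Last condition in the list: identifier-ignorable characters.
                return Character.isIdentifierIgnorable(ch);
        }
    }

    public static void main(String[] args) {
        char[] samples = { 'a', '$', '_', '7', '\u2160', '\u0301', ' ', '+' };
        for (int i = 0; i < samples.length; i++) {
            System.out.println(Integer.toHexString(samples[i]) + " "
                    + matchesListedCategory(samples[i]));
        }
    }
}
```

The samples cover a letter, a currency symbol, a connector, a digit, a Roman
numeral (U+2160), a non-spacing mark (U+0301), and two characters (space and
plus sign) that match none of the conditions.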

The requirements above make surrogates unsuitable for identifiers, simply
because surrogates have no general category that matches the above
requirements:
* they are neither letters nor currency symbols, and so on, because
surrogates have no general category from that list;
* the Character class just lists them with a SURROGATE general category, see:
int Character.getType(char ch);
* and although isDefined() returns true for them (their ranges are listed in
the UCD), SURROGATE satisfies none of the conditions above.
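
This classification can be observed directly with the char-based methods.
A small sketch (class and helper names invented for the example), applied
to the two surrogate halves of U+1D400:

```java
public class SurrogateCategory {
    // True if the Character class reports ch with the SURROGATE category.
    public static boolean isSurrogateType(char ch) {
        return Character.getType(ch) == Character.SURROGATE;
    }

    public static void main(String[] args) {
        char high = '\uD835'; // leading half of U+1D400
        char low = '\uDC00';  // trailing half of U+1D400
        System.out.println(isSurrogateType(high) && isSurrogateType(low));
        System.out.println(Character.isJavaIdentifierStart(high));
        System.out.println(Character.isJavaIdentifierPart(low));
    }
}
```

This prints true, false, false: both halves are reported as SURROGATE, and
neither passes the char-based identifier tests.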

I doubt that this can be changed.


This archive was generated by hypermail 2.1.5 : Mon Jan 26 2004 - 07:41:59 EST