Re: Does Java 1.5 support Unicode math alphanumerics as variable names?

From: Philippe Verdy (
Date: Mon Jan 26 2004 - 07:01:41 EST


    From: Arcane Jill

    > I would be very surprised if it did, since Java chars are still only
    > sixteen bits wide,

    Yes, but they include surrogates as valid values for char, so UTF-16 could be
    used to represent variable names. The main problem is not with variables,
    member variables and methods, but with class and package names which need to
    be mappable into filenames to be stored in a local filesystem (at compile
    time) or in a zip directory entry (for packaged applications). Not all
    characters are usable as valid filenames, due to the way filesystems may
    rearrange or normalize or transcode these names.

    > and the new math alphanumerics are not in BMP. Still, I'd be very happy to
    > be proved wrong on this one.

    For now the JLS defines the language's lexical translation as supporting
    only Unicode 2.1. Everything else must use "Unicode escapes", notably
    characters outside the BMP, which need to be represented as surrogate pairs
    of the form "\uD8xx\uDCxx" and can then only be used within String
    constants.
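
To make the escape form concrete, here is a minimal sketch of how the surrogate pair for a character outside the BMP can be computed by hand. The class and method names (SurrogateEscape, toSurrogates, toEscape) and the choice of U+1D400 (MATHEMATICAL BOLD CAPITAL A) are illustrative assumptions of this sketch, not anything taken from the JLS; the arithmetic is the usual UTF-16 encoding.

```java
final class SurrogateEscape {
    // Split a supplementary code point (U+10000..U+10FFFF) into its
    // UTF-16 high and low surrogate code units.
    static char[] toSurrogates(int codePoint) {
        int offset = codePoint - 0x10000;
        char high = (char) (0xD800 + (offset >> 10));   // top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));  // bottom 10 bits
        return new char[] { high, low };
    }

    // Format one 16-bit code unit as four uppercase hex digits.
    static String hex4(char c) {
        String h = Integer.toHexString(c).toUpperCase();
        while (h.length() < 4) {
            h = "0" + h;
        }
        return h;
    }

    // Build the text of the two Unicode escapes one would type in source code.
    static String toEscape(int codePoint) {
        char[] pair = toSurrogates(codePoint);
        return "\\u" + hex4(pair[0]) + "\\u" + hex4(pair[1]);
    }

    public static void main(String[] args) {
        // Hypothetical example value: MATHEMATICAL BOLD CAPITAL A.
        System.out.println(toEscape(0x1D400));

        // The same character written with escapes inside a String constant
        // occupies two char values.
        String boldA = "\uD835\uDC00";
        System.out.println(boldA.length());
    }
}
```

The String constant compiles because the escapes are translated before the literal is read; the same two code units are what the char-based identifier rules discussed in this message would have to accept.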

    In section 3.8 -- Identifiers --, you'll find this:

    An identifier is an unlimited-length sequence of Java letters and Java
    digits, the first of which must be a Java letter. An identifier cannot have
    the same spelling (Unicode character sequence) as a keyword (3.9), boolean
    literal (3.10.3), or the null literal (3.10.7).

    Letters and digits may be drawn from the entire Unicode character set, which
    supports most writing scripts in use in the world today, including the large
    sets for Chinese, Japanese, and Korean. This allows programmers to use
    identifiers in their programs that are written in their native languages.
    A "Java letter" is a character for which the method
    Character.isJavaIdentifierStart returns true. A "Java letter-or-digit" is a
    character for which the method Character.isJavaIdentifierPart returns true.
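
As a rough illustration of the rule quoted above, the following sketch applies the two Character tests one 16-bit char at a time. The class name IdentifierCheck and the method looksLikeIdentifier are hypothetical; the sketch does not reject keywords, boolean literals, or null, and it is not meant to describe how any particular compiler scans identifiers.

```java
final class IdentifierCheck {
    // Applies the JLS 3.8 character rule to each char of the string:
    // the first must be a "Java letter", the rest "Java letter-or-digit".
    // Assumption of this sketch: keywords and literals are not excluded.
    static boolean looksLikeIdentifier(String s) {
        if (s.length() == 0 || !Character.isJavaIdentifierStart(s.charAt(0))) {
            return false;
        }
        for (int i = 1; i < s.length(); i++) {
            if (!Character.isJavaIdentifierPart(s.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Sample inputs chosen for illustration only.
        String[] samples = { "count1", "1count", "$_tmp", "\u03B1\u03B2", "\uD835\uDC00" };
        for (int i = 0; i < samples.length; i++) {
            System.out.println(looksLikeIdentifier(samples[i]));
        }
    }
}
```

The last sample is a surrogate pair for a math alphanumeric; tested char by char, neither half qualifies as a Java letter.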

    The characters usable in identifiers need to be representable as a single
    char or Character instance (which only supports UCS-2 code units) so that
    Character.isJavaIdentifierStart() or Character.isJavaIdentifierPart() can
    return true. Including characters outside the BMP in identifiers would
    therefore require surrogates to be among the char values for which these
    tests return true.

    So let's see the Character class documentation:

    public static boolean isJavaIdentifierPart(char ch)
      Determines if the specified character may be part of a Java identifier as
    other than the first character.
      A character may be part of a Java identifier if any of the following are
      true:
      * it is a letter (matches the general categories UPPERCASE_LETTER,
      * it is a currency symbol (such as '$')
      * it is a connecting punctuation character (such as '_')
      * it is a digit
      * it is a numeric letter (such as a Roman numeral character)
      * it is a combining mark
      * it is a non-spacing mark
      * isIdentifierIgnorable returns true for the character
    Parameters:
      ch - the character to be tested.
    Returns:
      true if the character may be part of a Java identifier; false otherwise.
    See Also:
      isIdentifierIgnorable(char), isJavaIdentifierStart(char),
    isLetterOrDigit(char), isUnicodeIdentifierPart(char)

    The requirements above make surrogates unsuitable for identifiers, simply
    because surrogates have no suitable general category that matches the list
    above:
    * they are neither letters nor currency symbols, and so on, because
    surrogates have NO general category;
    * the Character class just lists them with a SURROGATE general category, see:
    int Character.getType(char ch);
    * isDefined() returns true for them, but that only means the value falls
    in a range listed in the UCD, not that it is a letter or a digit;
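
The points above can be checked with a small sketch that asks the Character class about a single surrogate code unit. The class name SurrogateCategory, the helper fitsIdentifier, and the sample values 0xD835 and 0xDC00 are assumptions made for illustration.

```java
final class SurrogateCategory {
    // True when the char-based API reports the SURROGATE general category.
    static boolean isSurrogateType(char c) {
        return Character.getType(c) == Character.SURROGATE;
    }

    // True when the char could appear anywhere in an identifier
    // according to the char-based tests quoted above.
    static boolean fitsIdentifier(char c) {
        return Character.isJavaIdentifierStart(c) || Character.isJavaIdentifierPart(c);
    }

    public static void main(String[] args) {
        char high = '\uD835';  // a high surrogate (hypothetical sample)
        char low = '\uDC00';   // a low surrogate (hypothetical sample)

        System.out.println(isSurrogateType(high));
        System.out.println(isSurrogateType(low));
        System.out.println(Character.isLetter(high));
        System.out.println(fitsIdentifier(high));
        System.out.println(fitsIdentifier(low));
    }
}
```

Each half of a pair is tested in isolation here, which is exactly the limitation: a method that takes one char never sees the supplementary character the pair stands for.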

    I doubt that this can be changed.
