> I was testing the isLetter method in 1.1.3 on some Katakana characters
> and found that the following characters were identified as letters:
> \u30FC Katakana-Hiragana Prolonged Sound Mark
> \u30FD Katakana Iteration Mark
> \u30FE Katakana Voiced Iteration mark
These characters (and a few others like them) are defined to be
"extenders" (cf. TUS 2.0, page 5-26). Although not letters themselves,
they "extend" letters by application to them. Other examples are
length marks. It would be incorrect to break an identifier at one
of them when parsing for an alphanumeric span.
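
To make the identifier point concrete, here is a minimal sketch of a span scanner. The class and method names (SpanScan, endOfSpan) are my own and hypothetical; only Character.isLetterOrDigit is a real API call. It assumes a JDK whose character data classifies U+30FC as a letter, as discussed here.

```java
public class SpanScan {
    // Returns the index just past the alphanumeric span that begins at 'start'.
    // Hypothetical helper for illustration only.
    static int endOfSpan(String s, int start) {
        int i = start;
        while (i < s.length() && Character.isLetterOrDigit(s.charAt(i))) {
            i++;
        }
        return i;
    }

    public static void main(String[] args) {
        // Katakana KO, prolonged sound mark, HI, prolonged sound mark,
        // then a space and some digits.
        String text = "\u30B3\u30FC\u30D2\u30FC 123";
        int end = endOfSpan(text, 0);
        System.out.println("span length: " + end);
    }
}
```

Because the prolonged sound mark counts as a letter, the scan runs through the whole katakana word and stops only at the space. Had the mark been treated as a non-letter, the word would have been cut after its first character.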
> This seems incorrect according to Java's explanation:
> * A character is considered to be a letter if and only if
> * it is specified to be a letter by the Unicode 2.0 standard
> * (category "Lu", "Ll", "Lt", "Lm", or "Lo" in the Unicode
> * specification data file).
These characters are in fact all correctly identified as "Lm" in
the Unicode specification data file (UnicodeData-2.0.14.txt, aka
UNIDATA.TXT), so the return value of the Java isLetter method
is correct.
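
If you want to check the category from Java itself, something like the sketch below should do it. The class name and the letterCategory helper are hypothetical; Character.getType, Character.isLetter and the category constants are the real java.lang.Character API. The result depends on the Unicode data shipped with your JDK, so treat it as a check to run, not a guarantee.

```java
public class ExtenderCategoryCheck {
    // Maps the general category reported by Character.getType to the
    // two-letter abbreviation used in the Unicode data file.
    // Hypothetical helper; only the five letter categories are handled.
    static String letterCategory(char c) {
        switch (Character.getType(c)) {
            case Character.UPPERCASE_LETTER: return "Lu";
            case Character.LOWERCASE_LETTER: return "Ll";
            case Character.TITLECASE_LETTER: return "Lt";
            case Character.MODIFIER_LETTER:  return "Lm";
            case Character.OTHER_LETTER:     return "Lo";
            default:                         return "not a letter";
        }
    }

    public static void main(String[] args) {
        // The three characters from the original question.
        char[] samples = { '\u30FC', '\u30FD', '\u30FE' };
        for (int i = 0; i < samples.length; i++) {
            char c = samples[i];
            System.out.println("U+" + Integer.toHexString(c).toUpperCase()
                    + " category=" + letterCategory(c)
                    + " isLetter=" + Character.isLetter(c));
        }
    }
}
```

On a JDK that follows the data file, all three should come back as "Lm" with isLetter true, which matches the "if and only if" wording quoted above.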
> * Note that most ideographic characters are considered
> * to be letters (category "Lo") for this purpose.
> Has anyone else noticed this?
This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:35 EDT