From: Markus Scherer (markus.icu@gmail.com)
Date: Fri Feb 17 2006 - 12:31:28 CST
I assume it's an oversight for Java to not return the Han numeric
values. They are Unicode property values for those characters. Due to
the limited syntax of UnicodeData.txt, they are available only in
Unihan.txt and, as Andrew said, in extracted/DerivedNumericValues.txt
- the latter is small and easy to parse.
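To illustrate how little parsing that file needs, here is a minimal sketch
using only the standard Java library. It assumes each data line has the form
"code point (or first..last range) ; decimal value", optionally followed by
further ';' fields and a '#' comment; check that against the copy of
DerivedNumericValues.txt you actually ship. The class name, method name, and
the sample lines in main are made up for illustration and are not part of any
library.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class DerivedNumericValues {

    /**
     * Builds a code point -> numeric value map from lines laid out like
     * DerivedNumericValues.txt (assumed layout: "XXXX[..YYYY] ; value ...").
     */
    static Map<Integer, Double> parse(Iterable<String> lines) {
        Map<Integer, Double> values = new LinkedHashMap<>();
        for (String raw : lines) {
            // Drop the trailing '#' comment, then skip blank lines.
            int hash = raw.indexOf('#');
            String line = (hash >= 0 ? raw.substring(0, hash) : raw).trim();
            if (line.isEmpty()) {
                continue;
            }
            String[] fields = line.split(";");
            if (fields.length < 2) {
                continue; // not a data line in the assumed layout
            }
            // Field 0 is a single code point or a "first..last" range, in hex.
            String range = fields[0].trim();
            int dots = range.indexOf("..");
            int first = Integer.parseInt(dots < 0 ? range : range.substring(0, dots), 16);
            int last = dots < 0 ? first : Integer.parseInt(range.substring(dots + 2), 16);
            // Field 1 is the numeric value written as a decimal.
            double value = Double.parseDouble(fields[1].trim());
            for (int cp = first; cp <= last; cp++) {
                values.put(cp, value);
            }
        }
        return values;
    }

    public static void main(String[] args) {
        // Illustrative lines only; in practice read them from the real file.
        List<String> sample = List.of(
            "# sample input",
            "0F33 ; -0.5 # TIBETAN DIGIT HALF ZERO",
            "4E00 ; 1.0 # CJK UNIFIED IDEOGRAPH-4E00",
            "4E8C ; 2.0 # CJK UNIFIED IDEOGRAPH-4E8C");
        Map<Integer, Double> values = parse(sample);
        System.out.println(values.get(0x4E8C));
    }
}
```

In a real program you would pass in the lines of the file itself, for
example via java.nio.file.Files.readAllLines, and look code points up in the
resulting map.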
ICU4J's UCharacter class returns the numeric values for Han
characters. You could just use ICU instead of rolling your own.
http://icu.sourceforge.net/apiref/icu4j/com/ibm/icu/lang/UCharacter.html#getUnicodeNumericValue(int)
http://www-306.ibm.com/software/globalization/icu/downloads.jsp
Best regards,
markus
On 2/17/06, Kit Peters <popefelix@gmail.com> wrote:
> Well, *I'm* only interested in numbers, but the larger project that I'm
> working within covers all of Unicode.
>
> On 2/17/06, Andrew West <andrewcwest@gmail.com> wrote:
> > On 17/02/06, Kit Peters <popefelix@gmail.com> wrote:
> > > 1) Is there a native Java way to retrieve the numeric values for these
> > > characters (i.e. a way that doesn't involve me parsing Unihan.txt)?
> >
> > If you're only interested in numbers, why not parse the following
> > files directly, instead of UnicodeData.txt and Unihan.txt? They cover
> > all characters defined as numbers by Unicode, including CJK
> > ideographs.
-- Opinions expressed here may not reflect my company's positions unless otherwise noted.