From: Elliotte Harold (elharo@metalab.unc.edu)
Date: Thu Mar 03 2005 - 07:29:13 CST
Currently my Java library (XOM) is dragging along a hefty chunk (344K)
of IBM's open source ICU just to support one rarely invoked method that
converts strings into NFC. I'd like to get rid of that. Given the nature
of my application, eliminating the extra jar file and its size matters
more to me than having the fastest, most intelligent NFC algorithm.
Thus I'm looking at ways to implement NFC that don't require me to drag
around the Unicode data files or substantial chunks thereof. I notice
that java.lang.Character has a getType method that returns the Unicode
general category for each character. This has been built into Java since 1.1.
It lets me tell if a character is one of the following types:
COMBINING_SPACING_MARK
CONNECTOR_PUNCTUATION
CONTROL
CURRENCY_SYMBOL
DASH_PUNCTUATION
DECIMAL_DIGIT_NUMBER
ENCLOSING_MARK
END_PUNCTUATION
FINAL_QUOTE_PUNCTUATION
FORMAT
INITIAL_QUOTE_PUNCTUATION
LETTER_NUMBER
LINE_SEPARATOR
LOWERCASE_LETTER
MATH_SYMBOL
MODIFIER_LETTER
MODIFIER_SYMBOL
NON_SPACING_MARK
OTHER_LETTER
OTHER_NUMBER
OTHER_PUNCTUATION
OTHER_SYMBOL
PARAGRAPH_SEPARATOR
PRIVATE_USE
SPACE_SEPARATOR
START_PUNCTUATION
SURROGATE
TITLECASE_LETTER
UNASSIGNED
UPPERCASE_LETTER
Is this sufficient information to perform NFC normalization? Or is that
a pipe dream and I'm just going to need to drag along the Unicode data
file or part thereof?
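For concreteness, here is a minimal sketch of what getType actually reports for a base letter, a precomposed letter, and a combining mark. The class name CategoryProbe and the helper categoryName are hypothetical, made up for this illustration; only Character.getType and the category constants come from the JDK.

```java
// Sketch only: CategoryProbe is a hypothetical name, not part of XOM or the JDK.
class CategoryProbe {

    // Translates a few of the java.lang.Character category constants into
    // readable names. Anything not listed here is reported as "OTHER".
    static String categoryName(char c) {
        switch (Character.getType(c)) {
            case Character.NON_SPACING_MARK:       return "NON_SPACING_MARK";
            case Character.COMBINING_SPACING_MARK: return "COMBINING_SPACING_MARK";
            case Character.ENCLOSING_MARK:         return "ENCLOSING_MARK";
            case Character.LOWERCASE_LETTER:       return "LOWERCASE_LETTER";
            case Character.UPPERCASE_LETTER:       return "UPPERCASE_LETTER";
            default:                               return "OTHER";
        }
    }

    public static void main(String[] args) {
        // Plain 'e', precomposed e-acute (U+00E9), combining acute (U+0301).
        char[] samples = { 'e', '\u00E9', '\u0301' };
        for (int i = 0; i < samples.length; i++) {
            System.out.println("U+" + Integer.toHexString(samples[i])
                    + " -> " + categoryName(samples[i]));
        }
    }
}
```

Both 'e' and the precomposed U+00E9 come back as LOWERCASE_LETTER, while U+0301 comes back as NON_SPACING_MARK. The category identifies which characters are marks, but it does not distinguish a decomposable letter from a plain one.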
Looking at it, NON_SPACING_MARK, MODIFIER_LETTER, and MODIFIER_SYMBOL
seem like they would cover the composition half of the NFC algorithm.
However, I don't see anything that would let me perform the
decomposition half of NFC, so I may just have to carry around the parts
of the Unicode data file I need after all.
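To make "carrying around the parts I need" concrete, here is a rough sketch of a hand-built decomposition table that uses only classes available in Java 1.2. Everything in it is an assumption for illustration: the class name TinyDecomposer is hypothetical, and the four entries are a tiny sample of the canonical decompositions in the Unicode data file. It handles only a single level of decomposition and does none of the canonical reordering or recomposition that real NFC requires.

```java
import java.util.Arrays;

// Sketch only: TinyDecomposer and its four-entry table are hypothetical.
// A real table would be generated from the Unicode data file.
class TinyDecomposer {

    // Precomposed characters, kept sorted so they can be binary-searched:
    // A-grave, e-acute, n-tilde, u-diaeresis.
    private static final char[] COMPOSED = {
        '\u00C0', '\u00E9', '\u00F1', '\u00FC'
    };

    // Canonical decompositions, parallel to COMPOSED (base letter + mark).
    private static final String[] DECOMPOSED = {
        "A\u0300", "e\u0301", "n\u0303", "u\u0308"
    };

    // Replaces each character found in the table with its decomposition
    // and copies every other character through unchanged.
    static String decompose(String s) {
        StringBuffer out = new StringBuffer(s.length() + 4);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            int index = Arrays.binarySearch(COMPOSED, c);
            if (index >= 0) {
                out.append(DECOMPOSED[index]);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

Two parallel arrays keep the footprint small compared with a Hashtable of wrapper objects, which matters if the whole point is to shrink the jar. Whether a table like this ends up meaningfully smaller than the ICU subset depends on how much of the data file turns out to be needed.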
FYI, this all needs to work in Java 1.2 and later (and ideally in Java
1.1 though I'm willing to compromise on that) so classes and methods
that only show up in 1.4 and later aren't an option. I know there's
normalization code hidden inside the sun classes (I filed an RFE at Sun
to make that public) but I don't really want to depend on that either
since I'm not sure how many VMs have the right classes.
Any suggestions?
--
Elliotte Rusty Harold
elharo@metalab.unc.edu
XML in a Nutshell 3rd Edition Just Published!
http://www.cafeconleche.org/books/xian3/
http://www.amazon.com/exec/obidos/ISBN=0596007647/cafeaulaitA/ref=nosim