From: Elliotte Harold (email@example.com)
Date: Thu Mar 03 2005 - 07:29:13 CST
Currently my Java library (XOM) is dragging along a hefty chunk (344K)
of IBM's open source ICU just to support one rarely invoked method that
converts strings into NFC. I'd like to get rid of that. Given the nature
of my application, eliminating the extra jar file and its size matters
more to me than having the fastest, most intelligent NFC algorithm.
Thus I'm looking at ways to implement NFC that don't require me to drag
around the Unicode data files or substantial chunks thereof. I notice
that java.lang.Character has a getType method that returns the Unicode
general category of each character. This has been built into Java since 1.1.
It lets me tell if a character is one of the following types:
Is this sufficient information to perform NFC normalization? Or is that
a pipe dream and I'm just going to need to drag along the Unicode data
file or part thereof?
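
A quick probe along these lines shows what getType does and does not
distinguish. This is only a sketch: the class and method names
(CategoryProbe, sameCategory, isNonSpacingMark) are made up for
illustration, and it uses nothing beyond java.lang.Character.

```java
// Hypothetical helper, not part of XOM or ICU; names are illustrative only.
final class CategoryProbe {

    // True if getType reports the same general category for both characters.
    static boolean sameCategory(char a, char b) {
        return Character.getType(a) == Character.getType(b);
    }

    // True if the character's general category is Mn (NON_SPACING_MARK).
    static boolean isNonSpacingMark(char c) {
        return Character.getType(c) == Character.NON_SPACING_MARK;
    }

    public static void main(String[] args) {
        // U+0301 COMBINING ACUTE ACCENT and U+0323 COMBINING DOT BELOW
        // are both reported as NON_SPACING_MARK.
        System.out.println(isNonSpacingMark('\u0301'));
        System.out.println(isNonSpacingMark('\u0323'));

        // 'e' and the precomposed U+00E9 are both LOWERCASE_LETTER, so the
        // category alone does not reveal that U+00E9 has a decomposition.
        System.out.println(sameCategory('e', '\u00E9'));
    }
}
```

The general category is a single coarse property. In the Unicode data
file, the canonical combining class and the decomposition mapping are
separate fields, and getType exposes neither of them.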
Looking at it, NON_SPACING_MARK, MODIFIER_LETTER, and MODIFIER_SYMBOL
seem like they would cover the composition half of the NFC algorithm.
However, I don't see anything that would let me perform the
decomposition half of NFC, so I may just have to carry around the parts
of the Unicode data file I need after all.
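
To make "carrying around the parts I need" concrete, here is a minimal
sketch of a hand-carried decomposition table. It is an assumption about
what such a table could look like, not how XOM or ICU does it: the class
name TinyDecompositionTable is hypothetical, only three well-known
canonical decompositions are included, and the canonical reordering and
recomposition steps that real NFC also requires are left out. It sticks
to arrays and StringBuffer so it does not rely on anything newer than
Java 1.1.

```java
// Hypothetical sketch: a tiny, hand-maintained slice of the canonical
// decomposition mappings. A real table would cover far more characters.
final class TinyDecompositionTable {

    // Precomposed characters (illustrative subset only).
    private static final char[] COMPOSED = {
        '\u00C5', // LATIN CAPITAL LETTER A WITH RING ABOVE
        '\u00E9', // LATIN SMALL LETTER E WITH ACUTE
        '\u00F1'  // LATIN SMALL LETTER N WITH TILDE
    };

    // Canonical decomposition of the entry at the same index in COMPOSED.
    private static final String[] DECOMPOSED = {
        "A\u030A", // A + COMBINING RING ABOVE
        "e\u0301", // e + COMBINING ACUTE ACCENT
        "n\u0303"  // n + COMBINING TILDE
    };

    // Replaces each character found in the table with its decomposition;
    // characters not in the table are copied through unchanged.
    static String decompose(String s) {
        StringBuffer out = new StringBuffer(s.length() + 4);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            int index = indexOf(c);
            if (index >= 0) {
                out.append(DECOMPOSED[index]);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    // Linear search is fine for a table this small.
    private static int indexOf(char c) {
        for (int i = 0; i < COMPOSED.length; i++) {
            if (COMPOSED[i] == c) return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        String decomposed = decompose("caf\u00E9");
        System.out.println(decomposed.length()); // one char longer than the input
    }
}
```

Even in this toy form the mapping has to be shipped as data, since
nothing in getType says which letters decompose or into what.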
FYI, this all needs to work in Java 1.2 and later (and ideally in Java
1.1 though I'm willing to compromise on that) so classes and methods
that only show up in 1.4 and later aren't an option. I know there's
normalization code hidden inside the sun classes (I filed an RFC at Sun
to make that public) but I don't really want to depend on that either
since I'm not sure how many VMs have the right classes.
--
Elliotte Rusty Harold firstname.lastname@example.org
XML in a Nutshell 3rd Edition Just Published!
http://www.cafeconleche.org/books/xian3/
http://www.amazon.com/exec/obidos/ISBN=0596007647/cafeaulaitA/ref=nosim