Small Java implementation of NFC

From: Elliotte Harold (elharo@metalab.unc.edu)
Date: Thu Mar 03 2005 - 07:29:13 CST

    Currently my Java library (XOM) is dragging along a hefty chunk (344K)
    of IBM's open source ICU just to support one rarely invoked method that
    converts strings into NFC. I'd like to get rid of that. Given the nature
    of my application, it is more important to me to eliminate the extra jar
    file and its size than to have the fastest, most intelligent NFC
    algorithm.

    Thus I'm looking at ways to implement NFC that don't require me to drag
    around the Unicode data files or substantial chunks thereof. I notice
    that java.lang.Character has a getType method that returns the Unicode
    general category of each character. This has been built into Java since
    1.1. It lets me tell if a character is one of the following types:

    COMBINING_SPACING_MARK
    CONNECTOR_PUNCTUATION
    CONTROL
    CURRENCY_SYMBOL
    DASH_PUNCTUATION
    DECIMAL_DIGIT_NUMBER
    ENCLOSING_MARK
    END_PUNCTUATION
    FINAL_QUOTE_PUNCTUATION
    FORMAT
    INITIAL_QUOTE_PUNCTUATION
    LETTER_NUMBER
    LINE_SEPARATOR
    LOWERCASE_LETTER
    MATH_SYMBOL
    MODIFIER_LETTER
    MODIFIER_SYMBOL
    NON_SPACING_MARK
    OTHER_LETTER
    OTHER_NUMBER
    OTHER_PUNCTUATION
    OTHER_SYMBOL
    PARAGRAPH_SEPARATOR
    PRIVATE_USE
    SPACE_SEPARATOR
    START_PUNCTUATION
    SURROGATE
    TITLECASE_LETTER
    UNASSIGNED
    UPPERCASE_LETTER
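
    To make the list above concrete, here is a minimal sketch of what
    getType reports for a few characters. The class name CategoryProbe and
    the helper isMark are made up for illustration; only
    Character.getType and the Character constants come from the JDK. The
    code sticks to features that should be available in Java 1.1.

```java
// Illustrative sketch: CategoryProbe and isMark are hypothetical names.
// Only java.lang.Character is used, so no extra jar is needed.
class CategoryProbe {

    // Returns true if the character's general category is one of the
    // three mark categories (Mn, Mc, Me).
    static boolean isMark(char c) {
        int type = Character.getType(c);
        return type == Character.NON_SPACING_MARK
            || type == Character.COMBINING_SPACING_MARK
            || type == Character.ENCLOSING_MARK;
    }

    public static void main(String[] args) {
        // 'e', COMBINING ACUTE ACCENT, precomposed e with acute,
        // and MODIFIER LETTER SMALL H.
        char[] samples = { 'e', '\u0301', '\u00E9', '\u02B0' };
        for (int i = 0; i < samples.length; i++) {
            System.out.println("U+" + Integer.toHexString(samples[i]).toUpperCase()
                + " type=" + Character.getType(samples[i])
                + " mark=" + isMark(samples[i]));
        }
    }
}
```

    Note that the precomposed U+00E9 is simply reported as a lowercase
    letter, the same category as plain 'e'; the category alone does not say
    that it is equivalent to 'e' followed by U+0301.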

    Is this sufficient information to perform NFC normalization? Or is that
    a pipe dream and I'm just going to need to drag along the Unicode data
    file or part thereof?

    Looking at it, NON_SPACING_MARK, MODIFIER_LETTER, and MODIFIER_SYMBOL
    seem like they would cover the composition half of the NFC algorithm.
    However, I don't see anything that would let me perform the
    decomposition half of NFC, so I may just have to carry around the parts
    of the Unicode data file I need after all.
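
    As a rough illustration of what "carrying around part of the data"
    might look like, here is a sketch that composes base-plus-mark pairs
    from a tiny hand-picked table. TinyComposer, PAIRS, lookup, and
    composePairs are hypothetical names, and the three table rows are just
    examples. This is emphatically not NFC: it assumes the input is already
    decomposed and in canonical order, and it ignores combining classes,
    composition exclusions, Hangul, and characters outside the BMP. A real
    implementation would presumably need the corresponding fields of the
    Unicode data file.

```java
// Sketch only: a hand-picked pair table, NOT a full NFC implementation.
// All names here are hypothetical; StringBuffer keeps it usable on old VMs.
class TinyComposer {

    // Each row is { base, combining mark, precomposed character }.
    private static final char[][] PAIRS = {
        { 'e', '\u0301', '\u00E9' },  // e + combining acute -> e with acute
        { 'a', '\u0300', '\u00E0' },  // a + combining grave -> a with grave
        { 'n', '\u0303', '\u00F1' },  // n + combining tilde -> n with tilde
    };

    // Returns the precomposed character for (base, mark), or 0 if the
    // pair is not in the table.
    static char lookup(char base, char mark) {
        for (int i = 0; i < PAIRS.length; i++) {
            if (PAIRS[i][0] == base && PAIRS[i][1] == mark) {
                return PAIRS[i][2];
            }
        }
        return 0;
    }

    // Replaces each adjacent (base, mark) pair found in the table with its
    // precomposed form; every other character is copied through unchanged.
    static String composePairs(String s) {
        StringBuffer out = new StringBuffer(s.length());
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (i + 1 < s.length()) {
                char composed = lookup(c, s.charAt(i + 1));
                if (composed != 0) {
                    out.append(composed);
                    i += 2;
                    continue;
                }
            }
            out.append(c);
            i++;
        }
        return out.toString();
    }
}
```

    The point of the sketch is only that the pairing of 'e' and U+0301
    with U+00E9 has to come from a table somewhere; nothing returned by
    getType supplies it.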

    FYI, this all needs to work in Java 1.2 and later (and ideally in Java
    1.1, though I'm willing to compromise on that), so classes and methods
    that only show up in 1.4 and later aren't an option. I know there's
    normalization code hidden inside the sun classes (I filed an RFC at Sun
    to make that public), but I don't really want to depend on that either,
    since I'm not sure how many VMs have the right classes.
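
    For what it's worth, whether a given VM has a particular class can be
    probed at runtime by name, without a compile-time dependency. The
    sketch below assumes the hidden class is called sun.text.Normalizer;
    that name is my assumption and may not hold on every VM, and
    NormalizerProbe and classPresent are hypothetical names. It only checks
    that the class can be loaded, not that it has any particular methods.

```java
// Sketch: probe for a class by name so there is no compile-time
// dependency on it. NormalizerProbe and classPresent are hypothetical.
class NormalizerProbe {

    // Returns true if the named class can be loaded in this VM.
    static boolean classPresent(String className) {
        try {
            Class.forName(className);
            return true;
        } catch (ClassNotFoundException e) {
            return false;          // class is not on this VM
        } catch (LinkageError e) {
            return false;          // class exists but cannot be linked
        } catch (SecurityException e) {
            return false;          // access denied, treat as unavailable
        }
    }

    public static void main(String[] args) {
        // "sun.text.Normalizer" is an assumed name, not confirmed above.
        System.out.println("sun.text.Normalizer present: "
            + classPresent("sun.text.Normalizer"));
    }
}
```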

    Any suggestions?

    -- 
    Elliotte Rusty Harold  elharo@metalab.unc.edu
    XML in a Nutshell 3rd Edition Just Published!
    http://www.cafeconleche.org/books/xian3/
    http://www.amazon.com/exec/obidos/ISBN=0596007647/cafeaulaitA/ref=nosim
    

