Re: Small Java implementation of NFC

From: Markus Scherer (markus.icu@gmail.com)
Date: Mon Mar 07 2005 - 18:56:28 CST

    The easiest may be to trim the ICU4J normalization code.

    Out of the box, the ICU4J normalization is "hefty" because it does a
    number of non-core things, for which it uses other parts of ICU. That
    design is good because it serves many use cases and reuses other parts
    of ICU rather than reinventing the wheel, but it can make the code
    large even if you only care about one piece.

    In particular, ICU can perform normalization not only according to the
    current version of Unicode, but also specifically for Unicode 3.2 (for
    StringPrep/IDNA; exception: this does not undo the normalization
    corrections). This is done by using a UnicodeSet of the characters
    that were unassigned in Unicode 3.2. API-wise, this is a bit
    (UNICODE_3_2) of the options parameter of many functions. There are
    some other bits that result in different sets.
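    The exclusion-set mechanism can be sketched roughly as follows. This
    is only an illustration of the idea, not ICU's actual implementation:
    it uses the JDK's java.text.Normalizer as a stand-in for the ICU
    normalization code, and the class name, method name, and run-splitting
    strategy are all hypothetical.

```java
import java.text.Normalizer;
import java.util.Set;

public class FilteredNfc {
    // Normalize only the spans of text whose code points are NOT in the
    // exclusion set; excluded code points pass through untouched. This
    // mimics "treat characters unassigned in Unicode 3.2 as inert".
    static String normalize(String s, Set<Integer> excluded) {
        StringBuilder out = new StringBuilder();
        StringBuilder run = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (excluded.contains(cp)) {
                // flush the pending run, then copy the excluded code point as-is
                out.append(Normalizer.normalize(run, Normalizer.Form.NFC));
                run.setLength(0);
                out.appendCodePoint(cp);
            } else {
                run.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        out.append(Normalizer.normalize(run, Normalizer.Form.NFC));
        return out.toString();
    }
}
```

    With an empty exclusion set, "e" + U+0301 composes to U+00E9 as usual;
    with U+0301 excluded, the input comes back unchanged.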

    You should be able to do something like this in recent ICU4J code:

    1. In NormalizerImpl.java, remove internalGetNX() and its related
    functions that build these sets. Change getNX() to return null. Change
    nx_contains() to return false.

    2. Provide a dummy UnicodeSet class. You probably need a few functions
    like add() but I believe that Normalizer itself does not need
    UnicodeSet at all except for the getNX() stuff, so the dummy class
    could really just throw away everything.

    3. Also, you probably don't need the functions for testing if strings
    are canonically equivalent. In NormalizerImpl, remove the whole
    section that starts with
    /* compare canonically equivalent ---------------------------------------- */
    up to just before
    * Status of tailored normalization

    More trimming along these lines is possible; you get the idea. Just
    with the above, you should be able to trim it quite far. You should
    not need UCharacter.java or uprops.icu, for example.
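    A rough sketch of the stubs from steps 1 and 2. The names follow the
    ICU4J internals described above, but the bodies here are illustrative
    stand-ins, not the real ICU code:

```java
// Dummy replacement for com.ibm.icu.text.UnicodeSet: the trimmed
// normalizer only needs the type to exist, so every operation is a no-op.
class UnicodeSet {
    public UnicodeSet add(int codePoint) { return this; }      // discard everything
    public boolean contains(int codePoint) { return false; }   // never matches
}

// In the trimmed NormalizerImpl, the exclusion-set hooks collapse to:
class NormalizerImplStubs {
    // was: internalGetNX() and friends, building sets from unorm.icu data
    static UnicodeSet getNX(int options) {
        return null;  // no exclusion set: always normalize everything
    }

    static boolean nx_contains(UnicodeSet nx, int c) {
        return false; // nothing is ever excluded
    }
}
```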

    Also, the current ICU4C build (ICU4C builds the unorm.icu data file)
    lets you build a partial (smaller) unorm.icu file. The gennorm
    generator can omit the data for canonical closure, compatibility
    decomposition, and more.

    I bet it's easier to trim ICU4J normalization than to write NFC code
    from scratch.

    Best regards,
    markus

    PS: The smallest and slowest code for NFC is probably the UAX #15 sample code.
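    As a minimal demonstration of what the NFC transformation produces
    (composing a base letter plus combining mark into precomposed form),
    here is a short sketch. It uses java.text.Normalizer from later JDKs,
    so it only shows the expected behavior, not the trimmed ICU4J code or
    the UAX #15 sample code:

```java
import java.text.Normalizer;

public class NfcDemo {
    public static void main(String[] args) {
        String decomposed = "e\u0301"; // 'e' + COMBINING ACUTE ACCENT
        String nfc = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
        System.out.println(nfc.equals("\u00E9"));                          // true: composed to U+00E9
        System.out.println(Normalizer.isNormalized(nfc, Normalizer.Form.NFC)); // true
    }
}
```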

    On Thu, 03 Mar 2005 08:29:13 -0500, Elliotte Harold
    <elharo@metalab.unc.edu> wrote:
    > Currently my Java library (XOM) is dragging along a hefty chunk (344K)
    > of IBM's open source ICU just to support one rarely invoked method that
    > converts strings into NFC. I'd like to get rid of that. Given the nature
    > of my application it is more important to me to be able to eliminate the
    > extra jar file and its size, than it is to have the fastest, most
    > intelligent NFC algorithm.

    -- 
    Opinions expressed here may not reflect my company's positions unless
    otherwise noted.
    This archive was generated by hypermail 2.1.5 : Mon Mar 07 2005 - 18:57:14 CST