Re: Small Java implementation of NFC

From: Markus Scherer (markus.icu@gmail.com)
Date: Mon Mar 07 2005 - 18:56:28 CST

    The easiest may be to trim the ICU4J normalization code.

    Out of the box, the ICU4J normalization is "hefty" because it does a
    number of non-core things, for which it uses other parts of ICU. That
    design is good because it serves many use cases and reuses other parts
    of ICU rather than reinventing the wheel, but it can make the code
    large even if you only care about one piece.

    In particular, ICU can perform normalization not only according to the
    current version of Unicode, but also specifically for Unicode 3.2 (for
    StringPrep/IDNA; exception: this does not undo the normalization
    corrections). This is done by using a UnicodeSet of the characters
    that were unassigned in Unicode 3.2. API-wise, this is a bit
    (UNICODE_3_2) of the options parameter of many functions. There are
    some other bits that result in different sets.
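    The exclusion-set mechanism can be sketched roughly as follows. This
    is only an illustration of the idea, not ICU's actual implementation:
    it uses the JDK's java.text.Normalizer as a stand-in for the ICU
    normalization code, and the class name, method name, and run-splitting
    strategy are all hypothetical.

```java
import java.text.Normalizer;
import java.util.Set;

public class FilteredNfc {
    // Normalize only the spans of text whose code points are NOT in the
    // exclusion set; excluded code points pass through untouched. This
    // mimics "treat characters unassigned in Unicode 3.2 as inert".
    static String normalize(String s, Set<Integer> excluded) {
        StringBuilder out = new StringBuilder();
        StringBuilder run = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (excluded.contains(cp)) {
                // flush the pending run, then copy the excluded code point as-is
                out.append(Normalizer.normalize(run, Normalizer.Form.NFC));
                run.setLength(0);
                out.appendCodePoint(cp);
            } else {
                run.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        out.append(Normalizer.normalize(run, Normalizer.Form.NFC));
        return out.toString();
    }
}
```

    With an empty exclusion set, "e" + U+0301 composes to U+00E9 as usual;
    with U+0301 excluded, the input comes back unchanged.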

    You should be able to do something like this in recent ICU4J code:

    1. In NormalizerImpl.java, remove internalGetNX() and its related
    functions that build these sets. Change getNX() to return null. Change
    nx_contains() to return false.

    2. Provide a dummy UnicodeSet class. You probably need a few functions
    like add() but I believe that Normalizer itself does not need
    UnicodeSet at all except for the getNX() stuff, so the dummy class
    could really just throw away everything.

    3. Also, you probably don't need the functions for testing if strings
    are canonically equivalent. In NormalizerImpl, remove the whole
    section that starts with
    /* compare canonically equivalent ---------------------------------------- */
    up to just before
    * Status of tailored normalization

    More trimming along these lines is possible; you get the idea. Just
    with the above, you should be able to trim it quite far. You should
    not need UCharacter.java or uprops.icu, for example.
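    A rough sketch of the stubs from steps 1 and 2. The names follow the
    ICU4J internals described above, but the bodies here are illustrative
    stand-ins, not the real ICU code:

```java
// Dummy replacement for com.ibm.icu.text.UnicodeSet: the trimmed
// normalizer only needs the type to exist, so every operation is a no-op.
class UnicodeSet {
    public UnicodeSet add(int codePoint) { return this; }      // discard everything
    public boolean contains(int codePoint) { return false; }   // never matches
}

// In the trimmed NormalizerImpl, the exclusion-set hooks collapse to:
class NormalizerImplStubs {
    // was: internalGetNX() and friends, building sets from unorm.icu data
    static UnicodeSet getNX(int options) {
        return null;  // no exclusion set: always normalize everything
    }

    static boolean nx_contains(UnicodeSet nx, int c) {
        return false; // nothing is ever excluded
    }
}
```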

    Also, the current ICU4C build (ICU4C builds the unorm.icu data file)
    lets you build a partial (smaller) unorm.icu file. The gennorm
    generator can omit the data for canonical closure, compatibility
    decomposition, and more.

    I bet it's easier to trim ICU4J normalization than to write NFC code
    from scratch.

    Best regards,
    markus

    PS: The smallest and slowest code for NFC is probably the UAX #15 sample code.
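    As a minimal demonstration of what the NFC transformation produces
    (composing a base letter plus combining mark into precomposed form),
    here is a short sketch. It uses java.text.Normalizer from later JDKs,
    so it only shows the expected behavior, not the trimmed ICU4J code or
    the UAX #15 sample code:

```java
import java.text.Normalizer;

public class NfcDemo {
    public static void main(String[] args) {
        String decomposed = "e\u0301"; // 'e' + COMBINING ACUTE ACCENT
        String nfc = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
        System.out.println(nfc.equals("\u00E9"));                          // true: composed to U+00E9
        System.out.println(Normalizer.isNormalized(nfc, Normalizer.Form.NFC)); // true
    }
}
```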

    On Thu, 03 Mar 2005 08:29:13 -0500, Elliotte Harold
    <elharo@metalab.unc.edu> wrote:
    > Currently my Java library (XOM) is dragging along a hefty chunk (344K)
    > of IBM's open source ICU just to support one rarely invoked method that
    > converts strings into NFC. I'd like to get rid of that. Given the nature
    > of my application it is more important to me to be able to eliminate the
    > extra jar file and its size, than it is to have the fastest, most
    > intelligent NFC algorithm.

    -- 
    Opinions expressed here may not reflect my company's positions unless
    otherwise noted.
    This archive was generated by hypermail 2.1.5 : Mon Mar 07 2005 - 18:57:14 CST