From: jon@spin.ie
Date: Thu Sep 25 2003 - 06:03:41 EDT
> > Hi,
> > I'm currently experimenting with various trade-offs for Unicode
> > normalisation code. Any comments on these (particularly of the "that's
> > insane, here's why, stop now!" variety) would be welcome.
>
> You might want to look at, if not even use, the ICU open-source
> implementation:
>
> http://oss.software.ibm.com/icu/
> http://oss.software.ibm.com/cvs/icu/~checkout~/icu/source/common/unorm.cpp
I did, but when I started this I was more interested in simply comparing various optimisations, as a study of the related techniques. However, I recently hit a practical need for such code in another task. It's nice that a bunch of my "fun" code can already double as "work" code, but maybe I should just use ICU...
> > The second is an optimisation of both speed and size, with the
> > disadvantage that data cannot be shared between NFC and NFD operations (which
> > is perhaps a reasonable trade in the case of web code which might only need NFC
> > code to be linked). In this version decompositions of stable codepoints are
> > omitted from the decomposition data. For example, following the
> > decomposition <U+0104> -> <U+0041, U+0328> there can be no
> > character that is unblocked from the U+0041 that will combine with it, hence
> > there is no circumstance in which they will not be recombined to U+0104 and
> > hence dropping that decomposition from the data will not affect NFC (the
> > relevant data would still have to be in the composition table, as the sequence
> > <U+0041, U+0328> might occur in the source code).
>
> Sounds possible and clever. As far as I remember, ICU uses the normalization
> quick check flags
> (Unicode properties) to determine much of this, and should achieve the same in
> most cases.
The above would supplement use of quick check - indeed it would be a way of implementing the concept of "stable codepoints" that the UTR suggests using with quick check.
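
To make the U+0104 example concrete, here is a rough Python sketch that probes whether a codepoint looks "stable" in the sense used above. It is not the data-table code discussed in this thread. The helper names (looks_nfc_stable, has_canonical_decomposition) are made up for illustration, and the check is only a heuristic: it tries a single following combining mark, using whatever Unicode version the standard unicodedata module ships with, so it can suggest candidates for pruning but does not prove stability.

```python
import sys
import unicodedata

# Every code point with a non-zero canonical combining class, according to
# the Unicode data bundled with this Python build.
COMBINING_MARKS = [
    chr(cp) for cp in range(sys.maxunicode + 1) if unicodedata.combining(chr(cp))
]


def has_canonical_decomposition(ch):
    """True if ch has a canonical (not compatibility) decomposition mapping."""
    mapping = unicodedata.decomposition(ch)
    return bool(mapping) and not mapping.startswith("<")


def looks_nfc_stable(ch):
    """Heuristic (assumption: one trailing mark is a good enough probe).

    Returns True when ch has a canonical decomposition, is itself in NFC,
    and still comes out as the first character of NFC(ch + mark) for every
    single combining mark. Such a codepoint is a candidate for leaving out
    of an NFC-only decomposition table.
    """
    if not has_canonical_decomposition(ch):
        return False
    if unicodedata.normalize("NFC", ch) != ch:
        return False  # singletons and composition exclusions never survive
    return all(
        unicodedata.normalize("NFC", ch + mark).startswith(ch)
        for mark in COMBINING_MARKS
    )


if __name__ == "__main__":
    for ch in ("\u0104", "\u00c1"):
        print(f"U+{ord(ch):04X}", looks_nfc_stable(ch))
```

U+0104 passes because nothing that sorts before or alongside U+0328 combines with U+0041. U+00C1 fails: in <U+00C1, U+0328> the ogonek reorders ahead of the acute, so NFC gives <U+0104, U+0301> and the decomposition of U+00C1 has to stay in the table.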