Re: Unicode Normalisation Optimisation Experiments

From: jon@spin.ie
Date: Thu Sep 25 2003 - 06:03:41 EDT


    > > Hi,
    > > I'm currently experimenting with various trade-offs for Unicode
    > normalisation code. Any comments on these (particularly of the "that's
    > insane, here's why, stop now!" variety) would be welcome.
    >
    > You might want to look at, if not even use, the ICU open-source
    > implementation:
    >
    > http://oss.software.ibm.com/icu/
    > http://oss.software.ibm.com/cvs/icu/~checkout~/icu/source/common/unorm.cpp

    I did, but when I started this I was more interested in comparing various optimisations as a study of the related techniques. However, I recently hit a practical need for such code in another task. It's nice that a bunch of my "fun" code is already done and can serve as "work" code, but maybe I should just use ICU...

    > > The second is an optimisation of both speed and size, with the
    > disadvantage that data cannot be shared between NFC and NFD operations (which
    > is perhaps a reasonable trade in the case of web code which might only need NFC
    > code to be linked). In this version decompositions of stable codepoints are
    > omitted from the decomposition data. For example since following the
    > decomposition <U+0104> -> <U+0041, U+0328> there can be no
    > character that is unblocked from the U+0041 that will combine with it, hence
    > there is no circumstance in which they will not be recombined to U+0104 and
    > hence dropping that decomposition from the data will not affect NFC (the
    > relevant data would still have to be in the composition table, as the sequence
    > <U+0041, U+0328> might occur in the source code).
    >
    > Sounds possible and clever. As far as I remember, ICU uses the normalization
    > quick check flags
    > (Unicode properties) to determine much of this, and should achieve the same in
    > most cases.

    The above would supplement use of quick check - indeed it would be a way of implementing the concept of "stable codepoints" that the UTR suggests using with quick check.
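As a rough illustration of the idea, the following Python sketch spot-checks whether a precomposed character behaves as "stable" under NFC when a single combining mark follows it. This is only a heuristic of my own framing, not the definition from the UTR and not how ICU does it: the helper names `nfc_stable_marks` and `survives_single_mark` are hypothetical, and passing the check for a sample of marks proves nothing about longer sequences. It relies only on the standard `unicodedata` module.

```python
import sys
import unicodedata


def nfc_stable_marks():
    """Return combining marks (ccc != 0) that NFC leaves unchanged on their own.

    Marks such as U+0340, which NFC replaces even in isolation, are skipped
    so they do not mask the behaviour of the base character under test.
    """
    return [
        chr(cp)
        for cp in range(sys.maxunicode + 1)
        if unicodedata.combining(chr(cp)) != 0
        and unicodedata.normalize("NFC", chr(cp)) == chr(cp)
    ]


def survives_single_mark(ch, marks):
    """Heuristic (an assumption, not a proof of stability).

    True if NFC leaves ch + mark unchanged for every mark given, i.e. no
    mark composes with ch or with part of its canonical decomposition.
    """
    return all(
        unicodedata.normalize("NFC", ch + m) == ch + m
        for m in marks
    )


# A small hand-picked sample: acute (ccc 230), dot below (ccc 220),
# tilde overlay (ccc 1).
sample = ["\u0301", "\u0323", "\u0334"]

# U+0104 (A with ogonek): nothing in the sample combines with it or with U+0041.
print(survives_single_mark("\u0104", sample))   # True

# U+00E2 (a with circumflex): a following dot below reorders in front of the
# circumflex and the result recomposes to U+1EAD, so U+00E2 must stay
# decomposable in the data.
print(survives_single_mark("\u00e2", sample))   # False
```

Passing `nfc_stable_marks()` instead of `sample` widens the check to every combining mark the local Unicode database knows about, at the cost of a scan over the whole code space.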


