Re: Unicode Normalisation Optimisation Experiments

From: Markus Scherer (markus.scherer@jtcsv.com)
Date: Wed Sep 24 2003 - 18:55:18 EDT


    Jon Hanna wrote:
    > Hi,
    > I'm currently experimenting with various trade-offs for Unicode normalisation code. Any comments on these (particularly of the "that's insane, here's why, stop now!" variety) would be welcome.

    You might want to look at, if not even use, the ICU open-source implementation:

    http://oss.software.ibm.com/icu/
    http://oss.software.ibm.com/cvs/icu/~checkout~/icu/source/common/unorm.cpp

    > The first is an optimisation of speed over size. Rather than perform the decomposition as a recursive operation, the necessary data is stored so that it can be done in a single pass. ...

    I believe that this is a very common technique; it is used in ICU.
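
    For illustration only - this is not ICU's code or data layout, just a minimal sketch of the idea
    in Python with invented table names - storing the fully recursed decompositions at
    data-generation time makes run-time decomposition a single lookup:

        # Sketch: recursive decomposition vs. a pre-recursed table.
        # RAW_DECOMP / FULL_DECOMP are invented names; the mappings themselves
        # are real canonical decompositions from UnicodeData.txt.
        RAW_DECOMP = {
            '\u1E14': '\u0112\u0300',  # E WITH MACRON AND GRAVE -> E WITH MACRON + grave
            '\u0112': '\u0045\u0304',  # E WITH MACRON           -> E + macron
        }

        def decompose_recursive(ch):
            """Re-apply the one-level table until nothing maps further."""
            mapping = RAW_DECOMP.get(ch)
            if mapping is None:
                return ch
            return ''.join(decompose_recursive(c) for c in mapping)

        # Built once when the data files are generated, not at run time.
        FULL_DECOMP = {ch: decompose_recursive(ch) for ch in RAW_DECOMP}

        def decompose_fast(ch):
            """Single pass at run time: one lookup, no recursion."""
            return FULL_DECOMP.get(ch, ch)

        assert decompose_fast('\u1E14') == '\u0045\u0304\u0300'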

    > The second is an optimisation of both speed and size, with the disadvantage that data cannot be shared between NFC and NFD operations (which is perhaps a reasonable trade in the case of web code which might only need NFC code to be linked). In this version decompositions of stable codepoints are omitted from the decomposition data. For example, since following the decomposition <U+0104> -> <U+0041, U+0328> there can be no character unblocked from the U+0041 that will combine with it, there is no circumstance in which they will not be recombined to U+0104, and hence dropping that decomposition from the data will not affect NFC (the relevant data would still have to be in the composition table, as the sequence <U+0041, U+0328> might occur in the source text).

    Sounds possible and clever. As far as I remember, ICU uses the normalization quick-check flags
    (the Unicode NFC/NFD_Quick_Check properties) to determine much of this, and should achieve the
    same effect in most cases.
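
    To make the quick check concrete, here is a minimal sketch of the UAX #15 quick-check loop in
    Python. The NFC_QC dictionary is a tiny hand-copied stand-in for the real NFC_Quick_Check
    property data, and the function name is invented; this is not ICU's implementation:

        import unicodedata

        # Stand-in for the NFC_Quick_Check property; the default value is "Yes".
        NFC_QC = {
            '\u0300': 'Maybe',  # COMBINING GRAVE ACCENT: may compose with a preceding base
            '\u0328': 'Maybe',  # COMBINING OGONEK: may compose, e.g. A + ogonek -> U+0104
            '\u2126': 'No',     # OHM SIGN: singleton decomposition, never appears in NFC
        }

        def nfc_quick_check(s):
            """Return 'Yes', 'No' or 'Maybe' per the UAX #15 quick-check algorithm.
            A 'Yes' span is already in NFC and can be copied through untouched."""
            last_ccc = 0
            result = 'Yes'
            for ch in s:
                ccc = unicodedata.combining(ch)
                if ccc != 0 and last_ccc > ccc:
                    return 'No'              # combining marks out of canonical order
                qc = NFC_QC.get(ch, 'Yes')
                if qc == 'No':
                    return 'No'
                if qc == 'Maybe':
                    result = 'Maybe'
                last_ccc = ccc
            return result

        print(nfc_quick_check('\u0104'))   # 'Yes'   - stable precomposed character
        print(nfc_quick_check('A\u0328'))  # 'Maybe' - might recompose to U+0104
        print(nfc_quick_check('\u2126'))   # 'No'    - must become U+03A9 in NFC

    The 'Maybe' cases still need the full composition logic, which is why the composition table must
    keep the <U+0041, U+0328> -> U+0104 mapping even if the decomposition data drops it.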

    markus


