Unicode Normalisation Optimisation Experiments

From: Jon Hanna (jon@spin.ie)
Date: Wed Sep 24 2003 - 10:09:50 EDT


    I'm currently experimenting with various trade-offs for Unicode normalisation code. Any comments on these (particularly of the "that's insane, here's why, stop now!" variety) would be welcome.

    The first is an optimisation of speed over size. Rather than performing the decomposition as a recursive operation, the data needed to do it in a single pass is stored. For example, rather than computing <U+212B> -> <U+00C5> -> <U+0041, U+030A> recursively, one can store the data to compute <U+212B> -> <U+0041, U+030A> directly. This reduces the amount of work needed to decompose each character, and further benefits from the fact that if there are no trailing combining characters (that is, if the next character is a starter) then no re-ordering is required.
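
    As a rough illustration of the first idea, the sketch below flattens the recursive canonical mappings into a table of full decompositions ahead of time, so that each character needs only one lookup. It uses Python's standard unicodedata module as the data source; the names canonical_mapping, full_decomposition, FLAT and decompose_single_pass are invented for this example, the table only covers an arbitrary range of code points, and canonical re-ordering and Hangul are deliberately left out.

```python
import unicodedata


def canonical_mapping(cp):
    """One-step canonical decomposition of cp as a tuple, or () if none."""
    raw = unicodedata.decomposition(chr(cp))
    if not raw or raw.startswith("<"):  # no mapping, or a compatibility one
        return ()
    return tuple(int(h, 16) for h in raw.split())


def full_decomposition(cp):
    """Expand cp recursively until no canonical mapping remains."""
    step = canonical_mapping(cp)
    if not step:
        return (cp,)
    out = []
    for c in step:
        out.extend(full_decomposition(c))
    return tuple(out)


# Built once, ahead of time. The range is an assumption made for this
# sketch; a real table would cover every code point with a mapping.
FLAT = {
    cp: full_decomposition(cp)
    for cp in range(0x00A0, 0x3000)
    if canonical_mapping(cp)
}


def decompose_single_pass(text):
    """Decompose with one table lookup per character.

    Canonical re-ordering of combining marks is NOT performed here.
    """
    out = []
    for ch in text:
        out.extend(FLAT.get(ord(ch), (ord(ch),)))
    return "".join(map(chr, out))
```

    With this table, FLAT[0x212B] holds (0x0041, 0x030A) directly, so the intermediate step through U+00C5 never happens at run time. The cost is a larger table, since the expansion of U+00C5 is stored twice.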

    The second is an optimisation of both speed and size, with the disadvantage that data cannot be shared between NFC and NFD operations (perhaps a reasonable trade-off for web code, which might only need NFC code to be linked). In this version, decompositions of stable code points are omitted from the decomposition data. For example, following the decomposition <U+0104> -> <U+0041, U+0328>, no character that is unblocked from the U+0041 can combine with it, so there is no circumstance in which the pair will not be recombined to U+0104. Dropping that decomposition from the data therefore does not affect NFC (the relevant data would still have to be in the composition table, as the sequence <U+0041, U+0328> might occur in the source text).
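
    One possible way to look for such stable code points is sketched below, again using Python's unicodedata module. The function name is_stable_candidate and the test it applies are my own reading of the argument above, not a proven criterion: a code point is accepted only if it decomposes to an undecomposable starter plus one combining mark, the pair recomposes under NFC, and no mark of lower combining class (which would be re-ordered in front) composes with that starter. The check is meant to be conservative, and anything it accepts should still be verified against the full normalisation test data.

```python
import sys
import unicodedata

# Every code point with a non-zero canonical combining class.
MARKS = [
    chr(cp)
    for cp in range(sys.maxunicode + 1)
    if unicodedata.combining(chr(cp))
]


def is_stable_candidate(cp):
    """Heuristic: could cp's decomposition be dropped from NFC-only data?"""
    raw = unicodedata.decomposition(chr(cp))
    if not raw or raw.startswith("<"):
        return False  # no canonical decomposition at all
    parts = [chr(int(h, 16)) for h in raw.split()]
    if len(parts) != 2:
        return False  # singletons never recompose
    starter, mark = parts
    mark_ccc = unicodedata.combining(mark)
    if unicodedata.combining(starter) != 0 or mark_ccc == 0:
        return False
    if unicodedata.decomposition(starter):
        return False  # conservative: the starter must not decompose further
    if unicodedata.normalize("NFC", starter + mark) != chr(cp):
        return False  # e.g. composition exclusions
    # A mark with a lower combining class sorts before `mark`; if it
    # composes with the starter, cp would not be rebuilt unchanged.
    for other in MARKS:
        if unicodedata.combining(other) < mark_ccc:
            if len(unicodedata.normalize("NFC", starter + other)) == 1:
                return False
    return True
```

    Under this test U+0104 is accepted, while U+00C2 is rejected: U+00C2 followed by U+0323 must normalise to U+1EAC, which requires decomposing U+00C2 so that the dot below can be re-ordered in front of the circumflex.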
