Re: ISO 8859-11 (Thai) cross-mapping table

From: John Cowan (
Date: Tue Oct 08 2002 - 12:35:08 EDT

  • Next message: Patrick Andries: "Re: InDesign"

    Mark Davis scripsit:

    > We use this structure in ICU (in UnicodeSet). For a high-level explanation,
    > see my site (, "Bits of Unicode".

    Inversion lists, yes. (Different terminology, same thing).

    > As to the binary search, we have used in various contexts before a
    > "completely unrolled" binary loop, following John Bentley's formulation.

    Ah, that's a different kind. I was talking about recursive unrolling, with
    the content of the data structure appearing as constants in the code.
    Naturally, this is barbarian-style programming, should really be done with a
    program generator, and if not, is suitable only for circumstances where the
    values never change. But consider the inversion list for Latin letters:

            0041 005B 0222 0234
            0061 007B 1E00 1E9C
            00C0 00D7 1EA0 1EFA
            00D8 00F7 FF21 FF3B
            00F8 0220 FF41 FF5B

    you can unwrap the binary search as:

            if (u < 0x0220) {
                if (u < 0x00C0) {
                    if (u < 0x0061) {
                        if (u < 0x005B) return !(u < 0x0041);
                        else return false;
                    else return u < 0x007B;
                else {
                    if (u < 0x00F7) {
                        if (u < 0x00D8) return u < 0x00D7;
                        else return false;
                    else return !(u < 0x00F8);
            else { ... # and so on for the other half

    If the list is too long, this code also suffers because it overflows the
    instruction cache.

    Said Agatha Christie / To E. Philips Oppenheim	John Cowan
    "Who is this Hemingway? / Who is this Proust?
    Who is this Vladimir / Whatchamacallum,
    This neopostrealist / Rabble?" she groused.
    	--author unknown to me; any suggestions?

    This archive was generated by hypermail 2.1.5 : Tue Oct 08 2002 - 13:32:20 EDT