Re: ISO 8859-11 (Thai) cross-mapping table

From: Mark Davis (mark.davis@jtcsv.com)
Date: Tue Oct 08 2002 - 11:59:38 EDT

  • Next message: John Cowan: "Re: ISO 8859-11 (Thai) cross-mapping table"

    We use this structure in ICU (in UnicodeSet). For a high-level explanation,
    see my site (www.macchiato.com), "Bits of Unicode".

    As to the binary search, we have used in various contexts before a
    "completely unrolled" binary loop, following John Bentley's formulation. It
    is quite an interesting structure. Here is a Java version: it is about 40%
    faster than the straightforward binary search (this is not a production
    version).

        /**
         * @return greatest index such that data[index] <= searchValue
         * If there is no such index (e.g. searchValue < data[0]), then -1 is
    returned
         */

     public int highestIndexLEQ(int searchValue) {

      if (!isValid) validate();
      int temp;

      // set up initial range to search. Each subrange is a power of two in
    length
      int high = searchValue < data[topOfLow] ? topOfLow : topOfHigh;

      // Completely unrolled binary search, following "Programming Pearls"
      // Each case deliberately falls through to the next
      // Logically, data[-1] < all_search_values && data[count] >
    all_search_values
      // although the values -1 and count are never actually touched.

      // The bounds at each point are low & high,
      // where low == high - delta*2
      // so high - delta is the midpoint

      // The invariant AFTER each line is that data[low] < searchValue <=
    data[high]

      switch (power) {
      //case 31: if (searchValue < data[temp = high-0x40000000]) high = temp; //
    no unsigned int in Java
      case 30: if (searchValue < data[temp = high-0x20000000]) high = temp;
      case 29: if (searchValue < data[temp = high-0x10000000]) high = temp;

      case 28: if (searchValue < data[temp = high- 0x8000000]) high = temp;
      case 27: if (searchValue < data[temp = high- 0x4000000]) high = temp;
      case 26: if (searchValue < data[temp = high- 0x2000000]) high = temp;
      case 25: if (searchValue < data[temp = high- 0x1000000]) high = temp;

      case 24: if (searchValue < data[temp = high- 0x800000]) high = temp;
      case 23: if (searchValue < data[temp = high- 0x400000]) high = temp;
      case 22: if (searchValue < data[temp = high- 0x200000]) high = temp;
      case 21: if (searchValue < data[temp = high- 0x100000]) high = temp;

      case 20: if (searchValue < data[temp = high- 0x80000]) high = temp;
      case 19: if (searchValue < data[temp = high- 0x40000]) high = temp;
      case 18: if (searchValue < data[temp = high- 0x20000]) high = temp;
      case 17: if (searchValue < data[temp = high- 0x10000]) high = temp;

      case 16: if (searchValue < data[temp = high- 0x8000]) high = temp;
      case 15: if (searchValue < data[temp = high- 0x4000]) high = temp;
      case 14: if (searchValue < data[temp = high- 0x2000]) high = temp;
      case 13: if (searchValue < data[temp = high- 0x1000]) high = temp;

      case 12: if (searchValue < data[temp = high- 0x800]) high = temp;
      case 11: if (searchValue < data[temp = high- 0x400]) high = temp;
      case 10: if (searchValue < data[temp = high- 0x200]) high = temp;
      case 9: if (searchValue < data[temp = high- 0x100]) high = temp;

      case 8: if (searchValue < data[temp = high- 0x80]) high = temp;
      case 7: if (searchValue < data[temp = high- 0x40]) high = temp;
      case 6: if (searchValue < data[temp = high- 0x20]) high = temp;
      case 5: if (searchValue < data[temp = high- 0x10]) high = temp;

      case 4: if (searchValue < data[temp = high- 0x8]) high = temp;
      case 3: if (searchValue < data[temp = high- 0x4]) high = temp;
      case 2: if (searchValue < data[temp = high- 0x2]) high = temp;
      case 1: if (searchValue < data[temp = high- 0x1]) high = temp;
      }
      if (high == topOfHigh && searchValue >= data[high]) return high;
      return high-1;
     }

     // NOTE: on some machines the above may not be optimal, if the size of the
    function
     // forces code out of the cache. For that case, it would be better for
    program in a loop, like the following

     public int highestIndexLEQ2(int searchValue) {

      if (!isValid) validate();
      int temp;
      int high = searchValue < data[topOfLow] ? topOfLow : topOfHigh;
      for (int delta = deltaStart; delta != 0; delta >>= 1) {
                if (searchValue < data[temp = high-delta]) high = temp;
            }
      if (high == topOfHigh && searchValue >= data[high]) return high;
      return high-1;
     }

     // ================ Privates ================

     // data

     int data[];
     int count;

        // validate internal parameters

     private void validate() {
         if (count <= 1) throw new IllegalArgumentException("Array must have at
    least 2 elements");

      // find greatest power of 2 less than or equal to count
      for (power = exp2.length-1; power > 0 && exp2[power] > count; power--) {}

      // determine the starting points
      topOfLow = exp2[power] - 1;
      topOfHigh = count - 1;
      deltaStart = exp2[power-1];
      isValid = true;
     }

     private boolean isValid = false;
     private int topOfLow;
     private int topOfHigh;
     private int power;
     private int deltaStart;

     private static final int exp2[] = {
         0x1, 0x2, 0x4, 0x8,
         0x10, 0x20, 0x40, 0x80,
         0x100, 0x200, 0x400, 0x800,
         0x1000, 0x2000, 0x4000, 0x8000,
         0x10000, 0x20000, 0x40000, 0x80000,
         0x100000, 0x200000, 0x400000, 0x800000,
         0x1000000, 0x2000000, 0x4000000, 0x8000000,
         0x10000000, 0x20000000 // , 0x40000000 // no unsigned int in Java
     };

    Mark
    __________________________________
    http://www.macchiato.com
    ► “Eppur si muove” ◄

    ----- Original Message -----
    From: "Elliotte Rusty Harold" <elharo@metalab.unc.edu>
    To: "John Cowan" <jcowan@reutershealth.com>
    Cc: <unicode@unicode.org>; <xom-interest@lists.ibiblio.org>
    Sent: Tuesday, October 08, 2002 06:25
    Subject: Re: ISO 8859-11 (Thai) cross-mapping table

    > At 8:44 AM -0400 10/8/02, John Cowan wrote:
    >
    >
    > >The underlying data structure here is called a "range table", and is
    > >a list of ranges in codepoint order, expressed thus:
    > >
    > > start of first range
    > > end of first range + 1
    > > start of second range
    > > end of second range + 1
    > >
    > >etc. etc. What you are doing is equivalent to a linear search over
    > >this range followed by loop unrolling. However, you can do better,
    > >especially in complex cases, with a *binary* search over the range
    > >followed by loop unrolling. The trick here is that if the binary
    > >search returns an even value, it succeeds; an odd value fails.
    > >
    >
    > Interesting. Do you have any references on that I can explore
    > further? A quick google search didn't turn up anything relevant. I'm
    > curious to see how the algorithm actually works.
    > --
    >
    > +-----------------------+------------------------+-------------------+
    > | Elliotte Rusty Harold | elharo@metalab.unc.edu | Writer/Programmer |
    > +-----------------------+------------------------+-------------------+
    > | XML in a Nutshell, 2nd Edition (O'Reilly, 2002) |
    > | http://www.cafeconleche.org/books/xian2/ |
    > | http://www.amazon.com/exec/obidos/ISBN%3D0596002920/cafeaulaitA/ |
    > +----------------------------------+---------------------------------+
    > | Read Cafe au Lait for Java news: http://www.cafeaulait.org/ |
    > | Read Cafe con Leche for XML news: http://www.cafeconleche.org/ |
    > +----------------------------------+---------------------------------+
    >
    >



    This archive was generated by hypermail 2.1.5 : Tue Oct 08 2002 - 12:45:37 EDT