RE: Public Review Issues update: UAX #31

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Tue Sep 25 2007 - 19:55:54 CDT

  • Next message: John Hudson: "Re: Needs help identifying script"

    They could have been defined in Table 4, but there was no need to exclude
    them in this same table from the definition of the “Sep” property. The
    proposed change in the SB* rules where “Sep” is referenced just make these
    rules unnecessarily longer and more difficult to read.

     

    May be this change is justified by the need to have a single property value
    for the same character, so that the implementation of table-based tailored
    breakers will be simpler: it will just be enough to define a single map from
    characters to their Sentence_Break property.

     

    So this means that CR and LF characters are assigned now a single CR and LF
    Sentence_Break property value, instead of having subclasses in the “Sep”
    class. From what you say, I see now the interest of the change: the table 4
    provides now a complete partition of the UCS into separate classes, each
    class being defined by the single property value mapped from each character.

     

    But if you wanted to make implementations simpler, it should be noted that
    some classes are unnecessarily split, despite they are currently assigned
    distinct property values. This is visible when you look at the chart: there
    are identical rows, and the associated columns also have identical contents.
    The charts could be simpler by merging rows and columns that have identical
    content, reducing the number of cases to consider within an implementation
    (and possibly the number of bits needed to encode these property values in
    the compiled map). This also means that the current set of rules could be
    simplified.

    Regarding the implementation, I tend to think that one way to compact the
    compiled map is to reorder the rows and columns in the chart to produce a
    triangular matrix: the result in the code implementing the algorithm is that
    the code can be written very efficiently as “if… then… elseif… then…
    elseif…then… elseif… then” without embedding multiple ifs, or if the “then…”
    actions are not simple branches, using binary lookup by grouping the cases
    according to their frequency (so that the minimum number of tests will be
    computed at runtime for the most frequent cases, building the weighted
    binary decision tree in a way similar to the Huffman compression of
    characters into variable-width bitfields).

     

    As the result of read from cells in the matrix is just “break here” or “no
    break here”, this can finally turn into a simple boolean expression, fully
    factorized producing the minimum number of branches in the compiled code,
    and the fastest code (thanks to branch prediction), without even using a
    bidimensional table lookup (which is not very efficient due to the needed
    computing of cell addresses, to the extra indirection, and inefficient use
    of the data cache for locality in processor).

     

    I have already used this approach (transforming complex multidimensional
    rules into a triangular matrix (or triangular hypermatrix if there are more
    than 2 dimensions for the lookup) then transformed into a single Boolean
    expression) and this really enhanced a lot the speed of the code at runtime
    (less cache misses, better branch prediction that does not depend on runtime
    profiling but is correct even at the beginning of evaluation, and finally a
    better throughput) when working with very large volumes of data.

     

      _____

    De : mark.edward.davis@gmail.com [mailto:mark.edward.davis@gmail.com] De la
    part de Mark Davis
    Envoyé : mercredi 26 septembre 2007 01:22
    À : verdy_p@wanadoo.fr
    Cc : Rick McGowan; unicode@unicode.org
    Objet : Re: Public Review Issues update: UAX #31

     

    Someone complained that while CR and LF are used in the rules, they were not
    defined in the table of values for Sentence Break. So the UTC decided to
    make this change for consistency. It doesn't make any change in the actual
    outcome. (Personally, I don't think it was of much value, but don't feel
    that strongly against it either.)

    Mark



    This archive was generated by hypermail 2.1.5 : Tue Sep 25 2007 - 19:57:06 CDT