L2/04-421

Re:

Issues for minimal match

From:

Mark Davis, Jianping Yang, Claire Ho

Date:

2004-11-18

Issue

During implementation of linguistic searching and replacing algorithm, we just encountered some issues with minimal match that is defined by DS4 in the specification of UTS #10 Unicode Collation Algorithm. Definition DS4 (Informative) defines a minimal match for a collation with ignorable characters:

DS4. The match is minimal if for all positive i and j, there is no match at Q[s+i,e-j]. In such a case, we also say that P match at Q[s,e].

       By using minimal matches, the issue with ignorables is avoided.

By this minimal matching definition, it can give a deterministic match result, but this is not the only way to avoid the issue with ignorable characters for the match. Another way to avoid the issue is to use greedy match that matches to the maximal substring that has the same collation keys as the ones for the pattern. The maximal match would have benefits for certain operations as below:

. If P ≡ Q, then the maximal match for P is Q. This will keep the semantics consistent.

. The maximal match for Q will give the first character position that has the match.

. Used with minimal match, it can determine how many characters in the match string are leading and trailing ignorable characters.

 

Another issue with the minimal match is the performance to handle the leading and trailing ignorables for the pattern. For one time match, a performance gain can be realized if we can avoid the lookup of collation key for a character, so most optimized collation algorithms implemented in commercial software would do the comparison in character binary encoding value first and if they are equal, the comparison can go to next character provided there is good handling for the ending of the string. But with the minimal match definition, we cannot do this binary comparison first as even if they are equal, we have to check if the character in the match pattern is ignorable as we need to find the right beginning and ending position for the minimal match definition. There is a way to improve minimal match performance to preprocess the match pattern to remove the leading and trailing ignorable characters, so that the match can be done in the way that it can apply binary match first. But this preprocessing still needs to look up the collation table for leading and trailing characters. For the character set that its string cannot be scanned backward, this preprocessing still needs to look up the collation table for every character in the pattern. This would force the match algorithm to do more collation table lookup which could be several times slower down compared to the one without the collation table lookup for the worst case.

We need a new definition that includes the binary matched leading or trailing ignorables into the match result in addition to the minimal match. This new definition is termed medium match. The medium match would have following benefits:

. If P = Q, then the medium match for P is Q.

. It will have higher match performance that can apply binary match first without need to do the preprocessing for the match pattern.

Recommendations

We would recommend to update the minimal match definition to make it not as the only match and add two new definitions for maximal and medium match, so that the application can pick up one that most meet its need:

DS4. The match is minimal if for all positive i and j, there is no match at Q[s+i,e-j]. In such a case, we say that P minimal matches at Q[s,e].

       By using minimal matches, the issue with ignorables is avoided.

 

DS8. The match is maximal if for all positive i and j, there is no match at Q[s-i,e+j]. In such a case, we say that P maximal matches at Q[s,e].

       By using maximal matches, the issue with ignorables is also avoided.

DS9. The match is medium [Ed note: we may want a different term for this] when it contains the minimal match, and is extended beyond whenever there is a binary match between the extra characters in pattern and target. [Ed note: we can make this language more exact, but the example below captures the meaning.]
    By using medium matches, it can give a deterministic result with a higher performance in some implementations

As an example, consider the pattern "*!abc!*" and the target text "def$%!Abc%!$ghi", and the collation strength is set to ignore punctuation and case are ignored. Then the minimal match is "Abc", the maximal match is "$%!Abc%!$" and the medium match is "!Abc".