L2/05-028

Further Collation Items

Mark Davis, 2005-01-24

A. Other changes for Thai, Lao

The following passages were missed when the adjustments for Thai and Lao were made, and should also be deleted:

3.1.3 Rearrangement

Certain characters are not coded in logical order, such as the Thai vowels เ through ไ and the Lao vowels ເ through ໄ (this list is indicated by the Logical_Order_Exception property). For collation, they are rearranged by swapping with the following character before further processing, since logically they belong afterwards. For example, here is a string processed by rearrangement:

input string: 0E01 0E40 0E02 0E03
normalized string: 0E01 0E02 0E40 0E03

In Section 8, Searching and Matching (Informative):

  1. Certain Thai and Lao vowels are swapped with the following character. For example, the text string “เข” (...\u0E40\u0E02...) is modified internally in collation to “ขเ” (...\u0E02\u0E40...). This may mean that a string logically matches a discontiguous section of another string. If, however, the vowels are considered to be part of a grapheme cluster, then this situation is handled by the "whole grapheme clusters only" option.
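For reference, the rearrangement step described by the passages above can be sketched as follows. This is a minimal illustration in Python, not text from the specification: the function names are hypothetical, and the Logical_Order_Exception set is hard-coded to the Thai and Lao vowel ranges named above, since Python's unicodedata module does not expose that property.

```python
def is_logical_order_exception(ch: str) -> bool:
    """Assumption: only the Thai (0E40..0E44) and Lao (0EC0..0EC4) vowels."""
    cp = ord(ch)
    return 0x0E40 <= cp <= 0x0E44 or 0x0EC0 <= cp <= 0x0EC4


def rearrange(s: str) -> str:
    """Swap each Logical_Order_Exception character with the character after it."""
    chars = list(s)
    i = 0
    while i < len(chars) - 1:
        if is_logical_order_exception(chars[i]):
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip past the pair that was just swapped
        else:
            i += 1
    return "".join(chars)


def hex_codes(s: str) -> str:
    """Format a string as space-separated four-digit code points."""
    return " ".join(f"{ord(ch):04X}" for ch in s)


print(hex_codes(rearrange("\u0E01\u0E40\u0E02\u0E03")))
```

Applied to the input string 0E01 0E40 0E02 0E03, the sketch yields the sequence shown in the example from 3.1.3.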

B. Matching Type Interactions.

The interactions of other conditions with the matching types (minimal, maximal, medial) need to be clarified. Consider the following.

            Value           Notes
Pattern:    abc
Strength:   primary         thus ignoring combining marks, punctuation
Text:       abc¸-°d         two combining marks, cedilla and ring
Matches:    |abc|¸|-|°|d    four possible endpoints, indicated by |

When an additional condition is set on the match, the types (minimal, maximal, medial) are determined only from the matches that meet that condition. For example, if the condition is Whole Grapheme, then the matches are restricted to "abc¸|-°|d", discarding the endpoints that would not fall on a grapheme cluster boundary. The minimal match would then be "abc¸|-°d".
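The example can be worked through in code. The following is a rough Python sketch under stated assumptions: primary_key is a toy stand-in for a primary-strength sort key (it simply drops marks and punctuation), on_grapheme_boundary is a simplified approximation of the UAX #29 rule (an offset is rejected only if the next character is a mark), and all function names are hypothetical. The text uses the combining cedilla (U+0327) and combining ring above (U+030A).

```python
import unicodedata


def primary_key(s: str) -> str:
    """Toy primary-strength key: ignore combining marks and punctuation."""
    return "".join(
        ch for ch in s
        if not unicodedata.category(ch).startswith(("M", "P"))
    )


def end_offsets(pattern: str, text: str, start: int = 0) -> list[int]:
    """All end offsets e such that text[start:e] has the same key as pattern."""
    target = primary_key(pattern)
    return [
        e for e in range(start, len(text) + 1)
        if primary_key(text[start:e]) == target
    ]


def on_grapheme_boundary(text: str, offset: int) -> bool:
    """Simplified boundary test: no break immediately before a mark."""
    if offset == 0 or offset == len(text):
        return True
    return not unicodedata.category(text[offset]).startswith("M")


TEXT = "abc\u0327-\u030Ad"

ends = end_offsets("abc", TEXT)                                # unrestricted
whole = [e for e in ends if on_grapheme_boundary(TEXT, e)]     # Whole Grapheme
minimal, maximal = min(whole), max(whole)
print(ends, whole, minimal, maximal)
```

Here the unrestricted endpoints are the four offsets 3 through 6, the Whole Grapheme condition keeps only 4 and 6, and the minimal match ends at offset 4, after the cedilla.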

The changes to the text would include explaining the above situation in the introductory text of that section, and revising and moving DS5. Suggest the following:

Delete current DS5.

Add

DS1a. A boundary condition is a test imposed on an offset within a string. Examples include Whole Grapheme Cluster Search and Whole Word Search, as defined in UAX #29. See [Breaks].

By using grapheme-complete conditions, contractions and combining sequences are not interrupted. This also avoids the need to present visually discontiguous selections to the user (except for BIDI text).

Revise the following:

Suppose there is a collation C, a pattern string P and a target string Q. C has some particular set of attributes, such as a strength setting, and choice of variable weighting.

DS2. There is a match according to C for P within Q[s,e] if and only if C generates the same sort key for P as for Q[s,e].

to

Suppose there is a collation C, a pattern string P and a target string Q, and a boundary condition B. C has some particular set of attributes, such as a strength setting, and choice of variable weighting.

DS2. There is a match according to C for P within Q[s,e] if and only if C generates the same sort key for P as for Q[s,e], and the offsets s and e meet the condition B.

DS2b. A match is grapheme-complete if B requires that the offset be at a grapheme cluster boundary. Note that Whole Word Search as defined in UAX #29 is grapheme-complete. See [Breaks].
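The revised DS2 can be read as a predicate over a sort-key function, a boundary condition, and a pair of offsets. The Python sketch below is illustrative only: toy_key stands in for the sort key generated by C (it ignores marks and case), grapheme_boundary is a simplified approximation of a UAX #29 grapheme cluster boundary, and every name in it is hypothetical.

```python
import unicodedata
from typing import Callable

SortKey = Callable[[str], str]          # stands in for the collation C
Boundary = Callable[[str, int], bool]   # the boundary condition B


def toy_key(s: str) -> str:
    """Toy sort key: decompose, drop marks, fold case."""
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(
        ch for ch in decomposed
        if not unicodedata.category(ch).startswith("M")
    ).casefold()


def any_offset(q: str, i: int) -> bool:
    """No boundary condition: every offset is acceptable."""
    return True


def grapheme_boundary(q: str, i: int) -> bool:
    """Simplified grapheme-complete condition: no break before a mark."""
    if i == 0 or i == len(q):
        return True
    return not unicodedata.category(q[i]).startswith("M")


def is_match(key: SortKey, b: Boundary, p: str, q: str, s: int, e: int) -> bool:
    """DS2: same sort key for P and Q[s,e], and both offsets meet B."""
    return key(p) == key(q[s:e]) and b(q, s) and b(q, e)


def find_matches(key: SortKey, b: Boundary, p: str, q: str) -> list[tuple[int, int]]:
    """All (s, e) pairs in Q that match P under the key and condition."""
    return [
        (s, e)
        for s in range(len(q) + 1)
        for e in range(s + 1, len(q) + 1)
        if is_match(key, b, p, q, s, e)
    ]


Q = "re\u0301sume\u0301s"
print(find_matches(toy_key, any_offset, "resume", Q))
print(find_matches(toy_key, grapheme_boundary, "resume", Q))
```

With no boundary condition the pattern matches both Q[0,7] and Q[0,8]. With the grapheme-complete condition only Q[0,8] remains, because offset 7 falls between the final "e" and its combining acute accent.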

We should also add more explanatory text about combining marks, since their interaction with match boundaries is easy to get wrong.

C. Matching and Searching claims

We don't give a way for people to specifically claim conformance to matching and searching according to Section 8. Suggest (a) removing "(informative)" from the title, and (b) adding:

C5 An implementation claiming conformance to Matching and Searching according to UTS #10 shall meet the requirements described in Section 8.