# starters and non-starters

From: spir (denis.spir@free.fr)
Date: Tue Feb 02 2010 - 12:49:50 CST


Hello,

From the doc:
"Starter: Any code point (assigned or not) with combining class of zero (ccc=0).
[...]
The term Starter refers, in concept, to the starting character of a combining character sequence (D56), because all combining character sequences except defective combining character sequences (D57) commence with a ccc=0 character -- in other words, they start with a Starter.
[...]
Reorderable Pair: Two adjacent characters A and B in a coded character sequence <A,B> are a Reorderable Pair if and only if ccc(A) > ccc(B) > 0."

This means, I guess, that the first character of a combining character sequence is guaranteed to be a starter, i.e. to have ccc=0. I cannot find whether the converse holds: is every following character guaranteed to be a non-starter? That would amount to: "all combining character sequences except defective combining character sequences (D57) continue with ccc!=0".
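One way to probe this question empirically is to look for combining marks whose combining class is zero. The sketch below assumes Python's standard `unicodedata` module; the helper names `is_starter` and `is_combining_mark` are only illustrative, and the results depend on the Unicode version bundled with the interpreter.

```python
import unicodedata

def is_starter(ch: str) -> bool:
    """A starter is any code point whose canonical combining class is 0."""
    return unicodedata.combining(ch) == 0

def is_combining_mark(ch: str) -> bool:
    """True for General_Category Mn, Mc or Me (illustrative helper)."""
    return unicodedata.category(ch) in ("Mn", "Mc", "Me")

# A few marks to inspect: acute, cedilla, Devanagari visarga, enclosing circle.
samples = ["\u0301", "\u0327", "\u0903", "\u20DD"]

for ch in samples:
    print(f"U+{ord(ch):04X}",
          unicodedata.category(ch),
          unicodedata.combining(ch),
          "starter" if is_starter(ch) else "non-starter")
```

If some code point reports both a mark category and ccc=0, then a following character in a sequence is not necessarily a non-starter, and the check on ccc cannot be dropped.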

For sure, some combinations are made of characters that can also stand alone, e.g. combined ideograms. But Unicode may have encoded those characters as separate non-spacing variants, as it did for diacritics.

The reason I ask: an obvious way (for me) to perform normalization (in my case NFD: decomposition + ordering) is to operate on combining sequences that are already "stacked", which I build anyway (following the algorithm for so-called grapheme clusters). Both normalization stages then proceed inside such "code stacks". If the answer to the above question is yes, there is no need to check whether the ccc of a following code is > 0.
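As a rough illustration of the ordering stage only, here is a minimal sketch that assumes the input is already fully decomposed and uses Python's `unicodedata.combining` to obtain the ccc. The function name `canonical_order` is hypothetical. It keeps the ccc check on every code, so it does not rely on the answer to the question above.

```python
import unicodedata

def canonical_order(codes: str) -> str:
    """Reorder each run of non-starters (ccc > 0) by ascending ccc.

    The sort is stable, so marks with equal ccc keep their relative order.
    Starters (ccc == 0) are never moved and act as barriers between runs.
    """
    out = []
    run = []  # pending non-starters since the last starter
    for ch in codes:
        if unicodedata.combining(ch) == 0:
            out.extend(sorted(run, key=unicodedata.combining))
            run = []
            out.append(ch)
        else:
            run.append(ch)
    out.extend(sorted(run, key=unicodedata.combining))
    return "".join(out)

# <a, acute (230), cedilla (202)> becomes <a, cedilla, acute>
stack = "a\u0301\u0327"
print([f"U+{ord(c):04X}" for c in canonical_order(stack)])
```

Because the function works on any string, it can be applied to a single stack or to a whole text; applying it per stack only saves work if every stack is known to contain at most one starter.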

Also, these definitions seem to imply that a combining sequence cannot originally be written with the base following a combining mark, e.g. that a source text holding <U+0307 combining dot above, U+0064 latin small letter d> is simply illegal. Is this true? If yes, a sequence of 2 codes is always properly ordered and we can safely start reordering from the *third* code.
Otherwise, what does "all combining character sequences except defective combining character sequences (D57) commence with a ccc=0" mean in practice? More precisely, how are we supposed to interpret "defective"? The following
"D57 Defective combining character sequence: A combining character sequence that does
• Defective combining character sequences occur when a sequence of combining characters appears at the start of a string or follows a control or format character. Such sequences are defective from the point of view of handling of combining marks, but are not ill-formed. (See D84.)"
is not clear enough for me.
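To see how one existing implementation treats the <U+0307, U+0064> case, here is a small check, assuming Python's standard `unicodedata.normalize`. It only shows what that implementation does, which seems consistent with the "not ill-formed" wording quoted above; it is not an authoritative reading of D57.

```python
import unicodedata

defective = "\u0307d"  # <COMBINING DOT ABOVE, LATIN SMALL LETTER D>
regular = "d\u0307"    # <LATIN SMALL LETTER D, COMBINING DOT ABOVE>

# The mark-first string is accepted and passes through unchanged:
# the dot has no preceding base to attach to, and 'd' is a starter,
# so nothing is reordered or composed.
nfd_defective = unicodedata.normalize("NFD", defective)
nfc_defective = unicodedata.normalize("NFC", defective)

# The base-first string composes to a single code point under NFC.
nfc_regular = unicodedata.normalize("NFC", regular)

print(nfd_defective == defective, nfc_defective == defective)
print(f"U+{ord(nfc_regular):04X}")
```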

In the following table,
Sequence             Combining Classes   Reorderable?   Reason
<a, acute>           0, 230              No             ccc(A)=0
<acute, a>           230, 0              No             ccc(B)=0
<diaeresis, acute>   230, 230            No             ccc(A)=ccc(B)
<cedilla, acute>     202, 230            No             ccc(A)<ccc(B)
<acute, cedilla>     230, 202            Yes            ccc(A)>ccc(B)
the second example seems to show an acute accent ending one combining sequence, followed by an 'a' starting (or being the only code of) the next sequence. Thus, if codes are already stacked, these two reside in separate stacks, e.g. [... (e, acute), (a, grave), ...], and they will not even be compared for reordering.
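The table rows can be reproduced with a direct transcription of the Reorderable Pair definition. This is a minimal sketch assuming Python's `unicodedata.combining` for the ccc values; the name `is_reorderable_pair` is illustrative.

```python
import unicodedata

A = "a"
ACUTE = "\u0301"      # ccc 230
DIAERESIS = "\u0308"  # ccc 230
CEDILLA = "\u0327"    # ccc 202

def is_reorderable_pair(a: str, b: str) -> bool:
    """<A,B> is a Reorderable Pair iff ccc(A) > ccc(B) > 0."""
    return unicodedata.combining(a) > unicodedata.combining(b) > 0

pairs = [
    ("<a, acute>", A, ACUTE),
    ("<acute, a>", ACUTE, A),
    ("<diaeresis, acute>", DIAERESIS, ACUTE),
    ("<cedilla, acute>", CEDILLA, ACUTE),
    ("<acute, cedilla>", ACUTE, CEDILLA),
]

for label, first, second in pairs:
    verdict = "Yes" if is_reorderable_pair(first, second) else "No"
    print(f"{label:20} {verdict}")
```

Note that the predicate is false whenever either character is a starter, which is why <acute, a> is never swapped regardless of how the codes are grouped.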

Denis
________________________________

la vita e estrany

http://spir.wikidot.com/
