From: spir (email@example.com)
Date: Tue Feb 02 2010 - 12:49:50 CST
From the doc:
"Starter: Any code point (assigned or not) with combining class of zero (ccc=0).
The term Starter refers, in concept, to the starting character of a combining character sequence (D56), because all combining character sequences except defective combining character sequences (D57) commence with a ccc=0 character—in other words, they start with a Starter.
Reorderable Pair: Two adjacent characters A and B in a coded character sequence <A,B> are a Reorderable Pair if and only if ccc(A) > ccc(B) > 0."
This means, I guess, that a combining character sequence's first character is guaranteed to be a starter, i.e. to have ccc=0. I cannot find whether the converse statement is true: is every following character guaranteed to be a non-starter? This would mean: "all combining character sequences except defective combining character sequences (D57) continue with ccc!=0".
For sure, some combinations are made of characters that can also stand alone, e.g. combined ideograms. But Unicode may have encoded those characters as non-spacing variants, as it did for diacritics.
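One way to probe the question above is to look up the combining class of a few combining marks. This is only a sketch in Python using the standard unicodedata module; the four sample characters are my own choice for illustration, and the values reported depend on the Unicode version bundled with the interpreter.

```python
import unicodedata

# Sample combining marks (general category M*), picked only for illustration.
SAMPLES = ["\u0301", "\u0327", "\u0903", "\u20DD"]

def is_starter(ch):
    """A starter is any code point whose canonical combining class is 0."""
    return unicodedata.combining(ch) == 0

def describe(ch):
    """Return (name, general category, ccc) for one character."""
    return (unicodedata.name(ch), unicodedata.category(ch), unicodedata.combining(ch))

for ch in SAMPLES:
    kind = "starter" if is_starter(ch) else "non-starter"
    print("U+%04X" % ord(ch), describe(ch), kind)
```

If the last two samples (a spacing mark and an enclosing mark) are reported with ccc=0, then a following code in a combining sequence is not guaranteed to be a non-starter.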
The reason I ask is that an obvious way (for me) to perform normalization (in my case NFD: decomposition + ordering) is to operate on already "stacked" combining sequences, which I will build anyway (according to the algorithm for so-called grapheme clusters). Both normalization stages then proceed inside such "code stacks". If the answer to the above question is yes, then it is not necessary to check whether a following code's ccc is > 0.
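To make the ordering stage concrete, here is a minimal sketch in Python of reordering inside one stack. It covers only the ordering step (no decomposition), it assumes the stack is already a list of code points, and the name reorder_stack is hypothetical. It keeps the ccc > 0 check, so a ccc=0 code inside a stack acts as a barrier.

```python
import unicodedata

def reorder_stack(stack):
    """Canonically order one stack (a list of single-character strings).

    Each maximal run of ccc>0 code points is stable-sorted by ccc;
    code points with ccc=0 stay where they are and split the runs.
    """
    out, run = [], []
    for ch in stack:
        if unicodedata.combining(ch) == 0:
            # A starter closes the current run of non-starters.
            out.extend(sorted(run, key=unicodedata.combining))
            run = []
            out.append(ch)
        else:
            run.append(ch)
    out.extend(sorted(run, key=unicodedata.combining))
    return out

# <d, dot above (ccc 230), dot below (ccc 220)> gets reordered.
simple = list("d\u0307\u0323")
# <a, acute, enclosing circle (ccc 0), cedilla>: the circle blocks reordering.
blocked = list("a\u0301\u20DD\u0327")

for stack in (simple, blocked):
    print(["U+%04X" % ord(c) for c in reorder_stack(stack)])
```

Sorting all following codes by ccc without the check would move the enclosing circle in the second stack, which is why the sketch sorts run by run.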
Also, these definitions seem to imply that a combining sequence cannot be originally defined with the base following a combining mark, e.g. that a source text holding <U+0307 combining dot above, U+0064 latin small letter d> is simply illegal. Is this true? If yes, a sequence of 2 codes is always properly ordered, and we can safely start reordering from the *third* code.
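A quick experiment may help here. This is a sketch in Python that relies on the standard unicodedata.normalize function; the variable names are mine. It feeds a normalizer the sequence in question, plus a pair of two combining marks with no base at all.

```python
import unicodedata

# <U+0307 COMBINING DOT ABOVE, U+0064 LATIN SMALL LETTER D>: mark before base.
defective = "\u0307d"
# Two marks and no base: dot above (ccc 230) followed by dot below (ccc 220).
marks_only = "\u0307\u0323"

def codes(text):
    """Render a string as a list of U+XXXX labels."""
    return ["U+%04X" % ord(c) for c in text]

for sample in (defective, marks_only):
    for form in ("NFD", "NFC"):
        print(form, codes(sample), "->", codes(unicodedata.normalize(form, sample)))

# The mark-first sequence is encodable like any other text.
print(defective.encode("utf-8"))
```

The normalizer accepts the mark-first sequence and leaves it unchanged, since ccc(d)=0 means the pair is not reorderable. The marks-only pair, however, is reordered, so starting from the third code would only be safe when the first code is known to be a starter.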
Otherwise, what does "all combining character sequences except defective combining character sequences (D57) commence with a ccc=0 character" mean in practice? More precisely, how are we supposed to interpret "defective"? The following
"D57 Defective combining character sequence: A combining character sequence that does
not start with a base character.
• Defective combining character sequences occur when a sequence of combining
characters appears at the start of a string or follows a control or format
character. Such sequences are defective from the point of view of handling of
combining marks, but are not ill-formed. (See D84.)"
is not clear enough for me.
In the following table,
Sequence             Combining classes   Reorderable?   Reason
<a, acute>           0, 230              No             ccc(A)=0
<acute, a>           230, 0              No             ccc(B)=0
<diaeresis, acute>   230, 230            No             ccc(A)=ccc(B)
<cedilla, acute>     202, 230            No             ccc(A)<ccc(B)
<acute, cedilla>     230, 202            Yes            ccc(A)>ccc(B)
the second example seems to show an acute accent ending a combining sequence, followed by an 'a' starting (or being the only code of) the next sequence. Thus, if codes are already stacked, these two reside in separate stacks, e.g. [... (e, acute), (a, grave), ...], and they will not even be compared for reordering.
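The table can be reproduced from the quoted definition of a Reorderable Pair. The following is a sketch in Python; the function name is_reorderable_pair and the choice of U+0301, U+0308 and U+0327 for acute, diaeresis and cedilla are my assumptions about which characters the table means.

```python
import unicodedata

def is_reorderable_pair(a, b):
    """True if and only if ccc(A) > ccc(B) > 0, as in the quoted definition."""
    ccc_a = unicodedata.combining(a)
    ccc_b = unicodedata.combining(b)
    return ccc_a > ccc_b > 0

# The five rows of the table, in the same order.
ROWS = [
    ("a", "\u0301"),        # <a, acute>
    ("\u0301", "a"),        # <acute, a>
    ("\u0308", "\u0301"),   # <diaeresis, acute>
    ("\u0327", "\u0301"),   # <cedilla, acute>
    ("\u0301", "\u0327"),   # <acute, cedilla>
]

for a, b in ROWS:
    classes = (unicodedata.combining(a), unicodedata.combining(b))
    print(classes, "Yes" if is_reorderable_pair(a, b) else "No")
```

The test never looks at whether the two codes belong to the same stack: any pair with a ccc=0 member, such as <acute, a>, is rejected by the comparison itself.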
la vita e estrany