starters and non-starters

From: spir (
Date: Tue Feb 02 2010 - 12:49:50 CST



    From the doc:
    "Starter: Any code point (assigned or not) with combining class of zero (ccc=0).
    The term Starter refers, in concept, to the starting character of a combining character sequence (D56), because all combining character sequences except defective combining character sequences (D57) commence with a ccc=0 character—in other words, they start with a Starter.
    Reorderable Pair: Two adjacent characters A and B in a coded character sequence <A,B> are a Reorderable Pair if and only if ccc(A) > ccc(B) > 0."

    This means, I guess, that the first character of a combining character sequence is guaranteed to be a starter, i.e. to have ccc=0. I cannot find whether the converse holds: is every following character guaranteed to be a non-starter? That would mean: "all combining character sequences except defective combining character sequences (D57) continue with ccc!=0"

    For sure, some combinations are made of characters that can also stand alone, e.g. combined ideograms. But Unicode may have encoded those characters as non-spacing variants, as it does for diacritics.

    The reason I ask is that an obvious way (for me) to perform normalization (in my case NFD: decomposition + ordering) is to operate on already "stacked" combining sequences, which I will build anyway (according to the algorithm for so-called grapheme clusters). Both normalization stages then proceed inside such "code stacks". If the answer to the above question is yes, then it is not necessary to check whether a following code's ccc is > 0.
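
    To make the two stages concrete, here is a minimal sketch in Python of decomposition followed by ordering. It does not assume an answer to the question above: it tests ccc > 0 explicitly for every code and only sorts runs of non-starters. The function names (canonical_decompose, canonical_order, nfd_sketch) are mine, not from the standard. It relies on the mappings exposed by Python's unicodedata module, so Hangul syllables, which decompose algorithmically, are left as-is.

```python
import unicodedata


def canonical_decompose(ch: str) -> str:
    """Recursively apply canonical (untagged) decomposition mappings."""
    mapping = unicodedata.decomposition(ch)
    if not mapping or mapping.startswith("<"):
        # No mapping, or a compatibility mapping (tagged, e.g. "<compat>"):
        # keep the character unchanged.
        return ch
    return "".join(canonical_decompose(chr(int(cp, 16))) for cp in mapping.split())


def canonical_order(text: str) -> str:
    """Stable-sort each maximal run of non-starters (ccc > 0) by ccc."""
    out = []
    run = []  # pending non-starters since the last starter
    for ch in text:
        if unicodedata.combining(ch) > 0:
            run.append(ch)
        else:
            # A starter closes the current run; sorted() is stable, so
            # marks with equal ccc keep their relative order.
            out.extend(sorted(run, key=unicodedata.combining))
            run.clear()
            out.append(ch)
    out.extend(sorted(run, key=unicodedata.combining))
    return "".join(out)


def nfd_sketch(text: str) -> str:
    """Decomposition stage followed by ordering stage."""
    return canonical_order("".join(canonical_decompose(ch) for ch in text))


if __name__ == "__main__":
    sample = "\u00e9\u0327"  # e-acute followed by combining cedilla
    print([f"U+{ord(c):04X}" for c in nfd_sketch(sample)])
    print(nfd_sketch(sample) == unicodedata.normalize("NFD", sample))
```

    Because a starter always closes the current run, marks on either side of it are never compared, which is the same effect as working stack by stack.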

    Also, these definitions seem to imply that a combining sequence cannot be originally defined with the base following a combining mark, e.g. that a source text holding <U+0307 combining dot above, U+0064 latin small letter d> is simply illegal. Is this true? If yes, a sequence of 2 codes can only be properly ordered and we can safely start reordering from the *third* code.
    Otherwise, what does "all combining character sequences except defective combining character sequences (D57) commence with a ccc=0" mean in practice? More precisely, how are we supposed to interpret "defective"? The following
    "D57 Defective combining character sequence: A combining character sequence that does
         not start with a base character.
       • Defective combining character sequences occur when a sequence of combining
         characters appears at the start of a string or follows a control or format
         character. Such sequences are defective from the point of view of handling of
         combining marks, but are not ill-formed. (See D84.)"
    is not clear enough for me.
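
    As a data point only, the sequence <U+0307, U+0064> can be fed to an existing implementation to see how it is treated. The sketch below uses Python's unicodedata module; the variable names are mine. It shows what one implementation does and is not a reading of D57.

```python
import unicodedata

# <U+0307 combining dot above, U+0064 latin small letter d>
seq = "\u0307\u0064"

# Combining classes of the two codes, in order.
classes = [unicodedata.combining(ch) for ch in seq]
print(classes)  # [230, 0]

# The string encodes without error, and neither NFD nor NFC changes it.
encoded = seq.encode("utf-8")
unchanged = all(unicodedata.normalize(form, seq) == seq for form in ("NFD", "NFC"))
print(unchanged)
```

    Here the mark and the following 'd' are neither swapped nor rejected, which would be consistent with "defective ... but not ill-formed".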

    In the following table,
    Sequence             Combining classes   Reorderable?   Reason
    <a, acute>           0, 230              No             ccc(A)=0
    <acute, a>           230, 0              No             ccc(B)=0
    <diaeresis, acute>   230, 230            No             ccc(A)=ccc(B)
    <cedilla, acute>     202, 230            No             ccc(A)<ccc(B)
    <acute, cedilla>     230, 202            Yes            ccc(A)>ccc(B)
    the second example seems to show an acute accent ending a combining sequence, followed by an 'a' starting (or being the only code of) the next sequence. Thus, if codes are already stacked, these two reside in separate stacks, e.g. [... (e, acute), (a, grave), ...], and they will not even be compared for reordering.
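
    The table rows can be checked mechanically against the Reorderable Pair definition (ccc(A) > ccc(B) > 0). A small sketch in Python follows; the helper is_reorderable_pair and the choice of U+0301, U+0308 and U+0327 for acute, diaeresis and cedilla are my assumptions about what the table means.

```python
import unicodedata


def is_reorderable_pair(a: str, b: str) -> bool:
    """True if and only if ccc(A) > ccc(B) > 0, as in the quoted definition."""
    return unicodedata.combining(a) > unicodedata.combining(b) > 0


A, ACUTE, DIAERESIS, CEDILLA = "a", "\u0301", "\u0308", "\u0327"

# (label, first code, second code), in the same order as the table rows.
ROWS = [
    ("<a, acute>", A, ACUTE),
    ("<acute, a>", ACUTE, A),
    ("<diaeresis, acute>", DIAERESIS, ACUTE),
    ("<cedilla, acute>", CEDILLA, ACUTE),
    ("<acute, cedilla>", ACUTE, CEDILLA),
]

results = [is_reorderable_pair(first, second) for _, first, second in ROWS]

for (label, first, second), reorderable in zip(ROWS, results):
    classes = (unicodedata.combining(first), unicodedata.combining(second))
    print(f"{label:20} {classes!s:12} {'Yes' if reorderable else 'No'}")
```

    Only the last row satisfies the condition; <acute, a> fails it because ccc(B)=0, whichever stack the two codes end up in.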


    la vita e estrany
