Re: starters and non-starters

From: spir (
Date: Wed Feb 03 2010 - 05:19:45 CST

    On Tue, 2 Feb 2010 18:53:20 -0700
    "Doug Ewell" <> wrote:

    > > COMBINING DOT ABOVE followed by LATIN SMALL LETTER D would not be a
    > > valid sequence, correct, but you should start working from the d, not
    > > the code that follows. After all, the "d" by itself *IS* a valid
    > > sequence, whether or not a combining character comes after it. It's
    > > the orphaned combining dot that is defective.
    > There's another problem with spir's original statement. You can't say
    > that "a source text holding <0307, 0064> is illegal" because the U+0307
    > might not be orphaned at all, but might be preceded by another base
    > character. The bracketed text [ėd] consists of the sequence <0065,
    > 0307, 0064> and is perfectly legal.
    > Perhaps spir meant "a source text containing *only* that sequence" or
    > "starting with that sequence." This is a nitty detail, but when dealing
    > with an inherently stateful concept like combining sequences, nitty
    > details matter.

    What I meant is: is it legal to encode a "user-perceived character" in complete disorder, e.g. with a combining mark placed before what is obviously its base character? In the example, that means having the <dot above> come first. I interpret your answers as meaning no, it is illegal.
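    To make the case concrete, here is a minimal Python sketch that tests whether a text begins with a combining mark, using the standard unicodedata module. The function name is mine, and treating "general category starts with M" as the test for a mark is an assumption of this sketch.

```python
import unicodedata

def starts_with_combining_mark(text: str) -> bool:
    """True if the first code point is a mark (general category Mn, Mc or Me)."""
    return bool(text) and unicodedata.category(text[0]).startswith("M")

# <0307, 0064>: the dot above comes first, with no base before it
print(starts_with_combining_mark("\u0307\u0064"))
# <0065, 0307, 0064>: the same two codes, but preceded by a base "e"
print(starts_with_combining_mark("\u0065\u0307\u0064"))
```

    This only looks at the very start of the text, which is the point Doug raised: the same two codes are fine once a base character precedes them.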
    The consequence would be that only the codes after the base can be out of order. If codes are already "stacked" (grouped into combining sequences) before normalization, then we can safely ignore "stacks" with fewer than 3 codes, _and_ start reordering from the 3rd code on, since that is the first position where a swap can occur. Pseudocode:

    foreach stack in stacks do
        size = size(stack)
        if size < 3 then
            next # next stack: nothing to reorder
        # kind of bubble sort, but ignoring first code (the base)
        repeat
            no_swap = true
            for i=3 to size do # here index base = 1
                code1, code2 = stack[i-1], stack[i]
                ccc1, ccc2 = getCCC(code1), getCCC(code2)
                # a code with ccc = 0 is never moved
                if ccc2 > 0 and ccc1 > ccc2 then
                    <swap codes>
                    no_swap = false
        until no_swap

    So, if "stacks" are built before normalization, only a small proportion of combining sequences *possibly* require reordering (and an even smaller proportion actually are reordered).
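    For anyone who wants to try it, the pseudocode above can be sketched in Python roughly as follows. It assumes each "stack" is a str holding one base code followed by its combining marks; reorder_stack is a name I made up, and unicodedata.combining() stands in for getCCC. The `ccc2 > 0` guard keeps a mark of class 0 in place, which is how the standard defines a reorderable pair.

```python
import unicodedata

def reorder_stack(stack: str) -> str:
    """Bubble-sort the codes after the first one by canonical combining class."""
    codes = list(stack)
    if len(codes) < 3:
        return stack  # a base plus at most one mark cannot be out of order
    swapped = True
    while swapped:
        swapped = False
        for i in range(2, len(codes)):  # index 0 (the base) is never touched
            ccc1 = unicodedata.combining(codes[i - 1])
            ccc2 = unicodedata.combining(codes[i])
            if ccc2 > 0 and ccc1 > ccc2:
                codes[i - 1], codes[i] = codes[i], codes[i - 1]
                swapped = True
    return "".join(codes)

# d + COMBINING DOT ABOVE (ccc 230) + COMBINING DOT BELOW (ccc 220)
disordered = "d\u0307\u0323"
reordered = reorder_stack(disordered)
print([f"{ord(c):04X}" for c in reordered])
# Cross-check against the library's own canonical decomposition
print(reordered == unicodedata.normalize("NFD", disordered))
```

    The comparison with normalize("NFD", ...) only agrees here because none of the codes in the example has a canonical decomposition of its own; the sketch does the reordering step only, not decomposition.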

    Side-question: why are disordered combining sequences even allowed?


    la vita e estrany
