Re: ZWJ, ZWNJ, CGJ and combination

From: Peter Kirk (peterkirk@qaya.org)
Date: Mon Nov 10 2003 - 06:58:02 EST

    On 09/11/2003 22:45, Philippe Verdy wrote:

    >From: "Peter Kirk" <peterkirk@qaya.org>
    >
    >>On 09/11/2003 14:55, Philippe Verdy wrote:
    >>
    >>>...
    >>>
    >>>And canonical normalization _guarantees_ to preserve *only* "starter
    >>>sequences" (defective or not), but not necessarily "combining character
    >>>sequences" (defective or not), and that's where care must be taken when
    >>>encoding text...
    >>>
    >>Surely not. A combining character sequence consists of an optional base
    >>character followed by one or more combining characters. Canonical
    >>normalisation preserves the sequence of combining characters only,
    >>although it may reorder this sequence. It also preserves without
    >>reordering the juxtaposition of this sequence to the optional base
    >>character. Therefore the combining character sequence is preserved.
    >>
    >
    >That's where we differ:
    >The combining character sequence differs from what I define as a starter
    >sequence:
    >(1) by the fact it can contain more than one class 0 characters (starters),
    >namely all class 0 combining characters (gc=Mn), and
    >(2) by the fact that a combining character sequence cannot contain some
    >class 0 characters (like unagreed PUAs, controls, and line/paragraph
    >separators, which are treated individually and not as part of a combining
    >character sequence).
    >
    >The second difference is less critical for us (all it does is create
    >occurrences of defective combining character sequences in the middle
    >of the text), but the first one is critical here...
    >
    This does not affect my argument. A combining character sequence, as
    defined, does not perfectly fit your definition "an unordered set of
    sequences of characters having the same combining class." But it is
    preserved under canonical normalisation. Well, perhaps that depends on
    what you mean by "preserved". If you mean that its code point
    representation is unchanged, that is not true of your starter sequences
    either. If you mean that its semantics are unchanged, then it is true by
    definition of any string of Unicode characters that its semantics are
    unchanged by canonical normalisation, or indeed by any transformation
    into a canonically equivalent form.
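
    A minimal sketch of the two senses of "preserved", assuming Python's
    standard unicodedata module. The helper name canonically_equivalent is
    made up for illustration, and the sample strings are my own choice:

```python
import unicodedata

def canonically_equivalent(a: str, b: str) -> bool:
    """True if a and b have the same canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

# a + COMBINING ACUTE (ccc=230) + COMBINING DOT BELOW (ccc=220):
# the marks are not in canonical order.
s1 = "a\u0301\u0323"
# a + COMBINING DOT BELOW (ccc=220) + COMBINING ACUTE (ccc=230):
# the marks are already in canonical order.
s2 = "a\u0323\u0301"

nfd1 = unicodedata.normalize("NFD", s1)

print(s1 == nfd1)                      # False: the code points were reordered
print(nfd1 == s2)                      # True: both spellings share one NFD form
print(canonically_equivalent(s1, s2))  # True
```

    The code point representation of s1 changes under normalisation, yet
    s1 and s2 remain canonically equivalent throughout.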

    >I still maintain that there's no terminology to designate what I call a
    >starter sequence.
    >
    Agreed. But does it matter? It does so only if this is a meaningful unit
    within Unicode. On my understanding, a sequence of combining characters
    all of class >0 is meaningful because this is what canonical reordering
    operates on. But such a sequence does not necessarily form a unit with
    the preceding character.
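
    To illustrate which units canonical reordering works on, here is a
    rough sketch, again assuming Python's unicodedata module. The function
    name nonzero_class_runs is hypothetical, and CGJ (U+034F) is used only
    as a convenient example of a combining character (gc=Mn) of class 0:

```python
import unicodedata

def nonzero_class_runs(text: str) -> list[str]:
    """Return the maximal runs of characters whose combining class is > 0."""
    runs, current = [], []
    for ch in text:
        if unicodedata.combining(ch) > 0:
            current.append(ch)
        else:
            # A class 0 character ends the current run, whether it is a
            # base letter or a class 0 combining mark such as CGJ.
            if current:
                runs.append("".join(current))
            current = []
    if current:
        runs.append("".join(current))
    return runs

plain = "a\u0301\u0323"          # acute (230) then dot below (220)
blocked = "a\u0301\u034f\u0323"  # same marks with CGJ (class 0) between them

print(unicodedata.category("\u034f"), unicodedata.combining("\u034f"))  # Mn 0
print(len(nonzero_class_runs(plain)))    # 1: one run, so the marks can be reordered
print(len(nonzero_class_runs(blocked)))  # 2: CGJ splits the marks into separate runs
print(unicodedata.normalize("NFD", plain) == "a\u0323\u0301")  # True: reordered
print(unicodedata.normalize("NFD", blocked) == blocked)        # True: unchanged
```

    Both strings are a single combining character sequence, but only the
    first is a single run for the purposes of reordering.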

    -- 
    Peter Kirk
    peter@qaya.org (personal)
    peterkirk@qaya.org (work)
    http://www.qaya.org/
    

