Re: Re: ZWJ, ZWNJ, CGJ and combination

From: Philippe Verdy
Date: Sun Nov 09 2003 - 17:55:12 EST

    From: "Peter Kirk"
    > >Not at all! Maybe with supplementary markup my sentence
    > >will be clearer:
    > > A "starter sequence" (defective or not) is then an
    > > _unordered_ set of {
    > > _ordered_ sequences of {
    > > characters having the same combining class
    > > }
    > > }.
    > >Then look at where I used the term "set" as defined by this sentence; the
    > >term "element" refers to an element of the unordered set, i.e. an "ordered
    > >sequence of characters having the same combining class".
    > >
    > >
    > OK, this time you are right and I am wrong; although your definition
    > does not include all canonically equivalent orderings of your "starter
    > sequence" because it excludes ones in which a combining character in
    > class b is ordered between two of class a, a not equal to b.

    Here again the definition is clear: the coded sequence <a1, b, a2>
    contains the unordered set { <a1, a2>, <b> }. Why would you want
    a1 and a2 to be in separate elements of the set when they match the
    definition of "characters having the same combining class"?

    Note however that the sentence is taken out of its context, which also
    gives a constraint on the definition of "starter sequences". More formally:

        - if the unordered set contains an element which is an
        ordered sequence of characters of combining class 0 (starters),
        then this sequence must contain only one character, and this
        character must be the first one coded in the starter sequence.
        - if such an element is present, the "starter sequence" is
        "non-defective"; otherwise it is "defective".
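
    As a rough sketch of the definition above (Python is my choice here, and
    the names class_elements and is_defective are only illustrative), one
    starter sequence can be grouped by combining class while keeping the coded
    order inside each class. The sketch assumes its input is already a single
    starter sequence and does not check the "only one starter, coded first"
    constraint itself:

```python
import unicodedata

def class_elements(starter_sequence):
    """Group one starter sequence by combining class, keeping coded order within each class."""
    elements = {}
    for ch in starter_sequence:
        ccc = unicodedata.combining(ch)  # canonical combining class, 0 for starters
        elements.setdefault(ccc, []).append(ch)
    # A dict compares without regard to key order, like the unordered set.
    return {ccc: "".join(chars) for ccc, chars in elements.items()}

def is_defective(starter_sequence):
    """True when no element of combining class 0 (a starter) is present."""
    return 0 not in class_elements(starter_sequence)

# <a1, b, a2> after a starter: a1 = U+0301 (class 230), b = U+0323 (class 220),
# a2 = U+0300 (class 230); a1 and a2 land in the same element.
print(class_elements("e\u0301\u0323\u0300"))
print(is_defective("\u0301\u0323"))
```

    With this grouping, a1 and a2 stay together in one ordered element even
    though b is coded between them.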

    An interesting property is that a defective starter sequence is necessarily
    also part of a defective combining sequence.

    But the converse is false: a "defective combining sequence" is not
    necessarily part of a "defective starter sequence".

    For example: <LF, COMBINING ACUTE ACCENT> is a *non-defective* starter
    sequence, but it contains the defective combining sequence
    <COMBINING ACUTE ACCENT> after the isolated <LF> control (which is
    technically not a combining sequence, and so is not defective).
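
    The difference in this example can be checked with a small sketch (again
    Python, with illustrative helper names). It assumes, as my reading of the
    standard, that a "base character" is a graphic character (general category
    L, N, P, S or Zs) that is not a combining mark, whereas a "starter" is
    simply any character of combining class 0:

```python
import unicodedata

def first_is_starter(seq):
    """The starter sequence is non-defective if it begins with a class-0 character."""
    return unicodedata.combining(seq[0]) == 0

def first_is_base(seq):
    """Assumed test for a base character: graphic and not a combining mark."""
    cat = unicodedata.category(seq[0])
    return cat[0] in "LNPS" or cat == "Zs"

lf_acute = "\n\u0301"  # <LF, COMBINING ACUTE ACCENT>
print(first_is_starter(lf_acute))  # LF has combining class 0
print(first_is_base(lf_acute))     # LF is a control (Cc), so the accent has no base
```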

    The value of this definition is that almost all Unicode algorithms
    actually work on very basic "starter sequences", and not on "combining
    character sequences", which can be parsed only after a precise definition
    of character properties.

    And canonical normalization is _guaranteed_ to preserve *only* "starter
    sequences" (defective or not), not necessarily "combining character
    sequences" (defective or not), and that's where care must be taken when
    encoding text...
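
    A limited illustration of the reordering side of this, in Python with an
    illustrative helper name: the sketch below only exercises canonical
    reordering under NFD on an already decomposed string. It shows one case
    and does not demonstrate the general guarantee:

```python
import unicodedata

def class_elements(starter_sequence):
    """Map each combining class to the ordered string of its characters."""
    elements = {}
    for ch in starter_sequence:
        ccc = unicodedata.combining(ch)
        elements[ccc] = elements.get(ccc, "") + ch
    return elements

original = "a\u0301\u0323\u0300"  # classes 0, 230, 220, 230 in coded order
reordered = unicodedata.normalize("NFD", original)

# The coded order changes (class 220 moves before class 230) ...
print(original != reordered)
# ... but each per-class element keeps its characters and their relative order.
print(class_elements(original) == class_elements(reordered))
```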
