Re: [hebrew] Re: Hebrew combining classes (was ISO 10646 compliance and EU law)

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sat Jan 15 2005 - 20:12:32 CST


    From: "Peter Kirk" <peterkirk@qaya.org>
    > Elaine, the good news for you is that if you order your Unicode Hebrew
    > text according to these 'alternative combining classes' you will not be
    > deviating at all from the Unicode standard. Your text will not be
    > normalised in any of the standard normalisation forms, but the standard
    > nowhere specifies that texts must be normalised. Of course you need to
    > ensure that your text is not normalised by other processes, or that if it
    > is you then restore it to the order of the 'alternative combining
    > classes' - a process which should be reversible.

    Note that you can't define "alternative combining classes" the way you want
    if you need to preserve canonical equivalence.

    Notably:

    (1) You can't change a non-zero combining class into a zero combining class
    (or the reverse): this means that starters remain starters, and non-starters
    in combining sequences remain non-starters (both points are illustrated in
    the sketch below).

    (2) If you change the combining class of some character without also
    changing it for the other combining characters in the same class, you may
    break canonical equivalence: marks given new, distinct classes can be
    reordered freely by normalization, where characters sharing the same
    standard class would have kept their relative order. One example: changing
    the combining class of the above-right form of the cedilla to match its
    special positioning on some letters, without also changing the combining
    class of all the other diacritics attached below, would break canonical
    equivalence.
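
    A small sketch in Python (using the standard unicodedata module)
    illustrates both points with Latin diacritics:

        import unicodedata

        # U+0301 COMBINING ACUTE ACCENT has class 230, U+0323 COMBINING DOT
        # BELOW has class 220, so canonical reordering may swap them:
        s = "a\u0301\u0323"
        print([hex(ord(c)) for c in unicodedata.normalize("NFD", s)])
        # -> ['0x61', '0x323', '0x301'] (the dot below now precedes the acute)

        # Two marks sharing class 220 are never swapped:
        t = "a\u0323\u0325"            # dot below, then ring below
        print(unicodedata.normalize("NFD", t) == t)   # True: order preserved

        # The classes themselves, as recorded in the UCD:
        for c in "a\u0301\u0323\u0325":
            print(hex(ord(c)), unicodedata.combining(c))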

    In other words: all Unicode characters live in a strictly partitioned
    space, made of distinct subsets of characters sharing the same combining
    class. These subsets are immutable (you can't move a character from one
    subset to another without breaking canonical equivalence).

    These subsets are numbered quite arbitrarily (from 0 to 255), and the
    absolute value of the number has no importance in itself: what matters is
    class 0 versus non-zero, and, for the normalization forms, only the
    relative order of the non-zero combining classes.
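
    For instance, a quick look at the classes assigned to some of the Hebrew
    points (the printed values come from the UCD at runtime; they fall in the
    10..25 range, but the numbers themselves carry no meaning beyond their
    relative order):

        import unicodedata

        points = {
            "SHEVA":  "\u05B0",
            "HIRIQ":  "\u05B4",
            "PATAH":  "\u05B7",
            "QAMATS": "\u05B8",
            "METEG":  "\u05BD",
        }
        for name, ch in points.items():
            print(name, hex(ord(ch)), unicodedata.combining(ch))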

    For these reasons, there's no way to make the combining class values match
    all the actual positioning interactions of combining characters. The
    "names" assigned to combining classes are not accurate in all cases and
    are just an approximation. If the relative order of two diacritics in a
    combining sequence is important, they must either share the same combining
    class, or be separated by a class-0 control character (like CGJ, or ZWJ
    and ZWNJ) that blocks canonical reordering.
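
    A sketch of the CGJ technique (the approach suggested for Biblical Hebrew,
    cf. UTN #19): a class-0 character between two marks keeps normalization
    from swapping them.

        import unicodedata

        bet = "\u05D1"                     # HEBREW LETTER BET
        patah, hiriq = "\u05B7", "\u05B4"  # different combining classes
        cgj = "\u034F"                     # COMBINING GRAPHEME JOINER, class 0

        without_cgj = bet + patah + hiriq
        with_cgj    = bet + patah + cgj + hiriq

        # Canonical reordering swaps patah and hiriq in the first string ...
        print(unicodedata.normalize("NFC", without_cgj) == without_cgj)  # False
        # ... but the CGJ keeps the encoded order stable:
        print(unicodedata.normalize("NFC", with_cgj) == with_cgj)        # True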

    Also, the restriction on combining class 0 means that there's no way to
    turn a single grapheme cluster encoded as two successive but separate
    combining sequences into a single combining sequence (this is important
    for most Indic and South-East Asian scripts, which have known interactions
    between combining sequences, notably those involving VIRAMA-like
    characters).
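
    For example, a Devanagari conjunct such as KSSA is one grapheme cluster
    but two combining character sequences, because SSA has class 0 and so
    starts a new sequence; no reassignment of combining classes can merge the
    two. A small sketch:

        import unicodedata

        kssa = "\u0915\u094D\u0937"   # KA + VIRAMA + SSA
        for ch in kssa:
            print(hex(ord(ch)), unicodedata.name(ch), unicodedata.combining(ch))
        # KA: 0, VIRAMA: 9, SSA: 0 -> sequences <KA, VIRAMA> and <SSA>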

    Correct processing of text cannot depend only on combining sequences. So
    the impact of an "incorrect" relative order of combining classes is very
    small, given that this is not the proper level of abstraction at which to
    handle these cases.

    So if you need to change combining classes into custom ones for rendering
    purposes, you will do that as part of the processing that transforms a
    string from logical to physical order. This may produce identical results
    from strings that are initially canonically different, and users won't be
    able to see any difference when they look at the characters at the
    grapheme cluster level!
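
    A minimal, hypothetical sketch of such a logical-to-physical reordering
    pass (RENDER_CLASS is an invented map, not anything defined by Unicode):

        import unicodedata

        RENDER_CLASS = {
            "\u0323": 1,   # COMBINING DOT BELOW  (hypothetical display order)
            "\u0325": 2,   # COMBINING RING BELOW (hypothetical display order)
        }

        def render_key(mark):
            # Fall back to the standard canonical class for unlisted marks.
            return RENDER_CLASS.get(mark, unicodedata.combining(mark))

        def to_physical_order(text):
            # Stable-sort each run of combining marks by the custom class.
            out, run = [], []
            for ch in text:
                if unicodedata.combining(ch) == 0:
                    out.extend(sorted(run, key=render_key))
                    run = []
                    out.append(ch)
                else:
                    run.append(ch)
            out.extend(sorted(run, key=render_key))
            return "".join(out)

        # Both marks have standard class 220, so the two inputs are canonically
        # distinct; yet they collapse to the same physical order:
        print(to_physical_order("a\u0323\u0325") ==
              to_physical_order("a\u0325\u0323"))   # True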

    So the concepts of canonical equivalence and combining classes are not
    suited to linguistic analysis, and they are not even enough for some
    security-related uses (where the better concept would be the collation of
    strings). They are just a simplification of a more complex problem, one
    that (sometimes) reduces the number of possible encodings of the logically
    "same" string, and so the number of strings that have to be recognized.
    Unfortunately, this does not reduce that number to one and only one
    (though this can be mitigated by orthographic conventions applied to
    encoded texts, such as those suggested in UTN #19).
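
    A last small illustration of that limitation: two above marks share class
    230, so their two encoded orders survive normalization as distinct
    strings, even when an orthography would treat them as the "same" letter.

        import unicodedata

        s1 = unicodedata.normalize("NFC", "a\u0301\u0308")  # acute, then diaeresis
        s2 = unicodedata.normalize("NFC", "a\u0308\u0301")  # diaeresis, then acute
        print(s1 == s2)   # False: canonical equivalence doesn't unify them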


