Re: Display of Isolated Nonspacing Marks (problems with UAX#29)

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sun Aug 10 2003 - 19:24:44 EDT

    On Sunday, August 10, 2003 9:17 PM, Peter Kirk <peter.r.kirk@ntlworld.com> wrote:

    > On 10/08/2003 10:09, Michael Everson wrote:
    >
    > > At 01:30 +0200 2003-08-10, Philippe Verdy wrote:
    > >
    > > > Whatever you think, the SPACE+diacritic is still a hack, and
    > > > certainly not a canonical equivalent (including for its
    > > > properties), of the existing spacing diacritics, which also do
    > > > not fit all usages because they are symbols.
    > >
    > >
    > > It is the formally specified way to represent what you say you want
    > > to represent. If an implementation doesn't do that nicely enough,
    > > complain to the implementors. (This has already been suggested to
    > > you.)

    Example of problem with SPACE+diacritics in UAX#29:

    - Grapheme clusters:
    "One or more Unicode characters may make up what the user thinks of as a character or basic unit of the language. To avoid ambiguity with the computer use of the term character, this is called a grapheme cluster. For example, “G” + acute-accent is a grapheme cluster: it is thought of as a single character by users, yet is actually represented by two Unicode code points. For more information on the ambiguity in the term character, see UTR #17: Character Encoding Model
    (...)
    Grapheme clusters commonly behave as units in terms of mouse selection, arrow key movement, backspacing, and so on. When this is done, for example, and an accented character is represented by a combining character sequence, then using the right arrow key would skip from the start of the base character to the end of the last combining character."

    So combining sequences like SPACE+diacritics are grapheme clusters.
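
    A minimal sketch of this point, under a simplifying assumption: a cluster is taken to be one base character plus the nonspacing or enclosing marks (general categories Mn and Me) that follow it. This is not the full UAX#29 algorithm, and the helper name simple_clusters is only illustrative.

```python
import unicodedata

def simple_clusters(text):
    """Group each base character with the Mn/Me marks that follow it.

    Simplified on purpose: no Hangul, CR LF or other UAX#29 special cases.
    """
    clusters = []
    for ch in text:
        is_mark = unicodedata.category(ch) in ("Mn", "Me")
        if is_mark and clusters:
            clusters[-1] += ch      # attach the mark to the preceding base
        else:
            clusters.append(ch)     # start a new cluster
    return clusters

# G + COMBINING ACUTE, SPACE + COMBINING ACUTE, x
sample = "G\u0301 \u0301x"
clusters = simple_clusters(sample)
for cluster in clusters:
    print(" ".join(f"U+{ord(c):04X}" for c in cluster))
```

    Under that assumption, SPACE followed by U+0301 comes out as one cluster, exactly like G followed by U+0301.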

    - Word boundaries:
    "(rule 3) Treat a grapheme cluster as if it were a single character: the first character of the cluster.
         GC → FC"

    This seems to be the only rule that deals with combining sequences and combining characters, which the other rules otherwise ignore. So SPACE+diacritics is handled like a plain SPACE.

    - Sentence boundaries:
    "(rule 4) Treat a grapheme cluster as if it were a single character: the first character of the cluster.
         GC → FC"

    Same problem.
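
    The effect of "GC → FC" on SPACE+diacritics can be sketched as follows. This is only an illustration: a cluster is again assumed to be a base character plus the Mn/Me marks that follow it, and first_characters is a hypothetical helper, not part of any UAX#29 implementation.

```python
import unicodedata

def first_characters(text):
    """Reduce every (simplified) cluster to its first character, as in GC → FC."""
    out = []
    for ch in text:
        if unicodedata.category(ch) in ("Mn", "Me") and out:
            continue        # the mark belongs to the previous cluster: ignored
        out.append(ch)
    return "".join(out)

with_diacritic = "word \u0301next"   # SPACE + COMBINING ACUTE between two words
plain_space = "word next"

# After the reduction, the boundary rules see the same input in both cases.
print(first_characters(with_diacritic) == first_characters(plain_space))
```

    In this simplified view the word and sentence rules cannot tell the isolated diacritic from an ordinary inter-word space.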

    - " 6.1 Normalization
    Although boundaries are specified in terms of NFD text, in practice normalization is not required. The Grapheme Cluster specification has a number of features to ensure that the same results are returned for canonically equivalent text. It will not break within a sequence of non-spacing marks, which is the only part that can reorder in the formation of NFD. Nor is there ever a break between a base character and subsequent non-spacing marks. It also has a special set of characters marked as having the Extend property value, such as U+09BE ( ◌া ) BENGALI VOWEL SIGN AA, to deal with particular compositions.
    The other default boundary specifications never break within grapheme clusters, and always use a consistent property value for each grapheme cluster as a whole."
    This just specifies that there will be no break between the base character SPACE and its diacritics, but says nothing about possible breaks after or before the combining sequence.
    - "6.2 Grapheme Cluster and Format Rules
    The first rule for the default word and sentence specifications is to treat a grapheme cluster as a single character: the first character of the cluster. This would be equivalent to making the following changes to the subsequent rules.
    (...)
    Insert Extend* after every boundary property value — except after the final property after the break symbol.
    Thus X Y × Z W becomes X Extend* Y Extend* × Z Extend* W .
    Thus X Y × becomes X Extend* Y Extend* ×"

    So rules like "X SPACE × Z" become "X Extend* SPACE Extend* × Z", one instance of which is "X SPACE diacritics × Z".
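
    The rewriting quoted from section 6.2 can be sketched mechanically. The function below is only an illustration of that one sentence of the annex: the name insert_extend and the representation of a rule as a space-separated string are assumptions made for this sketch.

```python
BREAK_SYMBOLS = {"×", "÷"}   # "no break" and "break" symbols used in the rules

def insert_extend(rule):
    """Insert Extend* after every property value, except after the final
    property that follows the break symbol."""
    tokens = rule.split()
    break_positions = [i for i, t in enumerate(tokens) if t in BREAK_SYMBOLS]
    last_break = break_positions[-1] if break_positions else -1
    out = []
    for i, tok in enumerate(tokens):
        out.append(tok)
        if tok in BREAK_SYMBOLS:
            continue
        is_final_after_break = (
            last_break != -1 and i == len(tokens) - 1 and i > last_break
        )
        if not is_final_after_break:
            out.append("Extend*")
    return " ".join(out)

for rule in ("X Y × Z W", "X Y ×", "X SPACE × Z"):
    print(rule, "=>", insert_extend(rule))
```

    The first two inputs are the examples given in the annex; the third is the SPACE case discussed here.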

    This is also confirmed by the fact that normalization is explicitly NOT required to process text boundaries, which is exactly the place where the use of SPACE causes the most important problems for text processing and rendering.
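
    That normalization does not help here can be checked with Python's standard unicodedata module. The sketch compares SPACE + U+0301 with the spacing U+00B4 ACUTE ACCENT; the variable names are only illustrative.

```python
import unicodedata

space_acute = "\u0020\u0301"    # SPACE + COMBINING ACUTE ACCENT
spacing_acute = "\u00B4"        # ACUTE ACCENT (spacing diacritic)

# The canonical forms leave both spellings untouched, so they stay distinct.
canonically_distinct = all(
    unicodedata.normalize(form, space_acute) == space_acute
    and unicodedata.normalize(form, spacing_acute) == spacing_acute
    for form in ("NFC", "NFD")
)
print("canonically distinct:", canonically_distinct)

# Only the compatibility decomposition relates the two.
nfkd = unicodedata.normalize("NFKD", spacing_acute)
print("decomposition of U+00B4:", unicodedata.decomposition(spacing_acute))
print("NFKD(U+00B4) == SPACE + U+0301:", nfkd == space_acute)

# The properties differ as well: a modifier symbol versus a space separator.
print(unicodedata.category(spacing_acute), unicodedata.category(" "))
```

    So the two representations are related only by a compatibility mapping, and they carry different general categories.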

    ---
    Similar problems occur with UAX#14 (Line Breaking), which overlooks the case of SPACE+diacritics: the sequence is handled there as if it were only its first character. What is worse is this description:
    "SP - Space (A) - (normative)
     0020 SPACE (SP)
    The space characters are explicit break opportunities, but spaces at the end of a line are not measured for fit. If there is a sequence of space characters, and breaking after any of the space characters would result in the same visible line, the line breaking position after the last space character in the sequence is the locally most optimal one. In other words, since the last character measured for fit is before the space character, any number of space characters are kept together invisibly on the previous line and the first non-space character starts the next line. NOTE: SPACE, but none of the other breaking spaces, is used in determining an indirect break."
    This statement clearly ignores the existence of SPACE+diacritics... Same thing for:
    "ZW - Zero Width Space (A) - (normative)
     200B ZERO WIDTH SPACE (ZWSP)
    This character does not have width. It is used to enable additional (invisible) break opportunities wherever SPACE cannot be used."
    This shows that ZWSP+diacritics would not work either for isolated Hebrew diacritics (where the implied base letter is missing).
    Note that these two classes of characters, ZW and SP, are *normative*. This is further proof that SPACE+diacritics is really a hack causing many problems in the main Unicode standard and its standard annexes.
    Similar problems also exist in UAX#9 (the BiDi algorithm), which describes problematic normative properties such as the neutrality of the SPACE character in mixed-direction text: where would SPACE+diacritics be displayed if there is a directionality change on either side of this combining sequence? No such problem occurs with the existing spacing diacritics, which are handled regularly, like symbols. The relevant text is:
    "3.3.3. Resolving Weak Types
    Weak types are now resolved one level run at a time. At level run boundaries where the type of the character on the other side of the boundary is required, the type assigned to sor or eor is used.
    Non-spacing marks are now resolved based on the previous characters.
    W1. Examine each non-spacing mark (NSM) in the level run, and change the type of the NSM to the type of the previous character. If the NSM is at the start of the level run, it will get the type of sor.
    Assume in this example that sor is R:
      AL  NSM NSM => AL  AL  AL
      sor NSM     => sor R"
    Nothing else is said about diacritics, and here SPACE does not have the "AL" bidirectional type used in the example (its own type is WS, whitespace). So the representation is still undefined here...
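
    Applied mechanically, rule W1 as quoted above can be sketched like this. The function resolve_w1 is a hypothetical helper covering only that single rule, on a list of bidirectional types for one level run; it is not a BiDi implementation.

```python
import unicodedata

def resolve_w1(types, sor):
    """W1 as quoted: each NSM takes the type of the previous character,
    or the type of sor when it starts the level run."""
    resolved = []
    previous = sor
    for t in types:
        if t == "NSM":
            t = previous
        resolved.append(t)
        previous = t
    return resolved

# The two examples from the quoted rule, with sor assumed to be R.
print(resolve_w1(["AL", "NSM", "NSM"], sor="R"))
print(resolve_w1(["NSM"], sor="R"))

# SPACE + COMBINING ACUTE ACCENT, with types taken from the character database.
space_acute_types = [unicodedata.bidirectional(c) for c in " \u0301"]
print(space_acute_types, "=>", resolve_w1(space_acute_types, sor="R"))
```

    In this sketch the mark simply takes over the whitespace type of SPACE, so nothing in W1 gives the isolated diacritic a direction of its own.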
    "L3. Combining marks applied to a right-to-left base character will at this point precede their base character. If the rendering engine expects them to follow the base characters in the final display process, then the ordering of the marks and the base character must be reversed."
    As SPACE is directionality neutral, the diacritic applied to it will also be directionality neutral, and will inherit the direction of the previous grapheme cluster. Other areas in UAX#9 covering joiners/disjoiners will also cause problems: how can we join/disjoin a spacing diacritic if it is encoded with a SPACE base character plus combining diacritics?
    Do I need to say more about this SPACE+diacritics legacy hack, and the many problems or non-interoperable solutions offered by various implementations to solve this problem?
    -- 
    Philippe.
    Spam not tolerated: any unsolicited message will be
    reported to your Internet service providers.
    

