Re: ZWJ, ZWNJ, CGJ and combination

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Mon Nov 10 2003 - 03:54:01 EST


    There's still a problem between these "clarified" definitions, introduced by D14:
    a.. "a combining character is a graphic character" means it must be a graphic character, and this excludes character category "Cf".
    b.. "Combining characters consist of all characters with the General Category values of Spacing Combining Mark (Mc), Non-Spacing Mark (Mn), and Enclosing Mark (Me)."

    Thanks, all graphic characters are now in a complete and exclusive partition as either base characters or combining characters. The whole set of Unicode code points is then partitioned into base characters, combining characters, and non-graphic characters (the last group includes noncharacters).
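
    A minimal sketch of that three-way partition, using Python's standard unicodedata module. The function name classify is only illustrative, and the treatment of Co as a base character assumes there is no private agreement (D13b, note c). Results depend on the Unicode version bundled with the Python build.

```python
import unicodedata

def classify(ch: str) -> str:
    """Place one character in the partition implied by D13a, D13b and D14."""
    gc = unicodedata.category(ch)
    if gc[0] == "M":
        # Mn, Mc, Me: combining characters (D14)
        return "combining"
    if gc[0] in "LNPS" or gc == "Zs":
        # Letter, Number, Punctuation, Symbol, Space Separator (D13b)
        return "base"
    if gc == "Co":
        # Assumption: no private agreement, so private use counts as base
        return "base"
    # Zl, Zp, Cc, Cf, Cs, Cn (including noncharacters)
    return "non-graphic"
```

    For instance, classify("a") falls in the base group, classify("\u0301") in the combining group, and classify("\u200d") (ZWJ) in the non-graphic group.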

    As a combining character sequence is made only of an optional base character and combining characters, it must then include only graphic characters, and so all non-graphic characters ("gc=C*", except "gc=Co" without private agreement) are excluded from all occurrences of combining character sequences.
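
    The consequence can be checked with a short sketch. It assumes the reading used here: a combining character sequence is an optional base character followed by at least one combining character. The names kind and is_combining_sequence are hypothetical helpers, not anything defined by the standard.

```python
import unicodedata

def kind(ch: str) -> str:
    """Base / combining / non-graphic, following D13a, D13b and D14."""
    gc = unicodedata.category(ch)
    if gc.startswith("M"):
        return "combining"
    # Assumption: Co counts as base in the absence of private agreement
    if gc[0] in "LNPS" or gc in ("Zs", "Co"):
        return "base"
    return "non-graphic"

def is_combining_sequence(s: str) -> bool:
    """True if s is an optional base followed by one or more combining marks."""
    kinds = [kind(ch) for ch in s]
    if kinds and kinds[0] == "base":
        kinds = kinds[1:]          # drop the optional leading base
    # What remains must be non-empty and consist only of combining marks
    return len(kinds) > 0 and all(k == "combining" for k in kinds)
```

    Under this reading, "e\u034f\u0301" (e, CGJ, combining acute) passes the check, while "e\u200d\u0301" (e, ZWJ, combining acute) does not, because ZWJ is Cf.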

    ZWJ/ZWNJ are then excluded from any combining character sequence. But not CGJ as it is a combining character (Mn) and thus a graphic character.
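
    The properties behind this can be looked up directly. The sketch below reads the General Category and canonical combining class of the three characters from Python's unicodedata module; the dictionary name report is only illustrative.

```python
import unicodedata

# ZWJ, ZWNJ and CGJ by code point
CODE_POINTS = (0x200D, 0x200C, 0x034F)

# Map each code point to (General Category, canonical combining class)
report = {
    cp: (unicodedata.category(chr(cp)), unicodedata.combining(chr(cp)))
    for cp in CODE_POINTS
}

for cp, (gc, ccc) in report.items():
    print(f"U+{cp:04X} {unicodedata.name(chr(cp))}: gc={gc} ccc={ccc}")
```

    ZWJ and ZWNJ report gc=Cf, and CGJ reports gc=Mn with a combining class of 0, so CGJ is also one of the combining characters with a zero canonical combining class mentioned in D14.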

    The problem:
      a.. CGJ, because it is now "clearly" a graphic character, should not be excluded from having both graphic behavior (needed for Hebrew) and semantics (so it affects collation and text transformations such as case mappings or other foldings), even though it is invisible and has no associated glyph (in D13a point 3: "Not all graphic characters have visibly rendered glyphs. Particular examples include spaces and some combining marks.")...
      b.. ZWJ and ZWNJ (Cf) are not graphic characters, but the way they are used in Khmer does not obey these definitions, since they participate in combining character sequences...

      ----- Original Message -----
      From: Mark Davis
      To: Peter Kirk ; Unicode List
      Sent: Sunday, November 09, 2003 12:52 AM
      Subject: Re: ZWJ, ZWNJ, CGJ and combination

      The UTC just approved a clarification of the base character definition, as follows:

      D13a Graphic character: a character with the General Categories of Letter (L), Combining Mark (M), Number (N), Punctuation (P), Symbol (S), or Space Separator (Zs).

        a.. Graphic characters specifically exclude the line and paragraph separators (Zl, Zp) and exclude the characters with the General Categories of Other (Cn, Cs, Cc, Cf).
        b.. For more information, see Chapter 2, especially Section 2.4 Code Points and Characters and Table 2-2 Types of Code Points.
        c.. Not all graphic characters have visibly rendered glyphs. Particular examples include spaces and some combining marks.
        d.. The interpretation of private use characters (Co) as graphic characters or not is determined by private agreement. However, in the absence of private agreement, private use characters should be interpreted as graphic characters.
      D13b Base character: any graphic character except for those with the General Category of Combining Mark (M).

        a.. Most Unicode characters are base characters. A base character is any code point that has one of the General Categories of Letter (L), Number (N), Punctuation (P), Symbol (S), or Space Separator (Zs).
        b.. Base characters are independent graphic characters, but this does not preclude the presentation of base characters from adopting different contextual forms or participating in ligatures.
        c.. The interpretation of private use characters (Co) as base characters or not is determined by private agreement. However, in the absence of private agreement, private use characters should be interpreted as base characters.
      D14 Combining character: a graphic character with the General Category of Combining Mark (M).

        a.. The graphic positioning of a combining character depends on the last preceding base character. The combining character is said to apply to that base character.
        b.. Combining characters consist of all characters with the General Category values of Spacing Combining Mark (Mc), Non-Spacing Mark (Mn), and Enclosing Mark (Me).
        c.. All characters with non-zero canonical combining class (ccc) are combining characters, but the reverse is not the case: there are combining characters with a zero canonical combining class.
        d.. The interpretation of Private Use characters (Co) as combining characters or not is determined by private agreement.

      Mark
      __________________________________
      http://www.macchiato.com
      ► शिष्यादिच्छेत्पराजयम् ◄
       
      ----- Original Message -----
      From: "Peter Kirk" <peterkirk@qaya.org>
      To: "Unicode List" <unicode@unicode.org>
      Sent: Sat, 2003 Nov 08 11:58
      Subject: ZWJ, ZWNJ, CGJ and combination

    > Are the characters ZWJ, ZWNJ and CGJ base characters, combining
    > characters, neither, or even both? Which specific character properties
    > should I look at to decide this?
    >
    > Are these characters legal within combining character sequences? Can ZWJ
    > and ZWNJ be used to control ligation of combining characters? If not, is
    > there an alternative mechanism for this?
    >
    >
    > --
    > Peter Kirk
    > peter@qaya.org (personal)
    > peterkirk@qaya.org (work)
    > http://www.qaya.org/



    This archive was generated by hypermail 2.1.5 : Mon Nov 10 2003 - 04:43:36 EST