Re: [hebrew] Re: ZWJ, ZWNJ, CGJ and combination

From: Mark Davis (mark.davis@jtcsv.com)
Date: Sat Nov 08 2003 - 23:29:44 EST


    You are stating many things as if they were facts, when they are simply not
    true. You should verify them against the definitions before stating them in such
    a 'definitive' way.

    Examples:
    - VS1 is a combining character, and not a base character.
    http://oss.software.ibm.com/cgi-bin/icu/ub/utf-8/?ch=FE00

    - Default grapheme clusters do not include ZWJ; as a matter of fact, default
    grapheme clusters, except for Hangul Jamo Syllables and a few exceptional cases,
    are identical with combining sequences.
    http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries

    - *Tailored* grapheme clusters may include longer sequences, but it is not at
    all obvious whether they would ever contain ZWJ or ZWNJ.

    >...rendering of text works on grapheme clusters
    - Rendering units are, in general, orthogonal to whether a sequence is a
    grapheme cluster or not. "fi" may be a ligature in English, but is certainly not
    a grapheme cluster.
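
The first point above can be checked mechanically. What follows is only a minimal sketch, assuming Python's standard unicodedata module; its data comes from whatever Unicode version ships with the interpreter, not from the 2003 data files, and the helper name describe is made up for illustration.

```python
import unicodedata

def describe(ch):
    """Return (name, General_Category, Canonical_Combining_Class) for one character."""
    return (
        unicodedata.name(ch, "<unnamed>"),
        unicodedata.category(ch),
        unicodedata.combining(ch),
    )

# VS1, CGJ and ZWJ, the three characters under discussion in this thread.
for ch in ("\uFE00", "\u034F", "\u200D"):
    name, gc, ccc = describe(ch)
    print(f"U+{ord(ch):04X} {name}: gc={gc} ccc={ccc}")
```

A General_Category starting with "M" marks a combining character even when the combining class is 0, which is the case for VS1.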

    Mark
    __________________________________
    http://www.macchiato.com
    ► शिष्यादिच्छेत्पराजयम् ◄

    ----- Original Message -----
    From: "Philippe Verdy" <verdy_p@wanadoo.fr>
    To: "Peter Kirk" <peterkirk@qaya.org>
    Cc: <unicode@unicode.org>
    Sent: Sat, 2003 Nov 08 17:15
    Subject: Re: [hebrew] Re: ZWJ, ZWNJ, CGJ and combination

    > I'm curious about what name you would give to it.
    > The name COMBINING CHARACTER JOINER is already used...
    >
    > In all our discussions we should have used the term "starter" (instead of
    > just "base character" which is ambiguous) for any characters of combining
    > class 0 and which include:
    >
    > Base characters (includes conjoining characters):
    > letter, syllable or ideograph (gc=L*),
    > number (gc=N*),
    > punctuation (gc=P*),
    > symbol (gc=S*),
    > space (gc=Zs)
    > agreed private use characters (gc=Co and private agreement)
    > Starter Combining characters:
    > (gc=M* and CC=0) such as CGJ
    > Controls:
    > (gc=C* except Co),
    > Text separators:
    > (gc=Zl, Zp)
    > Unknown private use characters:
    > (gc=Co and no private agreement)
    >
    > For other characters with combining class > 0, we should have used the term
    > "non-starter", not the term "combining character" which may or may not be a
    > "starter".
    >
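
The starter / non-starter distinction proposed in the quoted text depends only on the canonical combining class, so it can be sketched in a few lines. This assumes Python's standard unicodedata module (with the interpreter's Unicode version); the function names is_starter and is_combining_mark are hypothetical helpers, not terms from the standard.

```python
import unicodedata

def is_starter(ch):
    """True when the canonical combining class is 0."""
    return unicodedata.combining(ch) == 0

def is_combining_mark(ch):
    """True when the General_Category is Mn, Mc or Me."""
    return unicodedata.category(ch).startswith("M")

samples = {
    "ALEF": "\u05D0",   # letter: starter, not a mark
    "HOLAM": "\u05B9",  # Hebrew point: non-starter mark
    "CGJ": "\u034F",    # mark that is nevertheless a starter
    "ZWJ": "\u200D",    # format control: starter, not a mark
}
for label, ch in samples.items():
    print(f"{label}: starter={is_starter(ch)} mark={is_combining_mark(ch)}")
```

The two predicates are independent of each other, which is why "combining character" and "non-starter" cannot be used interchangeably.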
    > It is clear however that we made a distinction between "combining sequences"
    > (made of a unique starter and optionally followed by non-starters) and
    > "grapheme clusters" (which are made of one or more combining sequences). For
    > example, the (hypothetical) encoded text:
    >
    > <ALEF, ZWJ, LAMED, VAV, VS1, HOLAM, NUN, METEG, CGJ, HATAF PATAH>
    >
    > is made of 7 "combining sequences":
    >
    > <ALEF>,
    > <ZWJ>,
    > <LAMED>,
    > <VAV>,
    > <VS1, HOLAM>,
    > <NUN, HATAF PATAH>,
    > <CGJ, METEG>
    >
    > (where the starters are VAV, VS1, NUN, CGJ),
    > and 3 "grapheme clusters":
    >
    > <ALEF, ZWJ, LAMED>,
    > <VAV, VS1, HOLAM>,
    > <NUN, HATAF PATAH, CGJ, METEG>
    >
    > (ZWJ is a format control and ignored in the determination of grapheme
    > cluster boundaries).
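
For anyone who wants to experiment with the example string, here is a rough sketch that splits it in encoded order using the rule as stated in the quoted text (one starter followed by any non-starters). That rule is the quoted message's working definition, which the reply at the top disputes; it is not the standard's definition of a combining character sequence, and it says nothing about grapheme clusters. It assumes Python's unicodedata module, and split_on_starters is a made-up name.

```python
import unicodedata

# <ALEF, ZWJ, LAMED, VAV, VS1, HOLAM, NUN, METEG, CGJ, HATAF PATAH>
TEXT = "\u05D0\u200D\u05DC\u05D5\uFE00\u05B9\u05E0\u05BD\u034F\u05B2"

def split_on_starters(text):
    """Start a new piece at every character whose combining class is 0."""
    pieces = []
    for ch in text:
        if unicodedata.combining(ch) == 0 or not pieces:
            pieces.append(ch)
        else:
            pieces[-1] += ch
    return pieces

for piece in split_on_starters(TEXT):
    print(" + ".join(f"U+{ord(c):04X}" for c in piece))
```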
    >
    > Grapheme clusters may be created by grouping several combining sequences
    > without using CGJ, ZWJ, ZWNJ, or variant selectors: see examples in South
    > Asian scripts, and with Hangul Jamos.
    >
    > Generally, collation and rendering of text works on grapheme clusters (or
    > groups of these clusters with language-specific tailoring); but not on
    > combining sequences whose role is either related to string identity
    > excluding any concept of relative order (i.e. normalization and canonical
    > equivalence), or to text transforms or folding.
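
The link between combining classes and normalization mentioned here can be observed directly. The sketch below, assuming Python's unicodedata.normalize, compares NFD output for a NUN carrying METEG and HATAF PATAH with and without an intervening CGJ; the variable names are illustrative only.

```python
import unicodedata

NUN, METEG, CGJ, HATAF_PATAH = "\u05E0", "\u05BD", "\u034F", "\u05B2"

def nfd_codepoints(text):
    """Code points of the NFD form, as U+XXXX strings."""
    return [f"U+{ord(c):04X}" for c in unicodedata.normalize("NFD", text)]

without_cgj = NUN + METEG + HATAF_PATAH
with_cgj = NUN + METEG + CGJ + HATAF_PATAH

# Canonical ordering sorts adjacent non-starters by combining class, so
# HATAF PATAH (lower class) moves in front of METEG when nothing separates them.
print(nfd_codepoints(without_cgj))

# CGJ has combining class 0, so the marks on either side of it are not reordered.
print(nfd_codepoints(with_cgj))
```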
    >
    > Compatibility equivalence is also defined but neither on combining
    > sequences, nor on grapheme clusters: there may be a mapping from one
    > character (i.e. only a part of a combining sequence) to several characters
    > that belong to distinct combining sequences and distinct grapheme clusters,
    > for example with some ligatures of base letters (example: the "ffi"
    > ligature, which participates in only 1 combining sequence and only 1
    > grapheme cluster, is mapped to 3 distinct combining sequences and 3 distinct
    > grapheme clusters).
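
The "ffi" case is easy to reproduce. A small sketch, again assuming Python's unicodedata module: the ligature U+FB03 has only a compatibility decomposition, so the canonical form NFD leaves it alone while NFKD expands it to three letters.

```python
import unicodedata

FFI = "\uFB03"  # LATIN SMALL LIGATURE FFI

# The decomposition field carries the <compat> tag and three code points.
print(unicodedata.decomposition(FFI))

nfd = unicodedata.normalize("NFD", FFI)
nfkd = unicodedata.normalize("NFKD", FFI)
print(len(nfd), len(nfkd))
```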
    >
    > ----- Original Message -----
    > From: "Peter Kirk" <peterkirk@qaya.org>
    > To: <hebrew@unicode.org>
    > Sent: Sunday, November 09, 2003 1:20 AM
    > Subject: [hebrew] Re: ZWJ, ZWNJ, CGJ and combination
    >
    >
    > > So that you don't try to hold your breath over the weekend to find out
    > > what I am planning to propose, as announced on the main Unicode list...
    > >
    > > The issue in question is the ligation of hataf vowels and meteg. Hataf
    > > vowels with medial meteg are clear cases of ligatures between the basic
    > > vowels and meteg. But there seems to be no mechanism in Unicode so far
    > > to promote such a ligature. So, my suggestion is to propose a new
    > > combining character COMBINING CHARACTER JOINER (combining class zero),
    > > defined with semantics similar to ZWJ rather than CGJ i.e. to affect
    > > ligation but not collation.
    > >
    > > Comments?
    > >
    > > --
    > > Peter Kirk
    > > peter@qaya.org (personal)
    > > peterkirk@qaya.org (work)
    > > http://www.qaya.org/


