Re: ch ligature in a monospace font from Philippe Verdy on 2011-06-30 (Unicode Mail List Archive)

From: Philippe Verdy <verdy_p_at_wanadoo.fr>
Date: Fri, 1 Jul 2011 04:22:59 +0200

2011/7/1 Richard Wordingham <richard.wordingham_at_ntlworld.com>:
> On Fri, 1 Jul 2011 01:57:46 +0200
> Philippe Verdy <verdy_p_at_wanadoo.fr> wrote:
>
>> CGJ is NOT made to create (or even hint) ligatures ; and certainly not
>> in this context.
>
> Its main purpose is to indicate that a sequence of characters do
> not form a collating unit. However, if one is using a 'monospace' font
> to space 'letters' uniformly, i.e. to space collating sequences evenly,
> then I suggest it is the appropriate character.

Its main use is in fact to prevent reordering of otherwise canonically
equivalent sequences involving combining characters. CGJ is part of
the SAME grapheme cluster as the previous character (and possibly the
following combining characters). As grapheme clusters should not be
broken in the middle by collating elements, this clearly indicates
that CGJ is not the appropriate character, because the collating
elements would be <C,CGJ> then <H> separately.

CGJ however MAY have a visual impact on the rendering (notably because
it helps fix the relative order of sequences of combining characters
with non-zero combining classes, exactly because these combining
characters may be positioned relative to each other). But here I
don't see any problem of relative ordering.

All that is suggested is to indicate the desired or undesired ligature
by some joining character between existing graphemes, without
reordering them, with sequences such as <C,joiner,H> or
<C,joiner,APOS,joiner,H>. The joiner (or disjoiner) here should
clearly be ZWJ (or ZWNJ respectively). The APOS here may be the
ambiguous ASCII vertical quote ('), or preferably the right curly
quote (’).
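
To make the suggested sequences concrete, here is a minimal Python
sketch that builds them and lists the properties of each code point,
using only the standard unicodedata module. The variable and function
names are illustrative assumptions, not part of any proposal, and
whether a given font actually forms the ligature is a separate matter.

```python
import unicodedata

ZWJ = "\u200d"   # ZERO WIDTH JOINER
ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER
CGJ = "\u034f"   # COMBINING GRAPHEME JOINER
APOS = "\u2019"  # RIGHT SINGLE QUOTATION MARK (the preferred apostrophe)

# Sequences of the form <C,joiner,H> and <C,joiner,APOS,joiner,H>.
ch_joined = "c" + ZWJ + "h"
ch_split = "c" + ZWNJ + "h"
c_apos_h_joined = "c" + ZWJ + APOS + ZWJ + "h"


def describe(text):
    """Return (code point, name, general category, combining class) per character."""
    return [
        (f"U+{ord(c):04X}", unicodedata.name(c),
         unicodedata.category(c), unicodedata.combining(c))
        for c in text
    ]


if __name__ == "__main__":
    for label, seq in [("ch joined", ch_joined),
                       ("ch split", ch_split),
                       ("c'h joined", c_apos_h_joined),
                       ("CGJ alone", CGJ)]:
        print(label)
        for row in describe(seq):
            print("   ", row)
```

The listing shows ZWJ and ZWNJ as format characters (category Cf),
while CGJ is a nonspacing mark (category Mn) whose combining class is
zero.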

And you said it yourself : its purpose (if it is used between two
grapheme clusters, i.e. just before a start character of the second
grapheme cluster, and not before a combining character with non-zero
combining class) "is to indicate that a sequence of characters do not
form a collating unit". In other words, <C,CGJ,H> or
<C,CGJ,APOS,CGJ,H> would not form the single collating elements really
needed for Breton.
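
The difference can be sketched with a toy model in Python. This is
NOT the Unicode Collation Algorithm, only an illustration of the
behaviour described in this thread, under two assumptions taken from
it: ZWJ and ZWNJ are ignored before matching, and CGJ blocks the
contraction. The function name and the two-entry contraction list are
hypothetical.

```python
ZWJ, ZWNJ, CGJ = "\u200d", "\u200c", "\u034f"
CURLY_APOS = "\u2019"

# Hypothetical contraction list for the Breton digram and trigram,
# longest first so that "c'h" wins over a shorter match.
CONTRACTIONS = ("c'h", "ch")


def collating_elements(text):
    """Split text into toy collating elements.

    Assumptions (from the discussion, not from the UCA specification):
    ZWJ/ZWNJ are dropped before matching; CGJ carries no weight of its
    own but interrupts any contraction it sits inside.
    """
    cleaned = text.lower().replace(ZWJ, "").replace(ZWNJ, "")
    cleaned = cleaned.replace(CURLY_APOS, "'")
    elements = []
    i = 0
    while i < len(cleaned):
        if cleaned[i] == CGJ:
            i += 1  # no element emitted; the match was already broken
            continue
        for contraction in CONTRACTIONS:
            if cleaned.startswith(contraction, i):
                elements.append(contraction)
                i += len(contraction)
                break
        else:
            elements.append(cleaned[i])
            i += 1
    return elements


if __name__ == "__main__":
    for seq in ("c" + ZWJ + "h",
                "c" + CGJ + "h",
                "c" + ZWJ + CURLY_APOS + ZWJ + "h",
                "c" + CGJ + CURLY_APOS + CGJ + "h"):
        print([f"U+{ord(c):04X}" for c in seq], "->", collating_elements(seq))
```

In this model the ZWJ sequences come out as the single elements "ch"
and "c'h", while the CGJ sequences fall apart into separate letters.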

Conversely, <G,CIRCUMFLEX,CGJ,CEDILLA> would collate as two
collating elements, <G,CIRCUMFLEX> and then <CEDILLA>, whereas
<G,CIRCUMFLEX,CEDILLA> would collate like <G,CEDILLA> followed by
<CIRCUMFLEX>. A CGJ used between combining characters prevents their
implicit reordering, so it blocks the canonical equivalence that would
otherwise hold when the two diacritics are swapped in the encoded text
(you won't see the difference with uppercase 'G' but you'll see the
difference with the lowercase 'g').
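
The reordering can be checked with Python's standard unicodedata
module. This is a small sketch using the combining circumflex (U+0302)
and combining cedilla (U+0327) as the two diacritics; the variable
names are illustrative only.

```python
import unicodedata

CIRCUMFLEX = "\u0302"  # COMBINING CIRCUMFLEX ACCENT
CEDILLA = "\u0327"     # COMBINING CEDILLA
CGJ = "\u034f"         # COMBINING GRAPHEME JOINER

plain = "g" + CIRCUMFLEX + CEDILLA          # <G,CIRCUMFLEX,CEDILLA>
swapped = "g" + CEDILLA + CIRCUMFLEX        # <G,CEDILLA,CIRCUMFLEX>
guarded = "g" + CIRCUMFLEX + CGJ + CEDILLA  # <G,CIRCUMFLEX,CGJ,CEDILLA>


def nfd(text):
    """Canonical decomposition, including canonical reordering of marks."""
    return unicodedata.normalize("NFD", text)


def code_points(text):
    return " ".join(f"U+{ord(c):04X}" for c in text)


if __name__ == "__main__":
    for mark in (CIRCUMFLEX, CEDILLA, CGJ):
        print(unicodedata.name(mark), "ccc =", unicodedata.combining(mark))
    for label, seq in (("plain", plain), ("swapped", swapped),
                       ("guarded", guarded)):
        print(label, ":", code_points(seq), "-> NFD:", code_points(nfd(seq)))
```

Without CGJ the cedilla (the lower combining class) is moved in front
of the circumflex, so the plain and swapped sequences normalize to the
same string; with CGJ in between, the original order survives.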

This would make a difference if, for example, <G,CEDILLA> is tailored
for a language with a primary difference from <G> alone (i.e. treated
as if it were a distinct letter of the language's alphabet), instead
of a secondary difference (or because the cedilla associated with a
lowercase 'g' is normally rendered above the letter, looking like an
upper hook or some diacritic above 'g', instead of below it where the
leg of 'g' is placed: stacking diacritics below is usually avoided in
the Latin script).

You could also represent distinctly <G,CEDILLA> and <G,CGJ,CEDILLA> to
indicate that the implicit moving of the attached CEDILLA to the upper
position should not occur in the second case, meaning that the cedilla
would still attach below the leg.

You clearly need to distinguish the uses of CGJ based on the type of
character that follows it, i.e. combining or not:

- (1) But the "main" usage of CGJ is the first usage case, before a
combining character, that is part of the *same* grapheme cluster as
the character before CGJ.

- (2) The second usage case is certainly much more tricky (it could
potentially play a role much like the Indic Halant/Virama between two
base Indic letters, altering the letter form of the first one, such
as adopting a half-form; in the Latin script, this could remove the
gap following the first letter, such as attaching both the lower and
upper arms of a letter 'C', which would lose much of its visual
identity, something distinct from a simple ligature with the
following letter). For now I've not seen any justification for such
usage of CGJ in the Latin script, because this sort of special
attachment of letter pairs is preferably encoded by a new character
assignment for the combination (distinct from a simple ligature).

Clearly Breton does not need case (2) for CGJ: the suggested use was
with monospaced fonts, to allow the digrams or trigrams to be rendered
with narrower glyphs that fit a single character column and align
vertically in a console grid. The letters C,H (and the APOS) will
still be distinct, not even attached together, and their character
identity will be preserved in such rendering: this is a very minor
ligaturing adjustment for specific rendering cases, but not needed
for normal rendering of Breton, where ZWJ will be enough (and will
generate the correct sequences of collating elements). Case (1) for
CGJ is also not what is wanted here.

-- Philippe.

>> Both ZWJ and ZWNJ will be ignored in collation.
>
> Whereas one needs, in theory at least, something to distinguish
> accidental sequences <ch> from the digraph <ch> if one is to avoid
> having to use dictionaries to collate.
Received on Thu Jun 30 2011 - 21:27:58 CDT
