L2/03-236

Source: Ken Whistler
Date: August 5, 2003

Rick, Lisa, Cathy,

Here is the other email I'd like inserted into the L2 document
trail and added to the agenda under the Biblical Hebrew
discussion.

This one summarizes the reasons why CGJ itself is the correct
choice.

I'm submitting these emails because this topic has been undergoing
a very long and detailed thrashing on the unicoDe list, rather
than the unicoRe list. Some UTC members might not be aware
of the significant technical discussion and its architectural
implications there, but the discussion had to occur on the
open list because of the importance of the input from the
various Biblical Hebrew scholars concerned with the implementation
issues.

--Ken

------------- Begin Forwarded Message -------------

Date: Wed, 23 Jul 2003 15:07:12 -0700 (PDT)
Subject: Re: Yerushala(y)im - or Biblical Hebrew
To: peter.r.kirk@ntlworld.com
Cc: unicode@unicode.org, kenw@sybase.com
X-archive-position: 7337
X-list: unicode

Peter Kirk cited Paul Nelson:


> On 23/07/2003 03:20, Paul Nelson (TYPOGRAPHY) wrote:
> 

> >Please look at the definition of CGJ and other such characters.
> >Understand the differences between CGJ and ZWJ/ZWNJ.
> >
> >This discussion is very disturbing to me because after reading through
> >the L2 document register it is unclear what is the difference between
> >CGJ and ZWJ use.


Things will get easier shortly when the full (final!) text of Unicode
4.0 is posted online. The relevant discussion is in Section 15.2
Layout Controls. Some excerpts:

===================================================================

U+200D ZERO WIDTH JOINER is intended to produce a more connected
rendering of adjacent characters than would otherwise be the case,
if possible. ...

U+200C ZERO WIDTH NON-JOINER is intended to break both cursive
connections and ligatures in rendering. ...

                                       -- TUS 4.0, p. 390
                                       
U+034F COMBINING GRAPHEME JOINER is used to indicate that adjacent
characters are to be treated as a unit for the purposes of
language-sensitive collation and searching. In language-sensitive
collation and searching, the combining grapheme joiner should be
ignored unless it specifically occurs within a tailored collation
element mapping. ...

For rendering, the combining grapheme joiner is invisible.
However, some older implementations may treat a sequence of grapheme
clusters linked by combining grapheme joiners as a single unit
for the application of enclosing combining marks. ...

The combining grapheme joiner must not be confused with the
zero width joiner or the word joiner, which have very different
functions. In particular, inserting a combining grapheme joiner
between two characters should have no effect on their ligation or
cursive joining behavior. ...

                                      -- TUS 4.0, p. 392
                                      
====================================================================


> >The fact that you desire a control character to not be treated as such
> >greatly concerns me. 


As Mark Davis pointed out, CGJ is *not* a control character, if
by control character is meant gc=Cc (the ISO control characters)
or gc=Cf (the Unicode format control characters). Its general
category is Mn (with cc=0), which makes it formally a *combining mark*,
not a control character.
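
These property values can be checked directly. Below is a minimal
sketch, assuming Python's standard unicodedata module; the helper name
describe() is only for illustration and is not part of any standard
API. The values reported depend on the Unicode data version bundled
with the interpreter.

```python
import unicodedata

CGJ = "\u034F"   # COMBINING GRAPHEME JOINER
ZWJ = "\u200D"   # ZERO WIDTH JOINER
ZWNJ = "\u200C"  # ZERO WIDTH NON-JOINER


def describe(ch):
    """Return (name, general category, canonical combining class)."""
    return (
        unicodedata.name(ch),
        unicodedata.category(ch),
        unicodedata.combining(ch),
    )


# Print one line per character so the properties can be compared.
for ch in (CGJ, ZWJ, ZWNJ):
    name, gc, ccc = describe(ch)
    print("U+%04X  gc=%s  cc=%d  %s" % (ord(ch), gc, ccc, name))
```

CGJ should report gc=Mn with cc=0, while ZWJ and ZWNJ should report
gc=Cf, which is the distinction drawn above.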


> >This really feels like people are trying to figure
> >out any way to twist existing constructs to avoid fixing the
> >normalization weights. I am alarmed from the implications of putting
> >control characters in place to somehow subvert the normalization.


There is no "subversion" of normalization involved here. Normalization
continues to work just as it always has, with no changes. There is
also no cause for alarm.

I have been talking about CGJ because someone had initially
suggested some kind of control character to adjust normalization
or modify combining classes (which *would* be alarming and perverse),
and then we cast around to figure out what would happen if
any of the existing format control characters (such as ZWJ or ZWNJ)
were inserted into these Hebrew vowel sequences.

As it turns out, CGJ is just the ticket, because:

  A. It is not a format control character, but a combining mark.
  
  B. It is defined *not* to influence the format of neighboring
     characters.
     
  C. It is, itself, invisible.
  
  D. It is already in the standard (since Unicode 3.2).
  
  E. It is defined, by default, to be ignored in searches --
     since it becomes significant in collation/searching only
     when tailored in combination with other characters.
     
  F. Its combining class is zero.
     
  G. And most importantly, when inserted between two Hebrew
     points in a sequence, it has precisely the required
     effects for normalized Hebrew text, enabling the preservation
     of point ordering distinctions in normalized contexts.
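
Points F and G can be seen in a minimal sketch, again assuming Python's
standard unicodedata module. The helper names nfc() and strip_cgj() are
hypothetical, and strip_cgj() is only a crude stand-in for the default
"ignored in searches" behavior of point E, not an implementation of
language-sensitive collation.

```python
import unicodedata

LAMED = "\u05DC"  # HEBREW LETTER LAMED
PATAH = "\u05B7"  # HEBREW POINT PATAH, combining class 17
HIRIQ = "\u05B4"  # HEBREW POINT HIRIQ, combining class 14
CGJ = "\u034F"    # COMBINING GRAPHEME JOINER, combining class 0


def nfc(s):
    """Normalize to NFC, which applies canonical reordering of marks."""
    return unicodedata.normalize("NFC", s)


def strip_cgj(s):
    """Hypothetical helper: drop CGJ before a naive string comparison."""
    return s.replace(CGJ, "")


plain = LAMED + PATAH + HIRIQ
joined = LAMED + PATAH + CGJ + HIRIQ

# Without CGJ, canonical ordering sorts the points by combining class,
# so hiriq (14) moves ahead of patah (17).
print(nfc(plain) == LAMED + HIRIQ + PATAH)

# With CGJ (class 0) between the points, neither can be reordered
# across it, so the intended order survives normalization.
print(nfc(joined) == joined)

# A search that ignores CGJ still matches the unmarked spelling.
print(strip_cgj(nfc(joined)) == plain)
```

All three comparisons are expected to print True. The sequence with CGJ
keeps its point order under any of the normalization forms because a
character with combining class zero blocks canonical reordering.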
     
 
> As for the details of CGJ, please tell me where I can find a detailed 
> definition, and where it is specifically stated that a *rendering 
> engine* is obliged to process this *internally* as a control character - 
> and what precisely it is supposed to do with it if it does.


There is no such obligation on a rendering engine.

And if the implementers of rendering engines will simply "paint"
instances of U+034F so that they become available to the font
side of the rendering equation, then it should be relatively
simple, as for the Biblical Hebrew point sequence cases, to
get the <lamed, patah, CGJ, hiriq> sequences to display properly.


> I am now 
> wondering if anyone understands what this character is supposed to be or 
> do. If this is not clearly defined anywhere, perhaps UTC needs to write 
> a clear definition. At least Ken Whistler seems to think that it is 
> appropriate for this use. 


Yes, I do -- as does Mark Davis.


> Meanwhile, if despite this CGJ is not in fact 
> appropriate for this function, maybe we should propose a new character 
> which does have the appropriate properties.


CGJ *does* have the appropriate properties. So proposing a new
character would simply postpone resolution of the problem for
Biblical Hebrew.

--Ken


------------- End Forwarded Message -------------