L2/03-236
Source: Ken Whistler
Date: August 5, 2003

Rick, Lisa, Cathy,

Here is the other email I'd like inserted into the L2 document trail
and added to the agenda under the Biblical Hebrew discussion. This
one summarizes the reasons why CGJ itself is the correct choice.

I'm submitting these emails because this topic has been undergoing
a very long and detailed thrashing on the unicoDe list, rather than
the unicoRe list. Some UTC members might not be aware of the
significant technical discussion and its architectural implications
there, but the discussion had to occur on the open list because of
the importance of the input from the various Biblical Hebrew scholars
concerned with the implementation issues.

--Ken

------------- Begin Forwarded Message -------------

Date: Wed, 23 Jul 2003 15:07:12 -0700 (PDT)
Subject: Re: Yerushala(y)im - or Biblical Hebrew
To: peter.r.kirk@ntlworld.com
Cc: unicode@unicode.org, kenw@sybase.com
X-archive-position: 7337
X-list: unicode

Peter Kirk cited Paul Nelson:

> On 23/07/2003 03:20, Paul Nelson (TYPOGRAPHY) wrote:
>
> >Please look at the definition of CGJ and other such characters.
> >Understand the differences between CGJ and ZWJ/ZWNJ.
> >
> >This discussion is very disturbing to me because after reading through
> >the L2 document register it is unclear what is the difference between
> >CGJ and ZWJ use.

Things will get easier shortly when the full (final!) text of
Unicode 4.0 is posted online. The relevant discussion is in
Section 15.2 Layout Controls. Some excerpts:

===================================================================

U+200D ZERO WIDTH JOINER is intended to produce a more connected
rendering of adjacent characters than would otherwise be the case,
if possible. ...

U+200C ZERO WIDTH NON-JOINER is intended to break both cursive
connections and ligatures in rendering. ...

-- TUS 4.0, p. 390

U+034F COMBINING GRAPHEME JOINER is used to indicate that adjacent
characters are to be treated as a unit for the purposes of
language-sensitive collation and searching. In language-sensitive
collation and searching, the combining grapheme joiner should be
ignored unless it specifically occurs within a tailored collation
element mapping. ...

For rendering, the combining grapheme joiner is invisible. However,
some older implementations may treat a sequence of grapheme clusters
linked by combining grapheme joiners as a single unit for the
application of enclosing combining marks. ...

The combining grapheme joiner must not be confused with the zero
width joiner or the word joiner, which have very different functions.
In particular, inserting a combining grapheme joiner between two
characters should have no effect on their ligation or cursive
joining behavior. ...

-- TUS 4.0, p. 392

====================================================================

> >The fact that you desire a control character to not be treated as such
> >greatly concerns me.

As Mark Davis pointed out, CGJ is *not* a control character, if
by control character is meant gc=Cc (the ISO control characters)
or gc=Cf (the Unicode format control characters). Its general
category is Mn (with cc=0), which makes it formally a *combining
mark*, not a control character.

> >This really feels like people are trying to figure
> >out any way to twist existing constructs to avoid fixing the
> >normalization weights. I am alarmed from the implications of putting
> >control characters in place to somehow subvert the normalization.

There is no "subversion" of normalization involved here. Normalization
continues to work just as it always has, with no changes. There
is also no cause for alarm.
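The general category and combining class cited above for U+034F, and
the contrast with ZWJ and ZWNJ, can be checked against the Unicode
Character Database. The following is a minimal illustrative sketch
using Python's standard unicodedata module; the helper name describe
is an assumption made for this example only.

```python
import unicodedata

CGJ = "\u034F"   # COMBINING GRAPHEME JOINER
ZWJ = "\u200D"   # ZERO WIDTH JOINER
ZWNJ = "\u200C"  # ZERO WIDTH NON-JOINER


def describe(ch):
    """Return (name, general category, canonical combining class) for ch.

    Hypothetical helper for illustration; it only wraps unicodedata lookups.
    """
    return (
        unicodedata.name(ch),
        unicodedata.category(ch),
        unicodedata.combining(ch),
    )


for ch in (CGJ, ZWJ, ZWNJ):
    name, gc, ccc = describe(ch)
    # CGJ reports gc=Mn (a combining mark); ZWJ and ZWNJ report gc=Cf
    # (format controls). All three have combining class 0.
    print(f"U+{ord(ch):04X} {name}: gc={gc} ccc={ccc}")
```

The results depend on the Unicode data shipped with the Python
build in use, but these three properties have been stable since
the characters were encoded.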

I have been talking about CGJ because someone had initially suggested
some kind of control character to adjust normalization or modify
combining classes (which *would* be alarming and perverse), and
then we cast around to figure out what would happen if any of the
existing format control characters (such as ZWJ or ZWNJ) was inserted
into these Hebrew vowel sequences.

As it turns out, CGJ is just the ticket, because:

A. It is not a format control character, but a combining mark.

B. It is defined *not* to influence the format of neighboring
   characters.

C. It is, itself, invisible.

D. It is already in the standard (since Unicode 3.2).

E. It is defined, by default, to be ignored in searches -- since
   it becomes significant in collation/searching only when tailored
   in combination with other characters.

F. Its combining class is zero.

G. And most importantly, when inserted between two Hebrew points
   in a sequence, it has precisely the required effects for
   normalized Hebrew text, enabling the preservation of point
   ordering distinctions in normalized contexts.

> As for the details of CGJ, please tell me where I can find a detailed
> definition, and where it is specifically stated that a *rendering
> engine* is obliged to process this *internally* as a control character -
> and what precisely it is supposed to do with it if it does.

There is no such obligation on a rendering engine. And if the
implementers of rendering engines will simply "paint" instances
of U+034F so that they become available to the font side of the
rendering equation, then it should be relatively simple, as for
the Biblical Hebrew point sequence cases, to get the sequences
to display properly.

> I am now
> wondering if anyone understands what this character is supposed to be or
> do. If this is not clearly defined anywhere, perhaps UTC needs to write
> a clear definition. At least Ken Whistler seems to think that it is
> appropriate for this use.

Yes, I do -- as does Mark Davis.
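
Points F and G in the list above can be seen directly by normalizing
a short sequence. The sketch below is illustrative only: the choice
of LAMED followed by PATAH and then HIRIQ is an assumed example of
a Biblical Hebrew point sequence whose order matters, and the names
plain, with_cgj, and survives_normalization are hypothetical.

```python
import unicodedata

LAMED = "\u05DC"  # HEBREW LETTER LAMED, combining class 0
PATAH = "\u05B7"  # HEBREW POINT PATAH, combining class 17
HIRIQ = "\u05B4"  # HEBREW POINT HIRIQ, combining class 14
CGJ = "\u034F"    # COMBINING GRAPHEME JOINER, combining class 0


def survives_normalization(text, form="NFC"):
    """True if normalizing text to the given form leaves it unchanged."""
    return unicodedata.normalize(form, text) == text


# Assumed example: the scribe's intended order is PATAH, then HIRIQ.
plain = LAMED + PATAH + HIRIQ
with_cgj = LAMED + PATAH + CGJ + HIRIQ

# Without CGJ, canonical reordering sorts the two points by combining
# class, so HIRIQ (14) moves ahead of PATAH (17).
assert unicodedata.normalize("NFC", plain) == LAMED + HIRIQ + PATAH

# With CGJ between them, the class-0 character blocks the reordering,
# so the original order is preserved in every normalization form.
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    assert survives_normalization(with_cgj, form)
```

Nothing about normalization itself changes here. The reordering
rule applies only to runs of marks with non-zero combining classes,
and a class-zero character between the two points splits them into
separate runs.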

> Meanwhile, if despite this CGJ is not in fact
> appropriate for this function, maybe we should propose a new character
> which does have the appropriate properties.

CGJ *does* have the appropriate properties. So proposing a new
character would simply postpone resolution of the problem for
Biblical Hebrew.

--Ken

------------- End Forwarded Message -------------