From: Kenneth Whistler (kenw@sybase.com)
Date: Wed Jul 23 2003 - 18:54:05 EDT
> I have been doing a little research into the defined properties of CGJ. 
> I note also that according to 
> http://www.unicode.org/book/preview/ch03.pdf it is defined in Unicode 
> 4.0 as a "Default Ignorable". Well, I am not surprised that some people 
> are confused ...
Yes, I'm not surprised, either, because the whole philosophical
area of character "nothingness" is fraught with difficulties.
Particularly with Unicode, which has introduced many more kinds
of characters which aren't really there, or characters which
disappear when you look at them in a mirror ;-), it is rather
complex.
Consider all the following categories of "nothingness":
ISO Control (gc=Cc)
Unicode Format Control (gc=Cf)
Layout Control (gc=Cf, Zl, Zp, some Cc, and arguably, spaces)
Space (gc=Zs)
White_Space
Blank (of glyph)
Placeholder (e.g. U+FFFC OBJECT REPLACEMENT CHARACTER)
Default_Ignorable_Code_Point
They don't define all the same classes, and overlap in funny
ways, sometimes.
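As a rough illustration of how these classes cut across one another, here is a minimal sketch using Python's standard unicodedata module. It only reports the General_Category of a few sample characters. Properties such as White_Space and Default_Ignorable_Code_Point are not exposed by that module, so they are not shown. The sample list and the helper name general_categories are my own choices for illustration.

```python
import unicodedata

# A few characters that are "nothing" in different senses:
# TAB, SPACE, LINE SEPARATOR, PARAGRAPH SEPARATOR, ZWJ, CGJ,
# OBJECT REPLACEMENT CHARACTER.
SAMPLES = ["\u0009", "\u0020", "\u2028", "\u2029", "\u200D", "\u034F", "\uFFFC"]

def general_categories(chars):
    """Map each character to its General_Category (gc) value."""
    return {f"U+{ord(c):04X}": unicodedata.category(c) for c in chars}

cats = general_categories(SAMPLES)
for code_point, gc in cats.items():
    print(code_point, gc)
```

Note that CGJ comes out as a non-spacing mark (Mn) rather than a format control (Cf), which is why its default ignorable status cannot be read off the General_Category alone.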
> According to this, 
> "Default ignorable code points are those that should be ignored by 
> default in rendering (unless explicitly supported)... An implementation 
> should ignore default ignorable characters in rendering whenever it does 
> /not/ support the characters." So my suggestion that a renderer should 
> simply ignore CGJ is far from twisting the requirements of Unicode, it 
> is in fact a requirement of Unicode 4.0 though one that I am hardly 
> surprised that some people have missed.
Here is the wording from Unicode 4.0:
====================================================================
Default ignorable code points are those that should be ignored by
default in rendering unless explicitly supported. They have no
visible glyph or advance width in and of themselves, although they
may affect the display, positioning, or adornment of adjacent or
surrounding characters. ...
An implementation should ignore default ignorable characters in
rendering whenever it does *not* support the characters. ...
With default ignorable characters, such as U+200D ZERO WIDTH JOINER,
the situation is different [from the normal case where an unsupported
character would be displayed with a black box, for example]. If the
program does not support that character, the best practice is to
ignore it completely without displaying a last-resort glyph or
a visible box because the normal display of the character
is invisible: Its effects are on other characters. Because the
character is not supported, those effects cannot be shown.
                              -- TUS 4.0, p. 142.
                              
=====================================================================
This wording was, of course, written with such format controls
as ZWJ and ZWNJ in mind, which *do* have formatting effects
on adjacent characters. But the CGJ is also given the
Default_Ignorable_Code_Point property. In fact, in order to get
that (derived) property, it has to be *explicitly* given the
Other_Default_Ignorable_Code_Point property in PropList.txt,
since it, like the variation selectors, is gc=Mn (non-spacing
combining mark), a category that isn't automatically defined to be
default ignorable.
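The shape of that derivation can be sketched as follows. This is a simplified approximation, not the actual rule set: the real derivation in DerivedCoreProperties.txt has further inclusions and exclusions, and the set below is a hand-written, deliberately incomplete stand-in for Other_Default_Ignorable_Code_Point (CGJ plus the variation selectors U+FE00..U+FE0F). The function name is hypothetical.

```python
import unicodedata

# ASSUMPTION: hand-written, incomplete stand-in for the explicit list
# (Other_Default_Ignorable_Code_Point); not copied from PropList.txt.
EXPLICITLY_LISTED = {0x034F} | set(range(0xFE00, 0xFE10))

def is_default_ignorable_sketch(ch):
    """Approximate the derived property: explicit list, or gc=Cf.

    The real derivation has more inclusions and exclusions than this.
    """
    return ord(ch) in EXPLICITLY_LISTED or unicodedata.category(ch) == "Cf"

for ch in ["\u034F", "\u200D", "\u05B4", "a"]:
    print(f"U+{ord(ch):04X}", unicodedata.category(ch),
          is_default_ignorable_sketch(ch))
```

The point of the sketch is that CGJ and HEBREW POINT HIRIQ are both gc=Mn, yet only the one that is listed explicitly ends up default ignorable.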
Where the CGJ differs from the format controls (and the variation
selectors, for that matter) is that it is defined to have *no*
formatting effect on neighboring characters. So even if you
don't formally support it, you know that it shouldn't be having
any effect on the formatting of neighboring characters.
However, making it default ignorable is the right thing to do,
because it is itself always invisible for display. (Unless you
are doing a Show Hidden display, of course.)
 
> The internal process by which a particular renderer implements ignoring 
> a glyph is a matter for a particular implementation. John Hudson and I 
> have suggested a mechanism for doing this with Uniscribe by treating the 
> character internally as a normal character with a blank glyph and always 
> ligating it with the preceding character. There may be other mechanisms 
> which are cleaner. But in any case it seems to be a requirement not just 
> for fixing this Hebrew problem but for conformance with Unicode as a 
> whole that some such mechanism is implemented, so that CGJ is ignored by 
> the renderer unless some specific behaviour is defined.
Correct. And the difficulty seems to be in the interpretation of
what "ignored by the renderer" means and what obligations it
places on implementations. If "ignored by the renderer" is taken
as swallowed internally in the script logic and never presented
to the actual glyph display mechanism (i.e., never "paint" it),
then we run into the trouble that John Hudson has been
talking about for use of format controls. But if "ignored by
the renderer" is taken to mean doing no processing in the script logic
and instead just presenting it blindly to the actual glyph
display mechanism, where the fonts then deal with its default
ignorable status by rendering it with a non-advance, blank glyph
rather than the missing glyph box, then we are in a position to
have both the text processing requirements and the display
requirements for Biblical Hebrew neatly met.
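A toy model of that second interpretation might look like the sketch below. Everything in it is hypothetical: the Glyph type, the advance values, the tiny DEFAULT_IGNORABLE set, and the cmap dictionary are invented for illustration and do not describe Uniscribe or any real font format. The only behavior it is meant to show is that an unsupported default ignorable character stays in the glyph run as a blank, zero-advance glyph, while any other unsupported character gets the missing-glyph box.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Glyph:
    name: str
    advance: int  # horizontal advance in font units

MISSING_GLYPH = Glyph(".notdef", 600)  # the visible "box"
BLANK_GLYPH = Glyph("blank", 0)        # invisible, takes no space

# ASSUMPTION: tiny hand-picked subset (CGJ, ZWNJ, ZWJ) for the demo.
DEFAULT_IGNORABLE = {0x034F, 0x200C, 0x200D}

def glyph_for(ch, cmap):
    """Pick a glyph: the font's own, else blank if ignorable, else a box."""
    if ch in cmap:
        return cmap[ch]
    if ord(ch) in DEFAULT_IGNORABLE:
        return BLANK_GLYPH
    return MISSING_GLYPH

def shape(text, cmap):
    """Map every character to a glyph; nothing is swallowed beforehand."""
    return [glyph_for(ch, cmap) for ch in text]

toy_cmap = {"\u05DC": Glyph("lamed", 500)}  # hypothetical one-glyph font
run = shape("\u05DC\u034FA", toy_cmap)
print([g.name for g in run], sum(g.advance for g in run))
```

Because the CGJ is still present in the run, later stages (or a font that does define behavior for it) can see it, yet by default it paints nothing and takes no space.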
And the bonus is this: any other case of mismatch between
required distinctions for ordering of combining marks for
any script, where normalization of the text would result in
collapse of distinctions or unexpected order, can *also*
be dealt with by the same use of CGJ. No special cases are
required, no new characters are required, and no change
to any properties is required.
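The blocking effect can be checked with Python's standard unicodedata module. The sketch below uses a lamed followed by patah and then hiriq, as an example of the kind of sequence at issue. The variable names are mine, and the results depend on the Unicode data shipped with the Python build, although the combining classes involved are covered by the normalization stability policy.

```python
import unicodedata

LAMED, PATAH, HIRIQ, CGJ = "\u05DC", "\u05B7", "\u05B4", "\u034F"

# Canonical combining classes: patah is 17, hiriq is 14, CGJ is 0.
print(unicodedata.combining(PATAH),
      unicodedata.combining(HIRIQ),
      unicodedata.combining(CGJ))

plain = LAMED + PATAH + HIRIQ           # intended order: patah, then hiriq
with_cgj = LAMED + PATAH + CGJ + HIRIQ  # same, with CGJ between the points

nfc_plain = unicodedata.normalize("NFC", plain)
nfc_cgj = unicodedata.normalize("NFC", with_cgj)

# Without CGJ the two points are reordered by combining class.
print([f"U+{ord(c):04X}" for c in nfc_plain])
# With CGJ (class 0) in between, reordering cannot cross it.
print([f"U+{ord(c):04X}" for c in nfc_cgj])
```

Since the CGJ has combining class 0, canonical reordering never moves a mark across it, so the patah-before-hiriq order survives normalization.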
>  In the case of 
> rendering Hebrew, there seems to be no pressing need to define specific 
> behaviour as the default is at least close to what is required.
Exactly. And frankly, I am finding it difficult to understand
why people are characterizing the CGJ proposal as a kludge
or an ugly hack. It strikes me as a rather elegant way of
resolving the problem -- using existing encoded characters and
existing defined behavior.
And as Peter Kirk pointed out, in the main Unicode electronic
corpus in question, the *data* fix involved for this is
insertion of CGJ in 367 instances of Yerushala(y)im plus a
smattering of other places. That is *way* less disruptive
than the proposal to replace all of the Hebrew points with cloned
code points. It is *way* *way* *way* less disruptive than the
impact of destabilizing normalization by trying to change the
combining classes. And it is far more elegant than trying to
catalog and encode Hebrew point combinations as separate
characters.
--Ken