Re: Yerushala(y)im - or Biblical Hebrew

From: Kenneth Whistler (kenw@sybase.com)
Date: Wed Jul 23 2003 - 18:54:05 EDT

    > I have been doing a little research into the defined properties of CGJ.
    > I note also that according to
    > http://www.unicode.org/book/preview/ch03.pdf it is defined in Unicode
    > 4.0 as a "Default Ignorable". Well, I am not surprised that some people
    > are confused ...

    Yes, I'm not surprised, either, because the whole philosophical
    area of character "nothingness" is fraught with difficulties.
    Particularly with Unicode, which has introduced many more kinds
    of characters which aren't really there, or characters which
    disappear when you look at them in a mirror ;-), it is rather
    complex.

    Consider all the following categories of "nothingness":

    ISO Control (gc=Cc)
    Unicode Format Control (gc=Cf)
    Layout Control (gc=Cf, Zl, Zp, some Cc, and arguably, spaces)
    Space (gc=Zs)
    White_Space
    Blank (of glyph)
    Placeholder (e.g. U+FFFC OBJECT REPLACEMENT CHARACTER)
    Default_Ignorable_Code_Point

    They don't define all the same classes, and overlap in funny
    ways, sometimes.
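
    A minimal sketch of the overlap, assuming Python's standard
    unicodedata module: it looks up only the General_Category (gc) of a
    few hand-picked characters from this discussion. The module does not
    expose White_Space or Default_Ignorable_Code_Point, so those
    properties are not shown. The SAMPLES table and the helper name are
    illustrative choices.

```python
import unicodedata

# A few of the "nothing" characters mentioned in this thread
# (hand-picked for illustration, not an exhaustive list).
SAMPLES = {
    "CHARACTER TABULATION": "\u0009",
    "SPACE": "\u0020",
    "COMBINING GRAPHEME JOINER": "\u034F",
    "ZERO WIDTH NON-JOINER": "\u200C",
    "ZERO WIDTH JOINER": "\u200D",
    "LINE SEPARATOR": "\u2028",
    "PARAGRAPH SEPARATOR": "\u2029",
    "OBJECT REPLACEMENT CHARACTER": "\uFFFC",
}


def general_categories(samples):
    """Map each sample name to its General_Category value."""
    return {name: unicodedata.category(ch) for name, ch in samples.items()}


if __name__ == "__main__":
    for name, gc in general_categories(SAMPLES).items():
        print(f"{gc}  {name}")
```

    CGJ comes out as gc=Mn while ZWJ and ZWNJ are gc=Cf, so the general
    category alone does not identify the invisible characters.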

    > According to this,
    > "Default ignorable code points are those that should be ignored by
    > default in rendering (unless explicitly supported)... An implementation
    > should ignore default ignorable characters in rendering whenever it does
    > /not/ support the characters." So my suggestion that a renderer should
    > simply ignore CGJ is far from twisting the requirements of Unicode, it
    > is in fact a requirement of Unicode 4.0 though one that I am hardly
    > surprised that some people have missed.

    Here is the wording from Unicode 4.0:

    ====================================================================

    Default ignorable code points are those that should be ignored by
    default in rendering unless explicitly supported. They have no
    visible glyph or advance width in and of themselves, although they
    may affect the display, positioning, or adornment of adjacent or
    surrounding characters. ...

    An implementation should ignore default ignorable characters in
    rendering whenever it does *not* support the characters. ...

    With default ignorable characters, such as U+200D ZERO WIDTH JOINER,
    the situation is different [from the normal case where an unsupported
    character would be displayed with a black box, for example]. If the
    program does not support that character, the best practice is to
    ignore it completely without displaying a last-resort glyph or
    a visible box because the normal display of the character
    is invisible: Its effects are on other characters. Because the
    character is not supported, those effects cannot be shown.

                                  -- TUS 4.0, p. 142.
                                  
    =====================================================================

    This wording was, of course, written with such format controls
    as ZWJ and ZWNJ in mind, which *do* have formatting effects
    on adjacent characters. But the CGJ is also given the
    Default_Ignorable_Code_Point property. In fact, in order to get
    that (derived) property, it has to be *explicitly* given the
    Other_Default_Ignorable_Code_Point property in PropList.txt,
    since it, like the variation selectors, is gc=Mn (non-spacing
    combining mark), a category which isn't automatically defined to be
    default ignorable.
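
    A deliberately simplified sketch of why the explicit listing matters.
    It assumes a rule of "gc=Cf, or explicitly listed", which is not the
    full derivation in the Unicode data files. The set below is a
    hypothetical hand-copied subset (CGJ plus U+FE00..U+FE0F), not the
    contents of PropList.txt.

```python
import unicodedata

# Hypothetical, hand-picked stand-in for the explicitly listed
# Other_Default_Ignorable_Code_Point entries: CGJ and the
# variation selectors U+FE00..U+FE0F.
OTHER_DEFAULT_IGNORABLE_SUBSET = frozenset({0x034F} | set(range(0xFE00, 0xFE10)))


def is_default_ignorable_sketch(ch, explicit=OTHER_DEFAULT_IGNORABLE_SUBSET):
    """Simplified rule: gc=Cf, or explicitly listed.

    This is an illustration only; the real derived property has
    further inclusions and exclusions.
    """
    return unicodedata.category(ch) == "Cf" or ord(ch) in explicit


if __name__ == "__main__":
    for ch in ("\u200D", "\u034F", "\uFE00", "\u05B4"):
        print(f"U+{ord(ch):04X}", unicodedata.category(ch),
              is_default_ignorable_sketch(ch))
```

    With an empty explicit set, CGJ would fall out of the sketch's
    result, because gc=Mn alone does not qualify it.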

    Where the CGJ differs from the format controls (and the variation
    selectors, for that matter) is that it is defined to have *no*
    formatting effect on neighboring characters. So even if you
    don't formally support it, you know that it shouldn't be having
    any effect on the formatting of neighboring characters.
    However, making it default ignorable is the right thing to do,
    because it is itself always invisible for display. (Unless you
    are doing a Show Hidden display, of course.)

     
    > The internal process by which a particular renderer implements ignoring
    > a glyph is a matter for a particular implementation. John Hudson and I
    > have suggested a mechanism for doing this with Uniscribe by treating the
    > character internally as a normal character with a blank glyph and always
    > ligating it with the preceding character. There may be other mechanisms
    > which are cleaner. But in any case it seems to be a requirement not just
    > for fixing this Hebrew problem but for conformance with Unicode as a
    > whole that some such mechanism is implemented, so that CGJ is ignored by
    > the renderer unless some specific behaviour is defined.

    Correct. And the difficulty seems to be in the interpretation of
    what "ignored by the renderer" means and what obligations it
    places on implementations. If "ignored by the renderer" is taken
    as swallowed internally in the script logic and never presented
    to the actual glyph display mechanism (i.e., never "paint" it),
    then we run into the trouble that John Hudson has been
    talking about for use of format controls. But if "ignored by
    the renderer" is taken as do no processing in the script logic
    and instead just present it blindly to the actual glyph
    display mechanism, where the fonts then deal with its default
    ignorable status by rendering it with a non-advance, blank glyph
    rather than the missing glyph box, then we are in a position to
    have both the text processing requirements and the display
    requirements for Biblical Hebrew neatly met.
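
    The two readings can be contrasted with a toy model. Everything in
    it is hypothetical: the Glyph type, the one-entry DEFAULT_IGNORABLE
    set, the fixed advance of 10 units, and both render functions. It
    does not model Uniscribe or any real font.

```python
from dataclasses import dataclass

CGJ = "\u034F"
DEFAULT_IGNORABLE = {CGJ}  # toy stand-in for the real property


@dataclass
class Glyph:
    char: str
    advance: int
    visible: bool


def toy_font_lookup(ch):
    """Hypothetical font: blank, zero-advance glyph for default ignorables."""
    if ch in DEFAULT_IGNORABLE:
        return Glyph(ch, advance=0, visible=False)
    return Glyph(ch, advance=10, visible=True)


def render_swallow(text):
    """Reading 1: the script logic drops the character before glyph lookup."""
    return [toy_font_lookup(ch) for ch in text if ch not in DEFAULT_IGNORABLE]


def render_pass_through(text):
    """Reading 2: no script-logic processing; the font supplies a blank glyph."""
    return [toy_font_lookup(ch) for ch in text]


if __name__ == "__main__":
    text = "a" + CGJ + "b"
    print(len(render_swallow(text)), len(render_pass_through(text)))
```

    Both readings give the same total advance in this toy, but only the
    second keeps the CGJ in the glyph sequence.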

    And the bonus is this: any other case of mismatch between
    required distinctions for ordering of combining marks for
    any script, where normalization of the text would result in
    collapse of distinctions or unexpected order, can *also*
    be dealt with by the same use of CGJ. No special cases are
    required, no new characters are required, and no change
    of any properties is required.
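
    A small sketch of the effect, using Python's standard
    unicodedata.normalize. The sequence is chosen for illustration: a
    lamed carrying patah (combining class 17) followed by hiriq
    (combining class 14). CGJ has combining class 0, so canonical
    reordering does not move marks across it.

```python
import unicodedata

LAMED = "\u05DC"   # HEBREW LETTER LAMED
PATAH = "\u05B7"   # HEBREW POINT PATAH, combining class 17
HIRIQ = "\u05B4"   # HEBREW POINT HIRIQ, combining class 14
CGJ = "\u034F"     # COMBINING GRAPHEME JOINER, combining class 0


def nfc(s):
    """Normalize to NFC."""
    return unicodedata.normalize("NFC", s)


def code_points(s):
    """Readable dump of a string as U+XXXX values."""
    return " ".join(f"U+{ord(ch):04X}" for ch in s)


plain = LAMED + PATAH + HIRIQ
with_cgj = LAMED + PATAH + CGJ + HIRIQ

if __name__ == "__main__":
    print(code_points(nfc(plain)))     # patah and hiriq swapped
    print(code_points(nfc(with_cgj)))  # order preserved
```

    Without the CGJ, normalization swaps the two points; with it, the
    entered order survives.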

    > In the case of
    > rendering Hebrew, there seems to be no pressing need to define specific
    > behaviour as the default is at least close to what is required.

    Exactly. And frankly, I am finding it difficult to understand
    why people are characterizing the CGJ proposal as a kludge
    or an ugly hack. It strikes me as a rather elegant way of
    resolving the problem -- using existing encoded characters and
    existing defined behavior.

    And as Peter Kirk pointed out, in the main Unicode electronic
    corpus in question, the *data* fix involved for this is
    insertion of CGJ in 367 instances of Yerushala(y)im plus a
    smattering of other places. That is *way* less disruptive
    than the proposal to replace all of the Hebrew points with cloned
    code points. It is *way* *way* *way* less disruptive than the
    impact of destabilizing normalization by trying to change the
    combining classes. And it is far more elegant than trying to
    catalog and encode Hebrew point combinations as separate
    characters.
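
    One way such a data fix might be scripted, as a sketch only. The
    function name and the rule are assumptions: it inserts CGJ wherever
    two adjacent combining marks would otherwise be swapped by canonical
    reordering. A real corpus fix would be driven by the specific words
    involved, and this is not the procedure anyone in the thread
    described.

```python
import unicodedata

CGJ = "\u034F"


def protect_mark_order(text):
    """Insert CGJ between adjacent marks that normalization would reorder.

    Two neighbors are reordered when the first has a higher combining
    class than the second and the second's class is nonzero.
    """
    out = []
    prev = ""
    for ch in text:
        if prev and unicodedata.combining(prev) > unicodedata.combining(ch) > 0:
            out.append(CGJ)
        out.append(ch)
        prev = ch
    return "".join(out)


if __name__ == "__main__":
    sample = "\u05DC\u05B7\u05B4"  # lamed, patah, hiriq
    fixed = protect_mark_order(sample)
    print(" ".join(f"U+{ord(c):04X}" for c in fixed))
```

    Running it a second time on its own output adds nothing further,
    since CGJ has combining class 0.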

    --Ken


