From: Mark Davis
Date: Thu Nov 25 2004 - 09:38:54 CST


    I want to correct some misperceptions about CGJ; it should not be used for

    From, down on
    page 392 (sorry for the boxes, that's Acrobat).

    U+034F COMBINING GRAPHEME JOINER is used to indicate that adjacent
    characters are to be treated as a unit for the purposes of
    language-sensitive collation and searching. In language-sensitive
    collation and searching, the combining grapheme joiner should be ignored
    unless it specifically occurs within a tailored collation element mapping.
    Thus it is given a completely ignorable collation element in the default
    collation table, like NULL (see Unicode Technical Standard #10, “Unicode
    Collation Algorithm,” and also ISO/IEC 14651). However, it can be entered
    into the tailoring rules for any given language, using the tailoring
    capabilities of the collation standards.
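
    As a rough illustration of "completely ignorable", the Python sketch
    below drops U+034F before comparing or searching. It is only an
    approximation under stated assumptions: it is not the Unicode Collation
    Algorithm, it does no tailoring, and the helper names naive_key and
    naive_contains are hypothetical, not part of any library.

```python
import unicodedata

CGJ = "\u034f"  # COMBINING GRAPHEME JOINER


def naive_key(text: str) -> str:
    """Hypothetical comparison key: remove CGJ, then case-fold.

    This only imitates the effect of a completely ignorable collation
    element. A real implementation would use a UCA collator instead.
    """
    return text.replace(CGJ, "").casefold()


def naive_contains(haystack: str, needle: str) -> bool:
    """Search that ignores CGJ on both sides (no tailoring applied)."""
    return naive_key(needle) in naive_key(haystack)


if __name__ == "__main__":
    print(unicodedata.name(CGJ))
    print(naive_key("A" + CGJ + "E") == naive_key("AE"))
    print(naive_contains("Ca" + CGJ + "esar", "ae"))
```

    A tailoring that maps a sequence containing CGJ to its own collation
    element would have to be handled before this stripping step; the
    sketch does not attempt that.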

    For rendering, the combining grapheme joiner is invisible. However, some
    older implementations may treat a sequence of grapheme clusters linked by
    combining grapheme joiners as a single unit for the application of
    enclosing combining marks. For more information on grapheme clusters, see
    Unicode Technical Report #29, “Text Boundaries.” For more information on
    enclosing combining marks, see Section 3.11, Canonical Ordering Behavior.

    The combining grapheme joiner must not be confused with the zero width
    joiner or the word joiner, which have very different functions. In
    particular, inserting a combining grapheme joiner between two characters
    should have no effect on their ligation or cursive joining behavior. Where
    the prevention of line breaking is the desired effect, the word joiner
    should be used. For more information on the behavior of these characters
    in line breaking, see Unicode Standard Annex #14, “Line Breaking
    Properties.”
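
    To keep the three characters apart, the short Python sketch below looks
    them up in the standard unicodedata module and prints their code point,
    name, general category, and canonical combining class. The describe
    helper and the JOINERS table are hypothetical names chosen for this
    example, and the output depends on the Unicode version bundled with the
    Python build.

```python
import unicodedata

# The three characters that are easy to confuse with one another.
JOINERS = {
    "CGJ": "\u034f",  # combining grapheme joiner
    "ZWJ": "\u200d",  # zero width joiner
    "WJ": "\u2060",   # word joiner
}


def describe(ch: str) -> tuple:
    """Return (code point, name, general category, combining class)."""
    return (
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        unicodedata.category(ch),
        unicodedata.combining(ch),
    )


if __name__ == "__main__":
    for label, ch in JOINERS.items():
        print(label, *describe(ch))
```

    The lookup only shows that these are three separate characters with
    separate properties; it says nothing about how a given font or
    rendering engine treats them.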


    ----- Original Message -----
    From: "Doug Ewell"
    To: "Unicode Mailing List"
    Sent: Wednesday, November 24, 2004 22:09
    Subject: Re: CGJ , RLM

    > "kefas" <pmr at informatik dot uni dash frankfurt dot de> wrote:
    > > 1. U+034F CGJ, Combining Grapheme Joiner, is
    > > displayed as a tall rectangle in MSKLCexe-test and as
    > > a capital square in OutlookExpress A͏E a͏e͏a͏e. But
    > > CGJ "has no visible glyph"! Thus CGJ is not
    > > implemented correctly in Arial Unicode MS. Or are the
    > > editors not implemented correctly?
    > U+034F was added to Unicode 3.2 in March 2002. Your copy of Arial
    > Unicode MS may have been released before that date. Or it may be that
    > Microsoft has chosen not to implement U+034F in this particular font,
    > which is not the same as implementing it incorrectly.
    > > Should A+CGJ+E
    > > yield the Danish double letter a+(e-attached) ? Or
    > > do I hope in vain.
    > Someone, some day may choose to render A + CGJ + E as Æ. Don't be
    > misled into thinking they are equivalent, however.
    > > Is there a general rule how graphically to join 2
    > > arbitrary characters? Normal tf looks already joined
    > > to me, and causes me problems of recognizing t and f
    > > as distinct letters. (I have astigmatism: cyl -3.0,
    > > which is not that rare) m and rn look the same from
    > > normal reading distance!. Some editors / some fonts
    > > display an m with uneven spacing of legs, which looks
    > > to me as if r+n is written. Any help in planning (you
    > > font-designers)?
    > There probably could not be a general rule about this, because it is too
    > dependent on individual typeface designs. Sans-serif fonts like Arial
    > will likely have many more "joined" combinations than serif fonts like
    > Times, because the serifs interrupt the joining behavior. Whether the
    > horizontal strokes on a "t" and an "f" line up with each other is also
    > highly font-dependent. In many cases they do not.
    > I think I have your astigmatism beat, at least in one eye.
    > > 2. RLM, the Right to Left marker, seems to have no
    > > effect yet. Hebrew bet+RLM+SPace should leave the
    > > Cursor at Left and not 'jump' to the right of bet as
    > > it does for good or worse for bet+SP. If this is a
    > > correct expectation, then how can I tell (e.g. via
    > > MSKLC.exe) to insert RLM+SPace on CAPS+SPace ?
    > This may have more to do with the rendering engine than with the font.
    > -Doug Ewell
    > Fullerton, California
