RE: Orthographies using ZWNJ (was: Displaying control characters)

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Wed Jul 18 2007 - 11:36:24 CDT

    Karl Pentzlin
    > Sent: Wednesday, 18 July 2007 13:08
    > To: Behnam
    > Cc: Unicode List
    > Subject: Orthographies using ZWNJ (was: Displaying control characters)
    >
    > On Wednesday, 18 July 2007 at 12:46, Behnam wrote:
    >
    > B> I know of at least two languages that use ZWNJ on the keyboard and
    > B> ZWNJ (and ZWJ to a lesser extent) are within text encoding: Persian
    > B> and Kurdish (Sorani)
    >
    > ZWNJ is also needed for German (in advanced typography and when using
    > Fraktur), as typesetting rules prohibit visible ligatures e.g. at the
    > border of constituents of compound nouns. E.g., "Schilfinsel" (island
    > full of reed, compound of "Schilf" + "Insel") needs a ZWNJ between the
    > f and i to prevent a visible fi ligature there.

    Can someone explain the effective difference between WORD JOINER (U+2060)
    (that also prohibits ligatures) and ZERO WIDTH NON-JOINER (U+200C), given
    that they are both intended to be zero-width and invisible, and they are
    both format controls?
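
    As a quick check of the "both format controls" premise, here is a minimal
    sketch using Python's standard unicodedata module. The names CANDIDATES,
    describe and table are made up for this illustration. It only reports the
    character name and General_Category; it says nothing about line breaking
    or ligature behaviour, which this module does not expose.

```python
import unicodedata

# The three invisible characters discussed in this thread.
CANDIDATES = {"ZWNJ": "\u200c", "WJ": "\u2060", "SHY": "\u00ad"}

def describe(ch):
    """Return (code point, Unicode name, General_Category) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))

table = {label: describe(ch) for label, ch in CANDIDATES.items()}

for label, row in table.items():
    print(label, *row)
```

    All three come back with category Cf (format), so the General_Category
    alone cannot answer the question; any difference has to lie elsewhere.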

    I suspect that:
    * ZERO WIDTH NON-JOINER (ZWNJ) only suppresses the formation of ligatures
    (i.e. it is just a hint that helps renderers choose between ligated and
    non-ligated forms). It does not explicitly mark that a syllable break or
    hyphenation is prohibited (i.e. it may occur in the middle of a syllable,
    or at any place in a word where a syllable break might otherwise occur).
    * WORD JOINER (WJ) explicitly marks a syllable break that must not be
    ligated, because it joins two words (so it is itself an explicit
    syllable-break candidate).

    If this is correct, then WJ is like a combination of ZWNJ and a sort of
    invisible soft hyphen (SHY): it marks a syllable break. The differences
    are that when a line break actually occurs at a SHY, the SHY turns into a
    visible hyphen whereas WJ does not, and that SHY does not prohibit
    ligatures. In a word like "effect", encoded as "ef<SHY>fect", the result
    should be rendered either as "ef-<line break>fect" or as
    "e<ff-ligature>ect".

    So "ef<WJ>fect" would prohibit the ligature and would always be rendered
    as "ef<no-ligature>fect" or as "ef<line break>fect". The same holds for
    "dif<WJ>ference", where both renderings are possible, but always without
    a ligature, and without a visible hyphen when a line break occurs at the
    syllable break.

    And "ef<ZWNJ>fect" would prohibit the ligature (as explicitly documented
    in the Unicode standard) but would not explicitly be a candidate syllable
    break (a break should not occur there in Latin typography anyway, given
    that a first syllable of only 2 letters is too short). It would therefore
    always be rendered as "ef<no-ligature>fect", unless the author wants such
    a break and writes "ef<ZWNJ><SHY>fect", in which case it would be
    rendered either as "ef<no-ligature>fect" or as "ef-<line break>fect".

    If this is something else, which options do we have to explicitly mark
    syllable breaks without ligatures, with or without a visible hyphen?
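
    To make the sequences above concrete, here is a small sketch that expands
    the <ZWNJ>, <WJ> and <SHY> placeholders used in this message into the
    actual code points. The names CONTROLS, expand, samples and encoded are
    made up for this illustration, and the code only builds the strings; how
    a renderer treats them is exactly what is being asked here.

```python
import re

# Placeholder notation used in this message -> actual code point.
CONTROLS = {"ZWNJ": "\u200c", "WJ": "\u2060", "SHY": "\u00ad"}

def expand(notation):
    """Replace <ZWNJ>, <WJ> and <SHY>; leave any other <...> marker untouched."""
    return re.sub(r"<(ZWNJ|WJ|SHY)>", lambda m: CONTROLS[m.group(1)], notation)

samples = [
    "Schilf<ZWNJ>insel",
    "ef<SHY>fect",
    "ef<WJ>fect",
    "ef<ZWNJ>fect",
    "ef<ZWNJ><SHY>fect",
]
encoded = {s: expand(s) for s in samples}

# Show each encoded string as a list of code points.
for notation, text in encoded.items():
    print(notation, [f"U+{ord(c):04X}" for c in text])
```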

    What will happen with joining scripts (i.e. Arabic, Devanagari...) or
    cursive styles of alphabetic scripts? Does a prohibition of ligature also
    prohibit the usual joining?

    Do any of these prevent aggregation into a Hangul cluster? (I suspect
    that none of them is designed to control that, given that Hangul has its
    own way to mark syllable breaks between jamos. Still, the need may
    sometimes arise between two leading consonant jamos that are considered
    part of a single syllable, in historical texts where there may be more
    than just "double" leading consonants, i.e. the SSANG letters.)


