Re: Orthographies using ZWNJ (was: Displaying control characters)

From: Asmus Freytag (
Date: Sun Jul 22 2007 - 16:48:10 CDT

    On 7/22/2007 11:47 AM, Philippe Verdy wrote:
    > Asmus Freytag wrote:
    >>> If this is something else, which options do we have to explicitly mark
    >>> syllable breaks without ligatures, with or without a visible hyphen?
    >>> What will happen with joining scripts (i.e. Arabic, Devanagari...) or
    >>> cursive styles of alphabetic scripts? Does a prohibition of ligature also prohibit the usual joining?
    >> If you had read the standard, before creating your own alternate
    >> reality, you wouldn't need to ask that question. The role of ZWNJ in
    >> joining is explicitly described.
    > You don't need to rant about my reading of the standard. I have said in my
    > message that ZWNJ was used to control ligation/joining during rendering. I
    > spoke about something else.
    I still believe that you can find the answer to that particular question
    in the standard.
    > You affirm that Unicode does not encode syllable breaks but it's completely
    > wrong. SHY is a perfect example of an explicit syllable break.
    Before you claim that some statement is "completely wrong" it's best to
    be clear about what the statement actually contained. I had written:

    "There is no 'syllable delimiter' in Unicode. "

    While the SHY often appears at a syllable boundary, its function is not
    to mark syllables, but to mark places where a word may be split during
    line breaking. I believe it is not difficult to come up with examples
    of syllable boundaries that should not be used for line breaking.
    Placing an SHY at those locations, even though they are syllable
    boundaries, would be a mistake.

    Therefore, my statement stands: no character in Unicode is dedicated to
    being a syllable delimiter.
    > I was speaking about the effect of combining or detaching the effect of
    > syllable breaks and ligatures. My question is still not answered.
    > What I have seen is that the presence of a word joiner really prevents a
    > ligature, although it is not specified anywhere;
    That's an implementation shortcut, but, as you correctly state, not
    something you should rely on, as it's not sanctioned in the standard.
    > and if it is used as an
    > invisible syllable break (which will never be rendered as a hyphen if a line
    > break occurs) for compound words that are normally not separated by space or
    > hyphen, but that may still be split if needed on line boundaries,
    Now this is 'completely wrong'! A Word Joiner, by definition, prevents
    line breaks. So, if you place one between two syllables they may *not*
    be split if needed on line boundaries. In Unicode 5.0, this behavior is
    > I think
    > it's normal that it prevents the formation of a ligature.
    The reason it does has nothing to do with the argument you just
    presented (as it is based on a faulty premise). The reason is that
    ligature formation is based on more or less mechanical lookup of pairs
    or triplets, and if a WJ is not filtered, it will disrupt the lookup.
    The only character sanctioned for this purpose is ZWNJ.
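
    A minimal Python sketch of such a mechanical lookup, assuming a toy
    ligature table (the table, the function name and the `ignorable`
    parameter are hypothetical and not taken from any real shaping engine).
    It shows why an unfiltered WJ blocks the "fi" ligature as a side effect,
    while ZWNJ blocks it by design:

```python
ZWNJ = "\u200C"  # ZERO WIDTH NON-JOINER
WJ = "\u2060"    # WORD JOINER

# Hypothetical toy ligature table: letter sequence -> ligature glyph.
LIGATURES = {"ffi": "\uFB03", "fi": "\uFB01", "fl": "\uFB02"}


def apply_ligatures(text, ignorable=""):
    """Greedy left-to-right lookup of triplets, then pairs.

    Characters listed in `ignorable` are filtered out before the lookup
    (and therefore also dropped from the result, a simplification).
    """
    filtered = "".join(ch for ch in text if ch not in ignorable)
    out = []
    i = 0
    while i < len(filtered):
        for length in (3, 2):
            chunk = filtered[i:i + length]
            if chunk in LIGATURES:
                out.append(LIGATURES[chunk])
                i += len(chunk)
                break
        else:
            # No triplet or pair starts here: copy the character as-is.
            out.append(filtered[i])
            i += 1
    return "".join(out)


plain = "Schilfinsel"
with_zwnj = "Schilf" + ZWNJ + "insel"
with_wj = "Schilf" + WJ + "insel"

print(apply_ligatures(plain) != plain)                   # ligature formed
print(apply_ligatures(with_zwnj) == with_zwnj)           # blocked by design
print(apply_ligatures(with_wj) == with_wj)               # blocked by accident
print(apply_ligatures(with_wj, ignorable=WJ) != with_wj)  # WJ filtered
```

    In this sketch, once the WJ is filtered before the lookup, the ligature
    forms again; only the ZWNJ keeps preventing it.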
    > Now the question remains: what is the effective difference between WJ and
    > ZWNJ? I can't see any, both on the morphological analysis side, and on the
    > rendering side.
    This question does not remain - the differences are obvious and well
    specified in the text of Unicode 5.0.
    > If WJ is not expected to break a ligature, this should be specified so that
    > ZWNJ will be used explicitly to control that (WJ will still be used to
    > control word breaks, mostly in scripts that have no required word separation
    > by spaces or other punctuation marks)
    > I saw this concern when replying to the message sent by Karl Pentzlin
    > speaking about the compound word "Schilfinsel" (i.e. "Schilf" + "Insel"
    > without a "fi" ligature), that he wants to encode as "Schilf<ZWNJ>insel",
    > where the absence of ligature is expected to really mark the internal
    > syllable break.
    Karl is entirely correct. German orthography disallows ligation in such
    situations, and the UTC decided to give the ZWNJ the role of ligature
    preventer based on this case (and similar cases in other languages).
    > German compound words (in my opinion) contain more than just a rendering hint
    > (ZWNJ) and WJ is certainly more significant to say that. So there are two
    > situations when an author is tuning the rendering of the text and uses a
    > hyphenation algorithm to mark explicitly where syllable breaks will occur:
    > (1) Either the syllable break is wanted and expected here, so he
    > will insert a SHY between the two parts of the word; but SHY still does not
    > prevent a ligature, so he will need BOTH ZWNJ (against the "fi" ligature)
    > and SHY after it: the resulting string will be "Schilf<ZWNJ><SHY>insel";
    There is no question, the sequence you give should render and line break
    correctly in a fully conformant implementation.

    In practice, that author only needs the ZWNJ, because any decent
    hyphenator for German can resolve this particular word and would need no
    tuning. Also, hyphenators do not actually need to insert SHYs.
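
    For readers who want to experiment, here is a small Python sketch that
    turns the <ZWNJ>/<SHY>/<WJ> notation used in this thread into the actual
    code points. The helper name `expand_markers` is made up for this
    illustration; the code points and character names are those of the
    standard.

```python
import unicodedata

# Notation used in this thread -> actual code points.
MARKERS = {
    "<ZWNJ>": "\u200C",  # ZERO WIDTH NON-JOINER
    "<SHY>": "\u00AD",   # SOFT HYPHEN
    "<WJ>": "\u2060",    # WORD JOINER
}


def expand_markers(text):
    """Replace each <NAME> marker with the character it stands for."""
    for marker, char in MARKERS.items():
        text = text.replace(marker, char)
    return text


word = expand_markers("Schilf<ZWNJ><SHY>insel")

# List the invisible format characters (general category Cf) in the word.
for ch in word:
    if unicodedata.category(ch) == "Cf":
        print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```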
    > (2) Or the syllable break is not desired, and WJ will be used to say
    > that explicitly (preventing an automated hyphenator to insert a line break
    > here),
    Actually, a WJ is not the designated way to communicate with
    hyphenators. If it is a question of placing a syllable boundary at a
    *different* place, then the convention I've most often seen is that when
    a user adds a SHY anywhere in a word, the hyphenator ignores that word.

    That mechanism can be used to distinguish Wach|stube from Wachs<SHY>tube
    where the "|" shows the automatic hyphen. No automatic hyphenator can
    hyphenate this string of letters correctly at all times, because it
    spells two different words whose only distinction is the location of
    the compound division.
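
    The convention can be sketched in a few lines of Python. This is only an
    illustration of the convention as described here, not the behavior of
    any particular hyphenator; the `AUTO_BREAKS` table is a hypothetical
    stand-in for a real hyphenation dictionary or pattern set.

```python
SHY = "\u00AD"  # SOFT HYPHEN

# Hypothetical stand-in for an automatic hyphenator's knowledge.
# Offsets count letters from the start of the word: Wach|stube -> 4.
AUTO_BREAKS = {"Wachstube": [4]}


def break_offsets(word):
    """Return the letter offsets at which a hyphenated break is allowed.

    If the user placed any SHY in the word, only those positions are used
    and the automatic hyphenator leaves the word alone.
    """
    if SHY in word:
        offsets = []
        letters = 0
        for ch in word:
            if ch == SHY:
                offsets.append(letters)
            else:
                letters += 1
        return offsets
    return AUTO_BREAKS.get(word, [])


print(break_offsets("Wachstube"))            # automatic: Wach|stube
print(break_offsets("Wachs" + SHY + "tube"))  # manual: Wachs|tube
```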

    In your example, using the convention I just cited, the word would
    become: Schilf<ZWNJ>in<SHY>sel. By explicitly marking the other viable
    word-break point, the hyphenator would leave the word alone. However,
    the result would be exceedingly awkward for German readers - but we agree
    that this is merely a contrived example.

    In cases like this there's no need for WJ. However, and here we finally
    come to an open question: what is the recommended way to indicate the
    desired *absence* of any automatic hyphens in a given word?

    Logically, placing a single <SHY> at the beginning or end of a word
    could extend such a convention to prevent any automatic hyphen. But
    those conventions, by necessity, are dependent on the hyphenator, and
    are not specified in the standard. Some implementations may opt to use
    markup/user interface instead.
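
    As a sketch of how a hyphenator might honor such an extension (purely
    hypothetical, since none of this is specified in the standard): a SHY at
    the very beginning or end of a word offers no usable break position, so
    the word ends up with no hyphenation points at all. The function name
    and the use of `None` for "no manual marking" are assumptions made for
    this illustration.

```python
SHY = "\u00AD"  # SOFT HYPHEN


def manual_break_offsets(word):
    """Return None if the word has no SHY (automatic hyphenation applies).

    Otherwise return the interior offsets marked by SHY, counted in
    characters of the word with the SHYs removed. A SHY at the start or
    end contributes no offset, so a word marked only there cannot be
    hyphenated at all.
    """
    if SHY not in word:
        return None
    bare_length = len(word.replace(SHY, ""))
    offsets = []
    seen = 0
    for ch in word:
        if ch == SHY:
            if 0 < seen < bare_length:
                offsets.append(seen)
        else:
            seen += 1
    return offsets


print(manual_break_offsets("Schilfinsel"))        # None: hyphenator decides
print(manual_break_offsets("Schilfinsel" + SHY))  # []: no hyphen anywhere
```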
    > but as WJ does not prevent the ligature (it is not specified, but
    > this ligature avoidance is still occurring with most renderers), so he will
    > need to encode BOTH ZWNJ (against the "fi" ligature) and WJ after it (to
    > disable any hyphenating line break): the resulting string will be
    > "Schilf<ZWNJ><WJ>insel", rather than just "Schilf<WJ>insel".
    > I am not inventing things. This is a "grey area" where something is not
    > clearly specified, and due to the current implementations, I still see no
    > clear difference between the effects of ZWNJ and WJ and how to use them, and
    > what they are effectively preventing or enforcing. If WJ should effectively
    > prevent a ligature, then it should be specified (and using ZWNJ in the
    > alternative (2) above will NEVER be needed)
    > My message had NOTHING that would let someone think that it was a
    > "recommendation" or interpretation. You should have read it as a QUESTION
    > left to discussions.
    Claiming that something is a question when the answers to it are in
    the standard amounts to a reinterpretation. If all you wanted was to
    raise the question: "Do we need a better convention for communicating
    with automatic hyphenators?" it would have been much easier to get to
    the meat of the discussion had you asked the question in that way.


    This archive was generated by hypermail 2.1.5 : Sun Jul 22 2007 - 16:50:37 CDT