Re: Orthographies using ZWNJ (was: Displaying control characters)

From: Asmus Freytag (
Date: Tue Jul 24 2007 - 02:52:13 CDT

    On 7/23/2007 8:51 PM, Philippe Verdy wrote:
    > Asmus Freytag wrote:
    >> Rest assured, the WJ would be quite incorrect. The fact that you keep
    >> repeating this indicates that you did not read the standard or any of my
    >> other posts.
    > Rest assured that I read the standard and did not find any rationale about
    > the use or semantics of WJ compared to ZWNJ which was introduced only much
    > later to replace the deprecated ZWNBSP used now as a BOM.
    You are asking the wrong question. The standard is unambiguous about the
    use of WJ (see below). There is no need to "compare" it to ZWNJ. There's
    also no need to compare it to the PLUS SIGN.

    The standard also is unambiguous about the use of ZWNJ.

    Some *implementations* break ligatures when they encounter *any*
    character among the characters that are intended to form the ligature.
    That is an *implementation* problem, not a problem of the *standard*.
    > I absolutely don't care about the linguistic definitions of "syllables"
    > because this cannot be treated at the encoding or local orthographic level
    > without the help of some language-specific dictionary. These linguistic
    > syllables are NOT a property of the script with which these languages are
    > written, and Unicode does not encode the languages, so it cannot treat them.
    > However, it's up to Unicode to define the way a script can be encoded to
    > specify essential things like the prohibition or preference of ligatures, or
    > the prohibition or suggested "syllable" breaks.
    > Yes, English lacks a correct word for saying "syllable breaks", i.e. the
    > fact that some places in a word can be used to split it to separate lines,
    > possibly also adding some visible mark when this occurs. What I really mean
    > by "syllable break" in ALL what I have written since now is what is meant by
    > the much more precise French term "césure".
    Well, that doesn't help me, because I don't speak that much French.

    Why don't we then stop talking about syllables, and use the terminology
    that's appropriate in the context of the Unicode Standard. The SHY
    essentially marks an "intra-word line break opportunity", and, as we all
    now agree, it's irrelevant whether these line up with syllables.

    The Word Joiner explicitly defines the *absence* of any line break
    opportunity -- and UAX#14 (which is fully a part of the standard) says
    that quite clearly.
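
For reference, the format characters named in this thread can be inspected with Python's standard `unicodedata` module. This is only a lookup sketch; the abbreviations in `CHARS` and the helper name `describe` are chosen here for illustration and are not part of any standard API.

```python
import unicodedata

# Code points of the format characters discussed in this thread.
CHARS = {
    "SHY": "\u00ad",
    "ZWSP": "\u200b",
    "ZWNJ": "\u200c",
    "WJ": "\u2060",
}

def describe(ch):
    """Return (code point, Unicode name, general category) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))

for abbrev, ch in CHARS.items():
    print(abbrev, *describe(ch))
```

The lookup only identifies the characters; their line breaking behavior is specified in UAX#14, not in the names or general categories shown here.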
    > You tried to use the terms "word breaking" but this term seems wrong too for
    > this usage: for me word breaking is the fact of splitting a text into
    > separate words, not the fact of finding possible breaks within a word.
    But you understand what I meant: locating line break opportunities
    within words (defined loosely as otherwise unbroken runs of text). So,
    let's continue on that basis, then.
    > All your misunderstanding of what I meant (suggesting that I wanted to
    > redefine things, which I am not) is caused by the misunderstanding of the
    > English expression "syllable break". Read it as the French term "césure",
    > which is much better than "syllable break" (even though no "césure" can
    > occur in the middle of a linguistic syllable in French).
    Switching languages is not the answer - I'm sure the German terms are
    much clearer than either the French or the English, and if that isn't
    enough, we could switch to Swedish. ;-)

    Therefore, I'm skipping this diverting digression...
    > And yes I know that a césure is *preferably* not used in every places (but
    > absolutely NOT forbidden), for stylistic reasons (in French it is preferable
    > to not insert a césure after the prefixes "con-", "cul-",... or in the
    > middle of "coha-bite" for the same reasons that it would be read as
    > offensive).
    > I say "preferably", because there are frequent cases where this use is
    > wanted by authors, notably in poetry and the texts of songs (where the
    > césure is made audible by the rhythm or the melody), but also for the most
    > vernacular use. Look at the French article about "césure" in Wikipédia,
    > you'll find some external references about these funny césures used
    > purposely in songs; the most well-known cases in France being those from
    > Serge Gainsbourg who was known to have an excellent mastership of the
    > correct French language (despite his language was perceived as crude and
    > shocking in the 1960's). I'm sure that such authors also exist in other
    > international cultures, and that playing with the too strict commonly
    > admitted language rules is wanted in every cultures, that don't want to
    > restrict the language only to formal uses.
    > So even if a language will preferably be not rendered with these generally
    > undesired césures, or will preferably not leave a short syllable alone with
    > just one or two letters for typographic reasons, these considerations are
    > NOT considered incorrect for the language itself, where preferences of style
    > is left as a choice by the author.
    .. until here.
    > Now let's get back to Unicode and text encodings: how can an author specify
    > simultaneously in the text where ligatures can or cannot occur, and where
    > césures
    recte: line break opportunities
    > can or cannot occur? And if it occurs, how must the césures
    recte: line break opportunities
    > be
    > presented (standard hyphenation with a hyphen mark at end of the first line,
    > as implied by SHY, is not the only option, and even Latin-written languages
    > have other requirements about how a césure should be presented).
    These questions are answered in large measure in UAX#14, and in chapter 16.

    Ligatures you suppress with ZWNJ, and line break opportunities you mark
    with ZWSP (if you are interested in one that, like a SPACE, does not
    affect the end of the preceding line or the start of the next line).

    If you are interested in marking a line break opportunity such that
    the end of the preceding line or the start of the next line gets
    modified, e.g. by displaying a hyphen for a European language, then you
    use SHY. UAX#14, as you could have found out by reading it, is quite
    clear that SHY can be used for any language and any script - the effect
    is dependent on the rules of the orthography in question.
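
The difference between the two kinds of marked opportunity can be sketched with a toy greedy wrapper. This is a minimal illustration under stated assumptions, not the UAX#14 algorithm: it treats ZWSP and SHY as the only break opportunities, always shows a plain hyphen when it breaks at a SHY (the European-style convention mentioned above), and the names `visible` and `wrap` are invented for this example.

```python
ZWSP = "\u200b"  # ZERO WIDTH SPACE: break with no visible change
SHY = "\u00ad"   # SOFT HYPHEN: break that modifies the line end

def visible(text):
    """Drop the invisible break marks that were not used for a break."""
    return text.replace(ZWSP, "").replace(SHY, "")

def wrap(text, width):
    """Greedily break text at marked opportunities so lines fit in width."""
    lines = []
    while len(visible(text)) > width:
        best = None
        shown = 0  # visible characters seen before position i
        for i, ch in enumerate(text):
            if ch in (ZWSP, SHY):
                # Breaking at a SHY costs one extra column for the hyphen.
                cost = shown + (1 if ch == SHY else 0)
                if 0 < cost <= width:
                    best = i
            else:
                shown += 1
        if best is None:
            break  # no usable opportunity: leave the remainder overlong
        hyphen = "-" if text[best] == SHY else ""
        lines.append(visible(text[:best]) + hyphen)
        text = text[best + 1:]
    lines.append(visible(text))
    return lines

print(wrap("Donau\u00addampf\u00adschiff", 8))
print(wrap("Donau\u200bdampf\u200bschiff", 8))
```

With a width of 8, the SHY version yields lines ending in a hyphen, while the ZWSP version breaks at the same places without adding anything. A real implementation would choose the visible effect of a SHY break according to the orthography in question.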

    SHY and ZWNJ are needed only where automatic hyphenation and automatic
    ligation mechanisms fail to do the proper thing. In the German
    orthography, the ZWNJ would be needed more often, since the ambiguities
    are greater in placing ligatures than in hyphenation. Missing a
    possible hyphenation, or even adding an extra one, is something that the
    user will only ever see if the lines break just so. A wrong ligation
    will be displayed whenever ligation is enabled.
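
As a small illustration of the ZWNJ case, the German word "Auflage" is a commonly cited example where an "fl" ligature would span the boundary between "Auf" and "lage". The sketch below marks that position by hand; the function name `suppress_ligature` and the choice of index are assumptions made for this example, and finding such boundaries automatically would need language-specific knowledge.

```python
ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER

def suppress_ligature(word, boundary):
    """Insert ZWNJ before word[boundary] to request that no ligature form there."""
    return word[:boundary] + ZWNJ + word[boundary:]

marked = suppress_ligature("Auflage", 3)  # between "f" and "l"

print(repr(marked))
print(marked.replace(ZWNJ, "") == "Auflage")  # the letters are unchanged
```

The stored text gains one invisible character; whether the ligature is actually suppressed depends on the rendering system honoring ZWNJ.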

    You rarely actually need to supply both, but if you do, the correct way
    is for an implementation to not break a ligature across an SHY. But we
    had that result three messages ago, so I don't know what we are
    discussing here.
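
The behavior described here can be sketched as a toy ligature scan in which SHY is transparent and only ZWNJ blocks ligation. This is an illustrative model, not how any particular rendering engine works: the pair table `LIGATURE_PAIRS` and the function `ligature_candidates` are hypothetical names introduced for this example.

```python
SHY = "\u00ad"   # SOFT HYPHEN
ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER

# Hypothetical set of letter pairs that a font might ligate.
LIGATURE_PAIRS = {("f", "f"), ("f", "i"), ("f", "l")}

def ligature_candidates(text):
    """Return index pairs of letters that may ligate, looking through SHY."""
    result = []
    prev = None  # index of the previous letter, or None right after a ZWNJ
    for i, ch in enumerate(text):
        if ch == SHY:
            continue  # an unused SHY does not break up a ligature
        if ch == ZWNJ:
            prev = None  # ZWNJ explicitly blocks ligation at this position
            continue
        if prev is not None and (text[prev], ch) in LIGATURE_PAIRS:
            result.append((prev, i))
        prev = i
    return result

print(ligature_candidates("Auf\u00adlage"))        # SHY only
print(ligature_candidates("Auf\u200c\u00adlage"))  # ZWNJ and SHY together
```

In this model the SHY-only string still reports the "f" and "l" pair as a candidate, while the string carrying both marks reports none, which matches the point that the two controls are independent and can be supplied together when needed.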


    This archive was generated by hypermail 2.1.5 : Tue Jul 24 2007 - 02:55:17 CDT