RE: German ligatures

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Tue Nov 06 2007 - 18:10:27 CST

    Otto Stolz wrote:
    > Philippe Verdy had written:
    > > In fact, the rule that determines if syllable break are
    > > disallowed is based on radicals, not on syllables...
    >
    > From the original context, I guess, Philippe meant to write
    > "the rule that determines if ligatures are disallowed...".
    >
    > Hence, I had replied:
    > > True. (I prefer to call those radicals "constituents".)

    Most probably, I wondered which term to use, but I wanted to show that the
    syllabic boundary is not pertinent when looking for candidate ligatures.

    Everything happens in German as if there were a space, even if it is
    invisible. It could even be modelled using a zero-width space between
    constituents, this space acting a bit differently from the normal space with
    regard to German capitalization rules for nouns, but no differently in that
    it blocks a ligature.

    One way to encode it would be to use ZWSP in the middle of the word, but
    this would still not be correct, as this space could break silently in case
    of linewrapping, without hyphenation. So the separation between the
    components is more like <SHY,ZWSP>: this zero-width space is even
    expansible if needed, to avoid collisions of letters (for example in
    "...<SHY,ZWSP>..." or "...f<SHY,ZWSP>...").

    Of course there are cases where hyphenation or a break is undesirable, but
    no ligature should occur either; here again the solution would be to use
    <ZWNBSP> instead (it disables the ligature and is expansible when needed
    to avoid glyph collisions, but effectively blocks the linebreak).
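
    A minimal Python sketch of the two encodings just described. The helper
    name join_constituents and the sample word "Auflage" (Auf + lage) are
    illustrative assumptions, and whether a given renderer really blocks the
    ligature at these controls is part of the proposal, not something the
    code can check.

```python
SHY = "\u00AD"     # SOFT HYPHEN: marks an allowed hyphenation point
ZWSP = "\u200B"    # ZERO WIDTH SPACE: invisible, allows a line break
ZWNBSP = "\uFEFF"  # ZERO WIDTH NO-BREAK SPACE: invisible, forbids a line break


def join_constituents(parts, allow_break=True):
    """Join word constituents with an invisible separator.

    allow_break=True  -> <SHY,ZWSP>: hyphenated break allowed at the boundary
    allow_break=False -> <ZWNBSP>:   no break allowed at the boundary
    In the scheme described above, both are meant to keep a ligature
    from forming across the boundary.
    """
    separator = SHY + ZWSP if allow_break else ZWNBSP
    return separator.join(parts)


# Hypothetical example word: "Auflage" = "Auf" + "lage" (no fl ligature wanted).
breakable = join_constituents(["Auf", "lage"])
unbreakable = join_constituents(["Auf", "lage"], allow_break=False)

# Show the code points, since the separators are invisible when printed.
print([f"U+{ord(c):04X}" for c in breakable])
print([f"U+{ord(c):04X}" for c in unbreakable])
```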

    The main problem with this approach is that using such controls in German
    texts would be a severe pain: no text is entered like this. For this reason,
    the best thing you can do is to not encode any control in texts meant for
    interchange.

    Instead, the renderer will avoid (by default, when not knowing the language)
    creating any ligature, and will not attempt to create linebreaks. It will
    just try to avoid glyph collisions by acting on the interletter spacing gap.

    But a renderer that KNOWS the language, and can detect the morphemic
    delimitations (between components including adverbial "particles", but not
    between all syllables, and not between the radical of the component and its
    desinence suffix), can instruct a less smart renderer to render the document
    properly, by generating controls on the fly, during document preparation.
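
    The preparation step could be sketched roughly as below, in Python. The
    tiny lexicon, the function name prepare, and the choice of <ZWNBSP> as
    the default separator are all assumptions made for illustration; a real
    language-aware renderer would need proper morphological analysis rather
    than a lookup table.

```python
import re

ZWNBSP = "\uFEFF"  # ZERO WIDTH NO-BREAK SPACE

# Hypothetical toy lexicon: surface word -> its constituents.
CONSTITUENTS = {
    "Auflage": ("Auf", "lage"),
    "Schilfinsel": ("Schilf", "insel"),
}


def prepare(text, separator=ZWNBSP, lexicon=CONSTITUENTS):
    """Insert `separator` at known constituent boundaries.

    Words that are not in the lexicon are left untouched, which matches
    the default of encoding no control at all.
    """
    def mark(match):
        word = match.group(0)
        parts = lexicon.get(word)
        return separator.join(parts) if parts else word

    return re.sub(r"\w+", mark, text)


prepared = prepare("Die Auflage ist neu.")
print(ascii(prepared))
```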

    In a word processor, or typesetting application, one could manually insert
    some controls when there are possibly multiple choices, and an automatic
    delimitation just selects the most common/probable case, in the document
    preparation process before final rendering.

    The good question is then: which control can we use in such cases for:
    * instructing a less capable renderer (that is blind to language semantics),
    * interchanging prepared documents containing manually inserted controls (to
    avoid repeating the document preparation process),
    without breaking the usability of the document (for plain-text search, or
    similar) and without having a spell-checker complaining about incorrect
    spelling (such as missing capitals in what it may think is a word
    delimitation, when it is just a component delimitation)?
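
    On the search side, one possible workaround (a sketch, not something
    Unicode prescribes) is to remove the invisible controls from both the
    document and the query before comparing. The set of code points below is
    my own selection, and stripping ZWSP is only safe for text where it is
    not used as a real word separator.

```python
# Invisible controls discussed in this thread, plus ZWNJ, ZWJ and WJ.
# This selection is an assumption for illustration, not a standard list.
INVISIBLE_CONTROLS = "\u00AD\u200B\u200C\u200D\u2060\uFEFF"
_STRIP_TABLE = {ord(c): None for c in INVISIBLE_CONTROLS}


def strip_controls(text):
    """Return `text` with the invisible controls removed."""
    return text.translate(_STRIP_TABLE)


def contains(haystack, needle):
    """Plain-text search that ignores the invisible controls on both sides."""
    return strip_controls(needle) in strip_controls(haystack)


print(contains("Die Auf\u00AD\u200Blage ist neu.", "Auflage"))
```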

    Unicode seems to have chosen WORD JOINER and WORD NON JOINER for such things.

    Finally, the rules are a bit complex because they must coexist with the
    other rules governing hyphenation (which are not just based on syllable
    breaks, but also depend on other semantic-only rules that prohibit some
    undesirable breaks, and on other typographical rules defining the minimum
    length allowed for broken syllables, which may be defined in terms of a
    minimum count of letters, a minimum count of final grapheme clusters
    including ligatures, or a minimum physical length on the rendering device):

    SHY is there for indicating allowed breaks (and it is most useful only for
    less capable renderers that DO NOT allow any break by default without it).
    But we need something else for instructing exactly the opposite (undesired
    breaks), as well as another way to indicate a break that requires another
    character than a hyphen (the Catalan case can be inferred, so that
    middle-dot+SHY will allow linebreaking, but where the additional hyphen is
    not inserted).

    The set of allowed component breaks (where ligatures are blocked) is NOT the
    same as the set of syllable breaks (where hyphenation or linebreaking is
    allowed).

    Anyway, the word joiner/disjoiner controls (governing the formation or
    prohibition of ligature) should be coded separately from the syllable break
    controls: syllable break controls should be coded before the word joining
    controls if both occur at the same place in a word.
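
    A small Python sketch of that ordering, with the two kinds of break kept
    as separate sets of offsets. The function name annotate, the sample
    offsets for "Auflage", and the use of U+200C ZERO WIDTH NON-JOINER as the
    ligature-blocking control are assumptions for illustration; any other
    control can be passed as `blocker`.

```python
SHY = "\u00AD"   # SOFT HYPHEN: syllable (hyphenation) break control
ZWNJ = "\u200C"  # ZERO WIDTH NON-JOINER: assumed ligature-blocking control


def annotate(word, syllable_breaks, component_breaks, blocker=ZWNJ):
    """Insert break controls into `word`.

    Each offset i means "before word[i]". The two sets are independent:
    neither has to be a subset of the other. When both apply at the same
    offset, the syllable break control is emitted first.
    """
    out = []
    for i, ch in enumerate(word):
        if i in syllable_breaks:
            out.append(SHY)
        if i in component_breaks:
            out.append(blocker)
        out.append(ch)
    return "".join(out)


# Hypothetical analysis: syllables Auf|la|ge, components Auf|lage.
marked = annotate("Auflage", syllable_breaks={3, 5}, component_breaks={3})
print(ascii(marked))
```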

    However, Unicode says absolutely nothing about syllable breaks and component
    breaks: they are locale-dependent and style-dependent, so there's no
    algorithm to determine them, not even to determine a minimum set of allowed
    or prohibited breaks. Between the default grapheme boundaries and the full
    word boundaries, both standardized with a good default algorithm (which
    produces correct results in most cases but can still be tailored, or
    controlled externally by inserting some controls in the encoded texts so
    that they work with these algorithms without requiring tailoring), there are
    these two *distinct* (undescribed) levels of boundaries.


