From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Tue Nov 06 2007 - 18:10:27 CST
Otto Stolz wrote:
> Philippe Verdy had written:
> > In fact, the rule that determines if syllable break are
> > disallowed is based on radicals, not on syllables...
>
> From the original context, I guess, Philippe meant to write
> "the rule that determines if ligatures are disallowed...".
>
> Hence, I had replied:
> > True. (I prefer to call those radicals "constituents".)
Most probably. I wondered which term to use, but I wanted to show that the
syllabic boundary is not pertinent when looking for candidate ligatures.
Everything happens in German as if there were a space, even if it is
invisible. It could even be modelled using a zero-width space between
constituents, this space acting a bit differently from the normal space with
regard to German capitalization rules for nouns, but no differently in that
it blocks a ligature.
One way to encode it would be to use ZWSP in the middle of the word, but
this would still not be correct, as this space could break silently on
line wrapping, without hyphenation. So the separation between the
components is more like <SHY,ZWSP>: this zero-width space is even
expandable if needed, to avoid collisions of letters (for example in
"...<SHY,ZWSP>..." or "...f<SHY,ZWSP>...").
Of course there are cases where a hyphenation or break is undesirable but no
ligature should occur either; here again the solution would be to use
<ZWNBSP> instead (it disables the ligature and is expandable when needed
to avoid glyph collisions, but effectively blocks the linebreak).
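
To make the two sequences above concrete, here is a minimal Python sketch. It
is only an illustration under stated assumptions: the function name
mark_constituents, the example word "Auflage" and the boundary offset are
chosen here for illustration, and the code just inserts the code points; it
says nothing about how a given renderer will treat them.

```python
SHY = "\u00AD"     # SOFT HYPHEN
ZWSP = "\u200B"    # ZERO WIDTH SPACE
ZWNBSP = "\uFEFF"  # ZERO WIDTH NO-BREAK SPACE


def mark_constituents(word, boundaries, breakable=True):
    """Insert a control sequence at each constituent boundary.

    `boundaries` are character offsets into `word` (hypothetical input,
    supplied by hand or by some language-aware analysis).
    breakable=True  -> <SHY,ZWSP>: a line break with hyphen is allowed
    breakable=False -> <ZWNBSP>:   no line break at this place
    """
    separator = SHY + ZWSP if breakable else ZWNBSP
    pieces = []
    last = 0
    for pos in sorted(boundaries):
        pieces.append(word[last:pos])
        last = pos
    pieces.append(word[last:])
    return separator.join(pieces)


marked = mark_constituents("Auflage", [3])
# List the inserted controls by code point, since they are invisible.
print([f"U+{ord(c):04X}" for c in marked if not c.isalpha()])
```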
The main problem with this approach is that using such controls in German
texts would be a severe pain: no text is entered like this. For this reason,
the best thing you can do is not to encode any control in texts meant for
interchange.
Instead, the renderer will avoid (by default, when not knowing the language)
creating any ligature, and will not attempt to create linebreaks. It will
just try to avoid glyph collisions by acting on the interletter spacing gap.
But a renderer that KNOWS the language, and can detect the morphemic
delimitations (between components including adverbial "particles", but not
between all syllables, and not between the radical of the component and its
desinence suffix), can instruct a less smart renderer to render the document
properly, by generating controls on the fly, during document preparation.
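
As a rough sketch of such a preparation step, the following Python fragment
splits words into constituents with a tiny hand-made lexicon and joins them
with a caller-supplied control. Everything here is an assumption made for
illustration: the lexicon, the function names and the example words. A real
implementation would need a proper morphological analysis of German, and it
would have to deal with punctuation and ambiguous splits, which this toy
ignores.

```python
# Hypothetical toy lexicon of constituents (lower case).
LEXICON = {"auf", "lage", "schiff", "fahrt", "kauf", "leute"}


def split_constituents(word, lexicon=LEXICON):
    """Return a list of constituents covering `word`, or None if the
    lexicon cannot cover the whole word."""
    lower = word.lower()

    def search(start):
        if start == len(lower):
            return []
        # Try the longest candidate constituent first.
        for end in range(len(lower), start, -1):
            if lower[start:end] in lexicon:
                rest = search(end)
                if rest is not None:
                    return [word[start:end]] + rest
        return None

    return search(0)


def prepare(text, control):
    """Insert `control` between the constituents of every word that the
    lexicon can split; other words are left untouched."""
    prepared = []
    for token in text.split(" "):
        parts = split_constituents(token)
        prepared.append(control.join(parts) if parts else token)
    return " ".join(prepared)


# A visible "|" is used instead of an invisible control, for readability.
print(prepare("Die Schifffahrt", "|"))
```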
In a word processor, or typesetting application, one could manually insert
some controls when there are possibly multiple choices, and an automatic
delimitation just selects the most common/probable case, in the document
preparation process before final rendering.
The real question is then: which control can we use in such cases for:
* instructing a less capable renderer (that is blind to language semantics),
* interchanging prepared documents containing manually inserted controls (to
avoid repeating the document preparation process),
without breaking the usability of the document (for plain-text search, or
similar) and without having a spell-checker complaining about incorrect
spelling (such as missing capitals in what it may think is a word
delimitation, when it is just a component delimitation)?
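
For the plain-text search part of the question, one possible approach is to
ignore the invisible controls on both sides of the comparison. The Python
sketch below assumes a particular set of code points to ignore; that set, and
the helper names, are choices made here for illustration, not something the
standard prescribes.

```python
# Assumed set of invisible controls to ignore when searching.
INVISIBLE_CONTROLS = {
    "\u00AD",  # SOFT HYPHEN
    "\u200B",  # ZERO WIDTH SPACE
    "\u200C",  # ZERO WIDTH NON-JOINER
    "\u200D",  # ZERO WIDTH JOINER
    "\u2060",  # WORD JOINER
    "\uFEFF",  # ZERO WIDTH NO-BREAK SPACE
}


def strip_controls(text):
    """Remove the invisible controls listed above from `text`."""
    return "".join(ch for ch in text if ch not in INVISIBLE_CONTROLS)


def contains_plain(haystack, needle):
    """True if `needle` occurs in `haystack` once controls are ignored."""
    return strip_controls(needle) in strip_controls(haystack)


prepared = "Die Auf\u00AD\u200Blage ist vergriffen."
print(contains_plain(prepared, "Auflage"))
```

The same stripping could be applied before handing text to a spell-checker,
so that a component delimitation is not mistaken for a word delimitation.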
Unicode seems to have chosen ZERO WIDTH JOINER and ZERO WIDTH NON-JOINER for
such things.
Finally, the rules are a bit complex because they must coexist with the other
rules governing hyphenation (which are not just based on syllable breaks,
but also depend on other semantic-only rules that prohibit some
undesirable breaks, and on other typographical rules defining the minimum
length allowed for broken syllables, which may be defined in terms of a
minimum count of letters, a minimum count of final grapheme clusters
including ligatures, or a minimum physical length on the rendering device):
SHY is there for indicating allowed breaks (and it is most useful only for
less capable renderers that DO NOT allow any break by default without it).
But we need something else for instructing exactly the opposite (undesired
breaks), as well as another way to indicate a break that requires a
character other than a hyphen (the Catalan case can be inferred, so that
middle-dot+SHY will allow linebreaking, but without inserting the additional
hyphen).
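
The interplay between candidate syllable breaks, prohibited breaks and the
minimum-length rule can be sketched as a simple filter. In this Python
fragment the thresholds (2 letters before, 3 after), the function name and
the example offsets are all assumptions, and length is counted in code
points, not in grapheme clusters or physical width.

```python
def allowed_breaks(word, candidates, min_before=2, min_after=3, prohibited=()):
    """Keep only the candidate break offsets that leave at least
    `min_before` characters before the break and `min_after` after it,
    and that are not explicitly prohibited."""
    return [
        pos
        for pos in sorted(candidates)
        if pos >= min_before
        and len(word) - pos >= min_after
        and pos not in prohibited
    ]


# Candidate syllable breaks for "Auflage" (Auf-la-ge), given by hand.
print(allowed_breaks("Auflage", [3, 5]))
```

With these assumed thresholds the break before the final two letters is
dropped, because it would leave fewer than three characters on the next line.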
The set of allowed component breaks (where ligatures are blocked) is NOT the
same as the set of syllable breaks (where hyphenation or linebreaking is
allowed).
Anyway, the joiner/non-joiner controls (governing the formation or
prohibition of ligatures) should be coded separately from the syllable break
controls: syllable break controls should be coded before the joining
controls if both occur at the same place in a word.
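
The ordering suggested above can be sketched as follows in Python. The two
sets of offsets are kept separate and are given by hand; the use of ZERO
WIDTH NON-JOINER as the ligature-blocking control, the function name and the
example word are assumptions made for this illustration only.

```python
SHY = "\u00AD"   # SOFT HYPHEN: allowed syllable break
ZWNJ = "\u200C"  # ZERO WIDTH NON-JOINER: assumed ligature-blocking control


def insert_controls(word, syllable_breaks, component_breaks, component_control=ZWNJ):
    """Insert SHY at syllable breaks and `component_control` at component
    breaks. Where both fall at the same offset, SHY is emitted first."""
    out = []
    for index, char in enumerate(word):
        if index in syllable_breaks:
            out.append(SHY)
        if index in component_breaks:
            out.append(component_control)
        out.append(char)
    return "".join(out)


# "Auflage": syllables Auf-la-ge (offsets 3 and 5), components Auf|lage (offset 3).
encoded = insert_controls("Auflage", {3, 5}, {3})
print([f"U+{ord(c):04X}" for c in encoded if not c.isalpha()])
```

Note that the two sets differ: offset 5 is a syllable break but not a
component break, so it receives SHY only.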
However, Unicode says absolutely nothing about syllable breaks and component
breaks: they are locale-dependent and style-dependent, so there is no
algorithm to determine them, not even to determine a minimum set of allowed
or prohibited breaks. Between the default grapheme boundaries and the full
word boundaries, which are standardized with a good default algorithm (one
that produces correct results in most cases but can still be tailored, or
controlled externally by inserting some controls in the encoded texts so
that they work with these algorithms without requiring tailoring), there are
these two *distinct* (undescribed) levels of boundaries.