Re: Hyphenation Points

From: Otto Stolz (Otto.Stolz@uni-konstanz.de)
Date: Mon Jan 13 1997 - 07:14:06 EST


On Saturday, January 04, 1997 10:20 AM, Mirko Raner wrote:
> 2. How can hyphenation at a certain position be forced?
> Our idea is to insert a "hint character" at which a word is hyphenated
> when it is split due to line boundaries; the user might want to prescribe
> the hyphenations of a word manually. Unicode provides several hyphen
> characters, but the one we need, is an invisible (zero-width) character
> which simply tells the hyphenation algorithm to hyphenate after this
> position (if hyphenation is necessary).

I think the SOFT HYPHEN (U+00AD) fits this description, though I could
not find a description of its semantics in ISO/IEC 10646-1:1993 (neither
in clause 21, "Special Characters", nor in annex D, "Alternate Format
Characters"). Annex D (which is declared informative rather than
normative) says that the dotted box used to represent this character in
table 2 of clause 25 means that the character does not have a printable
graphic symbol.

The SOFT HYPHEN character apparently stems from ISO 8859-1 (as it
is in ISO 10646's LATIN-1 SUPPLEMENT block). The ISO 8859 series
describes its semantics thus:
  A graphic character that is imaged by a graphic symbol identical
  with, or similar to, that representing HYPHEN, for use when a
  line break has been established within a word.

> 1. How can we prohibit a hyphenation at a certain position in a word?
> Sometimes it can be necessary to insert a special character eg in order
> to prevent the hyphenation algorithm from splitting some proper name at a
> wrong position.

I think this is an unsuitable approach: if you mark a particular place
in a word as not being a hyphenation point, how can you be sure that the
hyphenation algorithm will not break the word in some other, even more
inappropriate place? I think a feasible way for a text processing
application is to use the soft hyphens supplied with the data for all
words that contain any soft hyphens, and to use algorithmic hyphenation
only for those words that contain no soft hyphen characters at all. Thus,
you can mark an entire word as not to be hyphenated by preceding it with a
soft hyphen character (provided the application will never display a hyphen
for a SOFT HYPHEN at the beginning of a word).
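
The policy described above can be sketched in a few lines of Python. This
is only an illustration under stated assumptions: the name break_points and
the "algorithmic" callback are hypothetical, and the callback stands in for
whatever hyphenation algorithm the application already uses.

```python
SOFT_HYPHEN = "\u00ad"


def break_points(word, algorithmic=lambda w: []):
    """Return (word without soft hyphens, list of allowed break offsets).

    An offset n means "the word may be split after its first n letters".
    `algorithmic` is a placeholder for the application's own hyphenator.
    """
    if SOFT_HYPHEN not in word:
        # No manual hints at all: fall back to the algorithm.
        return word, sorted(algorithmic(word))

    plain = word.replace(SOFT_HYPHEN, "")
    points = []
    offset = 0  # number of real letters seen so far
    for ch in word:
        if ch == SOFT_HYPHEN:
            # A soft hyphen at the very start (or end) yields no break,
            # so a leading soft hyphen marks the word as unbreakable.
            if 0 < offset < len(plain) and offset not in points:
                points.append(offset)
        else:
            offset += 1
    return plain, points
```

With this sketch, "Radio" + SOFT HYPHEN + "sendung" yields the single break
offset 5, and a word that merely starts with a SOFT HYPHEN yields no break
offsets, whatever the algorithm would have proposed.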

An even better approach would be to allow hyphenation points of various
priorities; you would have to assign some Private Use characters for the
minor hyphenation points to implement this. Thus, the rendering process
could try to minimise the use of "minor" hyphenation points in a whole
paragraph, preferring word boundaries over hyphenation points, and "good"
hyphenation points over "minor" ones. E.g., in German, the constituents
of a compound word would make for "good", or "preferred", hyphenation
points, the common prefixes would make for secondary ones, and the
syllable boundaries within word constituents would make for "minor", or
tertiary, hyphenation points, as in
   Sil_ben=tren_nungs=pro-gramm (i.e. hyphenation program),
where "=" denotes a primary, "-" a secondary, and "_" a tertiary hyphenation
point.
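
A rough Python sketch of the priority idea, using the "=", "-" and "_"
notation of the example rather than real Private Use characters. The names
parse_marked and best_break are hypothetical, and the selection rule is a
deliberate simplification: it looks at one word and one line only, whereas
the scheme above would weigh the choices over a whole paragraph.

```python
# Lower number = better hyphenation point.
PRIORITY = {"=": 1, "-": 2, "_": 3}


def parse_marked(marked):
    """Split a marked word into (plain word, {offset: priority})."""
    letters = []
    points = {}
    for ch in marked:
        if ch in PRIORITY:
            # The break lies after the letters collected so far.
            points[len(letters)] = PRIORITY[ch]
        else:
            letters.append(ch)
    return "".join(letters), points


def best_break(points, room):
    """Pick a break offset whose prefix plus a hyphen fits into `room`.

    Prefers the best priority; among equals, the longest prefix.
    Returns None if no hyphenation point fits.
    """
    fitting = [(prio, -pos) for pos, prio in points.items() if pos + 1 <= room]
    if not fitting:
        return None
    _, neg_pos = min(fitting)
    return -neg_pos
```

For the example word, parse_marked gives break offsets 3, 6, 10, 15 and 18.
With room for 12 characters, best_break chooses offset 6 ("Silben-"), a
primary point, even though the tertiary point at offset 10 would fill the
line better.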

As observed by Mirko Raner, some possible hyphenation points should
be deprecated, e.g.
   Ra_di_o=sen#dung (i.e. radio transmission),
where "#" denotes a deprecated hyphenation point, as "Radiosen-
dung" would be highly misleading -- even to English speaking readers :-)
This can be handled by simply marking only the primary hyphenation point
(provided the application does not algorithmically hyphenate a word
containing a SOFT HYPHEN, as suggested above).

Aside: in some cases, the hyphenation could even depend on the meaning,
e.g. in
   Drucker=zeugnis (i.e. printer's testimonial) vs.
   Druck=erzeugnis (i.e. printing product);
in cases like this, German orthography rules recommend (but do not require)
spelling explicit hyphens, as in "Druck-Erzeugnis". This case could also
be handled by marking the primary hyphenation point only.

A related topic is the rendering of ligatures. In typesetting German, a
ligature may not be used across possible hyphenation points. E.g., in
   "Grifflasche" ("Griff=lasche", i.e. gripping strap), you have to typeset
                 an "ff" ligature, but neither an "ffl" nor an "fl" ligature;
                 whilst, in
   "Giftflasche" ("Gift=flasche", i.e. poison phial), you have to typeset
                 an "ft" plus an "fl" ligature.
I think the ZERO WIDTH NON-JOINER (U+200C) is meant to inhibit a possible
ligature (though annex D of ISO 10646 only refers to cursive connections).
For German, at least, a SOFT HYPHEN should have the same effect. (I do not
know the typesetting rules for other languages, though.)
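
To illustrate the ligature rule, here is a small Python sketch that treats
both SOFT HYPHEN and ZERO WIDTH NON-JOINER as ligature barriers. The name
glyph_clusters and the ligature list are assumptions made for the example;
a real font would supply its own set, and the result is a list of letter
groups rather than actual glyphs.

```python
import re

SOFT_HYPHEN = "\u00ad"
ZWNJ = "\u200c"

# Illustrative ligature set, longest first so "ffl" wins over "ff".
LIGATURES = ("ffl", "ffi", "ff", "fl", "fi", "ft")


def glyph_clusters(word):
    """Group letters into ligature clusters, never across a barrier."""
    clusters = []
    # Split at SOFT HYPHEN or ZWNJ; ligatures form only inside a segment.
    for segment in re.split(f"[{SOFT_HYPHEN}{ZWNJ}]", word):
        i = 0
        while i < len(segment):
            for lig in LIGATURES:
                if segment.startswith(lig, i):
                    clusters.append(lig)
                    i += len(lig)
                    break
            else:
                # No ligature starts here: emit the single letter.
                clusters.append(segment[i])
                i += 1
    return clusters
```

With a SOFT HYPHEN after "Griff", the word comes out with an "ff" cluster
followed by a separate "l"; without any mark, this naive matcher would form
"ffl", which is exactly the case the rule forbids.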

Best wishes,
   Otto Stolz


