Answer to L2/02-426, SC2/WG2 N2531

Eric Muller, Adobe Systems Inc.
December 2, 2002

Document History

In a number of languages, hyphenation causes spelling changes. For example, the Swedish word tuggummi is hyphenated tugg/gummi.

The US comments on FPDAM 2 to ISO/IEC 10646-1:2000 (in L2/02-408) propose that when a word includes a U+00AD SOFT HYPHEN, the non-hyphenated spelling be encoded, e.g. tug<SHY>gummi, and that it is up to the rendering engine to determine what change of spelling, if any, is necessary.
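As a rough illustration of that division of labor, here is a minimal Python sketch of a rendering engine that changes the spelling only when a line actually breaks at the SHY. The table `BREAK_SPELLINGS`, the function `break_at_soft_hyphen`, and the orthography tag `"sv"` are hypothetical names introduced for this example; a real engine would draw on a hyphenation dictionary, and might display U+2010 HYPHEN rather than the ASCII hyphen used here.

```python
SHY = "\u00ad"  # U+00AD SOFT HYPHEN

# Hypothetical per-orthography table: maps the (before, after) parts of the
# encoded, non-hyphenated spelling to the spelling shown when the line breaks.
BREAK_SPELLINGS = {
    "sv": {("tug", "gummi"): ("tugg", "gummi")},
}

def break_at_soft_hyphen(word, orthography):
    """Return (end of line, start of next line) for a break at the first SHY.

    Returns None if the word contains no SHY. Only the first SHY is
    considered, to keep the sketch short.
    """
    before, sep, after = word.partition(SHY)
    if not sep:
        return None
    table = BREAK_SPELLINGS.get(orthography, {})
    # Fall back to the encoded spelling when no spelling change is known.
    before, after = table.get((before, after), (before, after))
    return before + "-", after

print(break_at_soft_hyphen("tug" + SHY + "gummi", "sv"))
```

With the Swedish entry present, the break restores the third g; for an orthography the engine does not know, the encoded spelling is simply split at the SHY.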

Kent Karlsson proposes in L2/02-426, SC2/WG2 N2531, that instead the hyphenated spelling be encoded, e.g. tugg<SHY>gummi. The main argument for this approach is that it is possible to have simple rules to create the non-hyphenated spelling from the hyphenated spelling, while the converse is not possible.

Indeed, the converse cannot be achieved by simple rules: in pre-reform German effektvoll hyphenates between the two f without introducing a third f, while Schiffahrt introduces a third f.

However, the proposed rules do not work for Dutch and Catalan. Here are some hyphenations, what the proposed rules would compute, and what the result should be:

Language         Hyphenated spelling  Computation by proposed rules  Correct
Dutch (old)      financie<SHY>en      financieen                     financiën
Dutch (old)      idee<SHY>en          ideen                          ideeën
Dutch (old,new)  souper<SHY>tje       soupertje                      soupeetje
Dutch (old,new)  opa<SHY>tje          opatje                         opaatje
Dutch (new)      ange<SHY>erfde       angeerfde                      angeërfde
Dutch (new)      alibi<SHY>tje        alibitje                       alibietje
Dutch (new)      cafe<SHY>tje         cafetje                        cafeetje
Dutch (new)      depot<SHY>tje        depottje                       depootje
Dutch (new)      chalet<SHY>tje       chalettje                      chaletje
Dutch (new)      hobby<SHY>tje        hobbytje                       hobby’tje
Catalan          al<SHY>legro         allegro                        al·legro

It should be clear from those examples that the proposed set of rules cannot be extended to cover those cases, even if the extensions are specific to the Dutch orthographies.
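The failure can be reproduced with a small Python sketch. The function `unhyphenate` below is a hypothetical stand-in, not the rule set of N2531 (which is not reproduced in this document): it drops the SHY and, where the break would leave three identical letters in a row, keeps only two. For the rows selected here it yields the same results as the "Computation by proposed rules" column.

```python
import re

SHY = "\u00ad"  # U+00AD SOFT HYPHEN

def unhyphenate(hyphenated: str) -> str:
    """Hypothetical simple rule: derive the unbroken spelling from the
    hyphenated spelling. An assumption for illustration only."""
    # XX<SHY>X becomes XX: keep two of three identical letters at the break.
    collapsed = re.sub(r"(\w)\1" + SHY + r"\1", r"\1\1", hyphenated)
    # Any remaining SHY is simply dropped.
    return collapsed.replace(SHY, "")

# (hyphenated spelling, correct non-hyphenated spelling), from the table above
ROWS = [
    ("financie" + SHY + "en", "financiën"),
    ("idee" + SHY + "en", "ideeën"),
    ("opa" + SHY + "tje", "opaatje"),
    ("depot" + SHY + "tje", "depootje"),
    ("al" + SHY + "legro", "al·legro"),
]

for hyphenated, correct in ROWS:
    computed = unhyphenate(hyphenated)
    status = "ok" if computed == correct else "WRONG"
    print(f"{computed:12} expected {correct:12} {status}")
```

The same function does handle the Swedish case, turning tugg<SHY>gummi into tuggummi, which is what makes such rules look attractive at first.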

The recent German reform also creates some complications: the hyphenated form Sauerstoff-/Flasche corresponds to Sauerstofflasche (two f) in pre-reform spelling and to Sauerstoffflasche (three f) in post-reform spelling. N2531 solves that by including a U+200D ZERO WIDTH JOINER in the encoding of the post-reform form, to disable the rule that removes a consonant: Sauerstof<ZWJ>f<SHY>Flasche. This creates quite a burden on document authors, not to mention that the ZWJ has other effects on the rendering, such as encouraging the use of a ligated form for the characters it separates.

In the end, there is no set of simple rules to reliably compute one form from the other, especially when the orthography is unknown.

Encoding the non-hyphenated spelling works well for minimal rendering systems, which can simply ignore U+00AD SOFT HYPHEN characters, without any further processing and without having to know the orthography.
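A minimal sketch in Python of what such a system has to do; the function name `render_without_hyphenation` is hypothetical and chosen only for this example:

```python
SHY = "\u00ad"  # U+00AD SOFT HYPHEN

def render_without_hyphenation(text: str) -> str:
    """Drop every soft hyphen; no orthography knowledge is needed."""
    return text.replace(SHY, "")

print(render_without_hyphenation("tug" + SHY + "gummi"))
```

Because the encoded text is already the non-hyphenated spelling, removing the SHY is the whole job, whatever the language.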

Furthermore, only when the author of a document also controls its layout can the actual hyphenations coincide with the places where the author inserted a SHY, so that the imperfections of any set of rules (one way or the other) are masked. As soon as that control is not there (e.g. web pages), the vast majority of word occurrences will not be hyphenated.

Thus, it seems that encoding the non-hyphenated form is the best choice.

N2531 also touches on the U+058A ARMENIAN HYPHEN and U+1806 MONGOLIAN TODO SOFT HYPHEN. The proposal in the US comments is to make those characters behave just like U+2010 HYPHEN: if present in a document, they create break opportunities. A document author can indicate preferred hyphenation positions in Armenian and Mongolian text by using U+00AD SOFT HYPHEN, just as in any other text. It is up to the rendering engine to decide which hyphen shape to use, based on the orthography.

