Yes, SOFT HYPHEN is a hard problem

L2/02-279

Eric Muller, Adobe Systems Inc.
August 14, 2002

1.	What Unicode 3.2 says
2.	Interpreting Unicode 3.2
3.	Another interpretation
4.	Rendering hyphenation
5.	How to represent and manipulate hyphenation?
6.	The problems
7.	Moving forward

Document History

1. What Unicode 3.2 says

The character U+00AD SOFT HYPHEN is described as:

U+00AD SOFT HYPHEN indicates a hyphenation point, where a line-break is preferred when a word is to be hyphenated. Depending on the script, the visible rendering of this character when a line-break occurs may differ (for example, in some scripts it is rendered as a hyphen -, while in others it may be invisible).
Unicode 3.0, Section 13.2, page 315

SHY is rendered invisibly and has no width, except at a line break. The rendering of the soft hyphen depends on the script. For the Latin script it is rendered as a hyphen, however, some languages require a change in spelling surrounding an optional hyphen, if it occurs at a line break. For example in Swedish the word “tuggummi” changes to “tugg-gummi” when hyphenated.
UAX#14, section 5.1

00AD;SOFT HYPHEN;Pd;0;ON;;;;;N;;;;;
UCD

Two other characters are similar. The first is U+058A ARMENIAN HYPHEN:

The behavior of the latter character [U+058A] is similar to U+00AD SOFT HYPHEN. It is used to indicate a line-breaking opportunity within a polysyllabic Armenian word. Its shape distinguishes it from the soft hyphen.
Unicode 3.0, Section 7.4, page 172

Hyphens are graphic characters with width. Since, unlike spaces, they print, they are included in the measured part of the preceding line
UAX#14, section 5.1

058A;ARMENIAN HYPHEN;Pd;0;ON;;;;;N;;;;;
UCD

The other is U+1806 MONGOLIAN TODO SOFT HYPHEN:

In Mongolian Todo text, U+1806 MONGOLIAN TODO SOFT HYPHEN is used at the beginning of the second line to indicate resumption of the broken word. It functions like a U+00AD SOFT HYPHEN, except that this version appears at the beginning of a line rather than at the end.
Unicode 3.0, Section 11.4, page 290

The Mongolian Todo soft hyphen indicates an optional line break opportunity with hyphen, but unlike the soft-hyphen it stays with the following line.
UAX#14, section 5.1

1806;MONGOLIAN TODO SOFT HYPHEN;Pd;0;ON;;;;;N;;;;;
UCD
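
The three UCD records quoted above differ only in code point and name. As a minimal sketch, assuming the usual 15-field layout of UnicodeData.txt (the field names and the helper `parse_record` are illustrative, not part of the UCD), the records can be compared field by field:

```python
# Illustrative names for the 15 semicolon-separated fields of a record.
FIELDS = [
    "code", "name", "general_category", "combining_class", "bidi_class",
    "decomposition", "decimal", "digit", "numeric", "mirrored",
    "unicode_1_name", "iso_comment", "uppercase", "lowercase", "titlecase",
]

# The three records exactly as quoted in this section.
RECORDS = [
    "00AD;SOFT HYPHEN;Pd;0;ON;;;;;N;;;;;",
    "058A;ARMENIAN HYPHEN;Pd;0;ON;;;;;N;;;;;",
    "1806;MONGOLIAN TODO SOFT HYPHEN;Pd;0;ON;;;;;N;;;;;",
]

def parse_record(line):
    """Split one record into a dict keyed by field name (hypothetical helper)."""
    values = line.split(";")
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    return dict(zip(FIELDS, values))

parsed = [parse_record(r) for r in RECORDS]

# Collect every property (other than code point and name) on which
# the three records agree.
shared = {
    key for key in FIELDS[2:]
    if len({p[key] for p in parsed}) == 1
}
```

In these records all three characters carry the general category Pd, so the property data alone does not distinguish their behavior.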

2. Interpreting Unicode 3.2

Consider U+058A and U+1806: it seems clear that those two characters can be rendered, and that when they are, they must be rendered with specific shapes at specific positions. This is the most likely interpretation that is compatible with the existence of two separate characters, with the Unicode 3.0 text, with the code charts and with the Pd category.

I would like to stress the choice of words here: “these characters can be rendered” is much stronger than “these characters can have an effect on the rendering of the string in which they are present.”

Those characters are contrasted with U+00AD, and the contrast is expressed only in terms of rendering. Also, the wording of Unicode 3.0, “the visible rendering of this character”, and the wording of UAX#14, which could be paraphrased as “SHY is rendered visibly and has width at a line break”, strongly suggest that U+00AD should also be understood as generally rendered, with the possible degenerate case of no ink in some scripts.

At this point, one may understand those three characters as having two functions: the first is to indicate in plain text a possible hyphenation point; the second is to specify a shape and/or placement for ink that depicts hyphenation, should the rendering system decide to hyphenate at that position.

The plot thickens a little when one asks “when U+00AD is visible, what is the shape and/or position of its rendering?” Unicode 3.0 is deliciously ambiguous on that point, the code chart is silent, and UAX#14 helpfully points out the slippery slope (i.e. that U+00AD can have a visual effect on other characters as well).

3. Another interpretation

Jukka Korpela, in “Soft hyphen (SHY) - a hard problem?” (http://www.cs.tut.fi/~jkorpela/shy.html), noted the description of the soft hyphen in ISO 8859-1:

A graphic character that is imaged by a graphic symbol identical with, or similar to, that representing hyphen, for use when a line break has been established within a word.

He then concludes:

Thus, soft hyphen is a visible (graphic) character, not an invisible hyphenation hint. Soft hyphen is not related to any word division process to be applied to the text but may indicate what has happened in such a process when the text was produced.

He also reports what other standards say, and how various applications handle U+00AD.

Kent Karlsson addressed that interpretation in WG 3 document N 506, “SOFT HYPHEN and some other characters” (http://anubis.dkuug.dk/JTC1/SC2/WG3/docs/n506.pdf) and proposed to clarify soft hyphen:

SOFT HYPHEN (00AD): SOFT HYPHEN (SHY) allows an automatic line break to be established just after it (like ZERO WIDTH SPACE). SOFT HYPHEN is imaged by a graphic symbol identical with that representing HYPHEN when an automatic line break has been established just after it, or if it is directly followed by an explicit line break (including end-of-string). When an automatic line break has not been established just after it, nor is it followed by an explicit line break, the SOFT HYPHEN is not rendered and has zero width.
Note: In certain combinations, e.g., webb<SHY>besökare, the SOFT HYPHEN can in addition suppress the letter following the SOFT HYPHEN when the SOFT HYPHEN is not rendered (e.g. webbesökare). Such behaviour is similar to automatic ligature formation.
MONGOLIAN TODO SOFT HYPHEN (1806): MONGOLIAN TODO SOFT HYPHEN allows an automatic line break to be established just before it. MONGOLIAN TODO SOFT HYPHEN is imaged by a graphic symbol identical with that representing HYPHEN when an automatic line break has been established just before it, or if it is directly preceded by an explicit line break (including beginning-of-string). When an automatic line break has not been established just before it, nor is it preceded by an explicit line break, the MONGOLIAN TODO SOFT HYPHEN is not rendered and has zero width.
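
The proposed wording above amounts to a small rendering rule. The following is only a sketch of that rule, assuming the text has already been split into lines (so that a break is established just before the first and just after the last character of each line); the function name `render_line` is hypothetical, and the letter suppression described in the note is not modeled:

```python
SHY = "\u00ad"       # SOFT HYPHEN
TODO_SHY = "\u1806"  # MONGOLIAN TODO SOFT HYPHEN
HYPHEN = "\u2010"    # HYPHEN

def render_line(line):
    """Render one already-broken line under the proposed rule.

    `line` is the text between two line breaks (automatic or explicit).
    """
    out = []
    last = len(line) - 1
    for i, ch in enumerate(line):
        if ch == SHY:
            # Imaged as HYPHEN only when a break directly follows it;
            # otherwise not rendered and zero width.
            if i == last:
                out.append(HYPHEN)
        elif ch == TODO_SHY:
            # Imaged as HYPHEN only when a break directly precedes it;
            # otherwise not rendered and zero width.
            if i == 0:
                out.append(HYPHEN)
        else:
            out.append(ch)
    return "".join(out)

# A word broken after its second SOFT HYPHEN.
lines = ["hy" + SHY + "phen" + SHY, "a" + SHY + "tion"]
rendered = [render_line(l) for l in lines]
```

Here the first line ends in a visible hyphen and every other SOFT HYPHEN disappears.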

4. Rendering hyphenation

When a word is actually hyphenated, the rendering can take many forms:

In most cases, some mark is added at the end of the first line, and nothing is added to the second line. That mark can take many different shapes, including the usual hyphen, a two-stroke hyphen, and the Armenian hyphen.
I believe that in some orthographies, a mark is added both at the end of the first line and at the beginning of the second.
In Mongolian, a mark is added at the beginning of the second line only.
In some cases, no mark is added.

When a mark is added at the end of the first line, and the text is right-justified, the mark is often included in the measure of the line, that is, its right side is aligned with the right margin. However, there are typographic styles in which it is not included in the measure and the mark is placed in the margin (along with other punctuation marks); this is called hanging punctuation. A similar situation could exist for marks at the beginning of the second line in left-justified text.

In addition, the spelling of the hyphenated word can be altered:

in traditional German, a “c” before the hyphenation point can change into a “k”: “Drucker” hyphenates into “Druk” / “ker”
in modern Dutch, a “ë” after the hyphenation point can change into a simple “e”: “angeërfde” hyphenates into “ange” / “erfde”
in German and Swedish, a consonant is sometimes doubled: “tuggummi” hyphenates into “tugg” / “gummi”
in Dutch, a letter can disappear: “opaatje” hyphenates into “opa” / “tje”

It is a safe bet that this does not cover all the peculiarities of rendering hyphenation.
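
One way to picture these spelling changes is as a hyphenation point with three alternative texts: what ends the first line, what starts the second line, and what appears when no break is taken. This is the shape of TeX's `\discretionary` primitive. The sketch below is only an illustration under that assumption; the class `Discretionary`, the function `render` and the choice of "-" as the visible mark are all hypothetical:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Discretionary:
    """One hyphenation point with its three alternative renderings."""
    pre_break: str   # text ending the first line when the break is taken
    post_break: str  # text starting the second line when the break is taken
    no_break: str    # text used when the word is not hyphenated here

def render(parts, break_at=None):
    """Return the lines for a word made of strings and Discretionary items.

    `break_at` is the index in `parts` of the discretionary to break at,
    or None to leave the word whole.
    """
    lines = [""]
    for i, part in enumerate(parts):
        if isinstance(part, str):
            lines[-1] += part
        elif i == break_at:
            lines[-1] += part.pre_break
            lines.append(part.post_break)
        else:
            lines[-1] += part.no_break
    return lines

# The four examples from the list above.
drucker = ["Dru", Discretionary("k-", "k", "ck"), "er"]
angeerfde = ["ange", Discretionary("-", "e", "\u00eb"), "rfde"]
tuggummi = ["tug", Discretionary("g-", "g", "g"), "ummi"]
opaatje = ["opa", Discretionary("-", "", "a"), "tje"]
```

For instance, `render(drucker)` yields the single line "Drucker", while `render(drucker, break_at=1)` yields "Druk-" and "ker".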

5. How to represent and manipulate hyphenation?

This situation is further complicated by the fact that there are different methods of text encoding and manipulation.

At one end of the spectrum, the encoded text does not include a choice of line breaking. That choice is left entirely to the rendering engine, including the possibility of hyphenating words. This situation is common in, say, InDesign documents.

At the other end of the spectrum, the encoded text does include that choice. The marks of hyphenation are explicitly encoded and the spelling reflects hyphenation. The advantage of that approach is that the demands on rendering systems are minimal.

In between, the encoding may include exceptions or hints. For example, HTML supports the representation of explicit line breaks, which are combined with the line breaks determined by the rendering system. Conversely, one may wish to add “optional” line breaks to an otherwise specified line breaking.

In addition, it is sometimes desirable to take a text that includes a choice of line breaks, to ignore those choices and to perform a different layout. Many email user agents allow this reflowing of the text. This operation is a little bit tricky, because one must determine how to undo the hyphenation of words, which can include the suppression of marks as well as the restoration of the unhyphenated spelling. Jukka’s interpretation helps there: a hyphen that is part of the non-hyphenated orthography is represented by a regular HYPHEN, while one that is introduced by hyphenation is represented by a SOFT HYPHEN. It is thus possible to determine whether a hyphen shape should be preserved or not when undoing the line breaking. However, there is no comprehensive strategy to indicate how the spelling should be affected if a hyphenation is to be discarded.
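
As a rough sketch of such a reflowing step, assuming Korpela's interpretation from section 3 (a line-final SOFT HYPHEN marks a split word, a line-final regular hyphen belongs to the spelling), the lines can be joined as follows. The function `unwrap` is hypothetical, and it deliberately makes no attempt to restore spelling:

```python
SHY = "\u00ad"  # SOFT HYPHEN: a hyphen introduced by hyphenation

def unwrap(lines):
    """Join pre-broken lines into one string, undoing hyphenation.

    A line-final SOFT HYPHEN is dropped and the word halves are glued.
    A line-final HYPHEN-MINUS or HYPHEN is kept as part of the spelling.
    Spelling changes caused by hyphenation are NOT restored.
    """
    pieces = []
    for i, line in enumerate(lines):
        is_last = i == len(lines) - 1
        if line.endswith(SHY) and not is_last:
            # Drop the mark and glue the two halves of the word together.
            pieces.append(line[:-1])
        elif line.endswith(("-", "\u2010")) and not is_last:
            # A real hyphen: keep it and add no space after it.
            pieces.append(line)
        else:
            # Ordinary line end: the break stood for a space.
            pieces.append(line if is_last else line + " ")
    return "".join(pieces)

text = unwrap(["a hard prob" + SHY, "lem in a well-", "known standard"])
swedish = unwrap(["tugg" + SHY, "gummi"])
```

The first call gives "a hard problem in a well-known standard". The second gives "tugggummi" with three g's, which shows the remaining gap: the mark can be removed, but nothing in the text says how to get back to "tuggummi".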

Conversely, when possible hyphenation points are indicated in the text, and hyphenation at those points affects the spelling, there is no comprehensive strategy to indicate the alternate spellings. In fact, there is even divergence on the choice of the encoded spelling: Kent Karlsson’s example of “webbesökare” suggests encoding the post-hyphenation spelling, while UAX#14’s example of “tuggummi” suggests encoding the pre-hyphenation spelling!

6. The problems

First, the Unicode standard is not entirely clear on the function and behavior of the characters in question.

Second, under the most likely interpretation (that of section 2), those characters have overloaded functionality, and Unicode history tells us that this is a source of problems.

Third, the visible effect of hyphenation can vary considerably. I personally think that any attempt by the Unicode standard to mandate how a hyphenation should be rendered is asking for trouble. By action or by omission, such a mandate is bound to be misleading at best and, worse, to force implementers into non-conformance.

Fourth, the concept of encoding possible hyphenation points in orthographies where the spelling is affected by hyphenation is troublesome. If an engine is smart enough to know how spelling is affected, then it is hard to imagine that it is not smart enough to know without help that hyphenation is possible. Turned around, this argument says that if encoding possible hyphenation points is desirable, then one must also have a way to encode the effect on spelling.

Fifth, the interaction between possible hyphenation points indicated in the text and possible hyphenation points computed by layout engines is not entirely clear. UAX#14 only indicates what some engines do.

7. Moving forward

It seems that there is no complete and satisfactory solution to the problems. However, we can solve some of these by taking the following steps.

First, we can remove the double semantics on characters:

make U+00AD be a pure control character, by giving it only the first job: communicate to the rendering system where hyphenation is possible
make U+058A and U+1806 be “ordinary” characters (in the end, they should be just like U+2010 HYPHEN)

It seems best to preserve the current use of U+00AD, and to preserve the current pictures attached to U+058A and U+1806. It also makes the handling of additional characters that may be used as hyphens more straightforward, since we would not need to update all the places that discuss U+00AD.

Also, we should:

delegate entirely to the layout engines the choice of how an actual hyphenation (whether induced by U+00AD or by computation) should be rendered

There are just too many complications and situations for us to do a good job. Of course, we can alert implementers that the situation is complicated, perhaps by describing the variety of scenarios.
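
To make the division of labor concrete, here is a minimal sketch in which U+00AD only marks where a break may occur and a separate, engine-chosen policy decides what ink (if any) appears. The function `hyphenate` and the three policy functions are hypothetical names introduced for illustration; a real layout engine would also have to handle spelling changes and hanging punctuation:

```python
SHY = "\u00ad"  # SOFT HYPHEN, treated purely as a break-opportunity hint

def end_of_line_mark(first, second):
    """Hypothetical policy: HYPHEN at the end of the first line."""
    return first + "\u2010", second

def start_of_line_mark(first, second):
    """Hypothetical policy: U+1806 at the start of the second line only."""
    return first, "\u1806" + second

def no_mark(first, second):
    """Hypothetical policy: no visible mark at all."""
    return first, second

def hyphenate(word, break_index, policy):
    """Break `word` at its `break_index`-th SOFT HYPHEN using `policy`.

    The SOFT HYPHENs carry no shape of their own: they are removed from
    both halves, and any visible mark comes from the policy.
    """
    chunks = word.split(SHY)
    first = "".join(chunks[: break_index + 1])
    second = "".join(chunks[break_index + 1 :])
    return policy(first, second)

word = "hy" + SHY + "phen" + SHY + "ation"
latin = hyphenate(word, 1, end_of_line_mark)
todo = hyphenate(word, 0, start_of_line_mark)
bare = hyphenate(word, 0, no_mark)
```

The same encoded word yields different visible results depending only on the policy, which is the freedom the proposal would leave to rendering systems.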

This does not solve all the problems, but I think we will be in better shape. In particular, we localize the problems on U+00AD and free U+058A and U+1806. We also give rendering systems the freedom they need, and do not force them into non-conformance.

While limited, those changes may be more than we can tolerate. Clearly, they affect the semantics of existing documents. It may be useful to observe that implementations vary widely today, and that therefore there may be little to break anyway.

Document History

Author: Eric Muller

Revision	Date	Comments
1	August 14, 2002	Initial version
