186 Word-Joining Hyphen 2011.07.25
Status: Closed
Originator: UTC
Resolution: Text will be added to UAX #29 for Unicode Version 6.1 to say that, in those orthographies where hyphens are used interior to words, it is appropriate to tailor the hyphen to be a "MidLetter".

Description of Issue:

Line breaking and word breaking are different operations: a position in a particular text may allow a line break but not a word break, while another position in that text may allow a word break but not a line break. For example, break positions are indicated with "|" in the following table:

Text           Word break         Line break           Comment
co‧or‧di‧nate  |co‧or‧di‧nate|    co‧|or‧|di‧|nate|    Using a hyphenation point character
the Form A     |the| |Form| |A|   the |Form A|         U+00A0 no-break space between "Form" and "A"

The Unicode Standard has different character properties for line-break and word-break behavior that reflect these differences. For more information, see UAX #14, the Word Boundaries section of UAX #29, and the online demo.
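
As a rough illustration of the second table row, the following Python sketch computes both kinds of break for "the Form A" with a no-break space. The two functions are deliberately simplified stand-ins and are not the UAX #14 or UAX #29 algorithms: the word rule here breaks wherever whitespace meets non-whitespace, and the line rule allows a break only after U+0020 SPACE. The function names are hypothetical.

```python
NBSP = "\u00a0"  # U+00A0 NO-BREAK SPACE


def word_segments(text):
    """Toy word-break rule: break wherever whitespace meets non-whitespace."""
    segments = []
    current = ""
    for ch in text:
        if current and ch.isspace() != current[-1].isspace():
            segments.append(current)
            current = ""
        current += ch
    if current:
        segments.append(current)
    return segments


def line_segments(text):
    """Toy line-break rule: a break is allowed only after U+0020 SPACE."""
    segments = []
    current = ""
    for ch in text:
        current += ch
        if ch == " ":
            segments.append(current)
            current = ""
    if current:
        segments.append(current)
    return segments


text = "the Form" + NBSP + "A"
print(word_segments(text))  # five segments: the no-break space is its own segment
print(line_segments(text))  # two segments: "Form", the no-break space, and "A" stay together
```

The no-break space separates words but does not offer a line-break opportunity, which is the distinction the table draws.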

The Unicode Standard specifies that U+2011 NON-BREAKING HYPHEN disallows line breaks. The Unicode Technical Committee is currently considering whether this non-breaking behavior should be broadened to also affect word breaking behavior.
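
For readers who want to inspect the character, here is a small sketch using Python's standard `unicodedata` module. That module does not expose the Line_Break or Word_Break properties, so this shows only the name, general category, and compatibility decomposition; the helper name `describe` is hypothetical.

```python
import unicodedata


def describe(cp):
    """Return (name, general category, decomposition) for a code point."""
    ch = chr(cp)
    return (
        unicodedata.name(ch),
        unicodedata.category(ch),
        unicodedata.decomposition(ch),
    )


for cp in (0x2010, 0x2011, 0x00A0):
    print(f"U+{cp:04X}", describe(cp))
```

The `<noBreak>` tag in the decomposition of U+2011 records its relationship to U+2010 HYPHEN; it carries no information about word breaking.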

For example, there are a number of writing systems that use a hyphen character between syllables within a word. An example is the Iu Mien language written with the Thai script. Such words should behave as single words for the purpose of selection ("double-click"), indexing, and so forth, meaning that they should not word-break on the hyphen.

The suggested change is that U+2011 NON-BREAKING HYPHEN be given the word-break property MidLetter.
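
The effect of such a property assignment can be sketched with a toy word breaker in Python. This is a simplified approximation under stated assumptions, not the UAX #29 algorithm: it keeps runs of letters together and, loosely following the MidLetter rules, does not break around a designated character that sits between two letters. It uses `str.isalpha()` as a stand-in for the ALetter property and ignores combining marks, so it would not handle Thai-script text correctly. The function name and the `mid_letters` parameter are hypothetical.

```python
def word_segments(text, mid_letters=frozenset()):
    """Toy word breaker with an optional set of MidLetter-like characters."""
    segments = []
    current = ""
    for i, ch in enumerate(text):
        if current:
            prev = text[i - 1]
            nxt = text[i + 1] if i + 1 < len(text) else ""
            joins = (
                # letter x letter
                (prev.isalpha() and ch.isalpha())
                # letter x (MidLetter letter)
                or (prev.isalpha() and ch in mid_letters and nxt.isalpha())
                # (letter MidLetter) x letter
                or (prev in mid_letters and ch.isalpha()
                    and i >= 2 and text[i - 2].isalpha())
            )
            if not joins:
                segments.append(current)
                current = ""
        current += ch
    if current:
        segments.append(current)
    return segments


NBHY = "\u2011"  # U+2011 NON-BREAKING HYPHEN
sample = "over" + NBHY + "a" + NBHY + "dozen"

print(word_segments(sample))                      # five segments
print(word_segments(sample, mid_letters={NBHY}))  # one segment
```

With the default settings the hyphens split the string into separate segments; when U+2011 is treated as MidLetter-like, the whole string is returned as a single word. A trailing hyphen with no letter after it still breaks.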

The advantage of making this change is that U+2011 NON-BREAKING HYPHEN could be used in orthographies that contain interior hyphens. This would avoid a requirement to encode yet another confusable hyphen/dash/minus character to the over-a-dozen already in Unicode.

The disadvantage of making this change is that hyphens are also used to link separate words, such as in "over-a-dozen". In well-known English usage, for example, hyphens are used in attributive compound adjectives. Suppose that a user has used a non-breaking hyphen in such a case to prevent a bad line break. Changing the word-breaking behavior of the non-breaking hyphen would change the interpretation of the construction to wrongly indicate that there was one word instead of two.

The UTC would appreciate feedback on the pros and cons of the different alternatives for dealing with this issue.

