L2/07-028

From: Asmus Freytag
Date: 2007-01-26
Re: Proposal for an update to the line break algorithm in UAX#14
 

Proposal for an update to the line break algorithm in UAX#14
for discussion at UTC#110.

Proposal

Change the line break rule for nonbreaking characters.

Existing:

LB12 Do not break before or after NBSP and related characters.

[^SP] × GL

GL ×

Proposed:

LB12a Do not break after NBSP and related characters.

GL ×

LB12b Do not break before NBSP and related characters.

[^SP, BA] × GL

Additionally, move rule LB12b from the non-tailorable part of the line break rules to the tailorable part.
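The difference between the existing and the proposed rule can be illustrated with a minimal sketch. It is not part of the proposal text: the function names are hypothetical, line break classes are represented as plain strings, and only the LB12 family is modelled. A result of False means only that this rule does not prohibit the break; the remaining rules of UAX#14 would still decide the outcome.

```python
# Illustrative sketch only: models LB12 (existing) and LB12a/LB12b (proposed)
# on a pair of line break classes. Function names are hypothetical.

def lb12_existing_prohibits(before: str, after: str) -> bool:
    """True if existing LB12 prohibits a break between the two classes."""
    if before == "GL":                      # GL ×
        return True
    if after == "GL" and before != "SP":    # [^SP] × GL
        return True
    return False


def lb12_proposed_prohibits(before: str, after: str) -> bool:
    """True if proposed LB12a or LB12b prohibits a break between the two classes."""
    if before == "GL":                                  # LB12a: GL ×
        return True
    if after == "GL" and before not in ("SP", "BA"):    # LB12b: [^SP, BA] × GL
        return True
    return False


if __name__ == "__main__":
    for pair in [("BA", "GL"), ("AL", "GL"), ("SP", "GL"), ("GL", "BA")]:
        print(pair,
              "existing:", lb12_existing_prohibits(*pair),
              "proposed:", lb12_proposed_prohibits(*pair))
```

The only pair whose result changes is <BA, GL>: the existing rule prohibits the break, while the proposed rule leaves it to later rules.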

Rationale

Making this limited change will allow existing, long-standing practice to be conformant with UAX#14. Making these implementations non-compliant, even with tailoring, was never the intent.

While the proposal allows hyphens (class BA) to override the effect of a following GL (non-breaking) character, it retains the concept that class GL represents characters with an important normative property, namely that of being non-breaking.

Because allowing a break after hyphen, SHY, etc. in front of NBHY etc. is useful for some languages in its own right, the proposal also recommends that the default rule be changed to recognize not only SP but also BA as overriding the non-breaking nature of a following GL character. (See Background).

WJ can be used in a context <BA, WJ, GL> where true non-breaking behavior following a BA is required. Additionally, moving rule LB12b to the tailorable part of the rules allows implementations to adjust this behavior further. It also allows implementations that are compliant with Unicode 5.0.0 to retain compliance by declaring a tailoring, without requiring changes to their code.
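The effect of an intervening WJ can be sketched as follows. This is an illustration under stated assumptions, not an implementation of the algorithm: it assumes that rule LB11 of UAX#14 prohibits breaks on either side of WJ and is applied before the LB12 rules, it models only LB11 and the proposed LB12a/LB12b, and the function name is hypothetical.

```python
from typing import Optional, Sequence

# Illustrative sketch only: reports which of LB11, LB12a, LB12b (as proposed)
# prohibits a break at a given position. All other rules are ignored.

def prohibiting_rule(classes: Sequence[str], i: int) -> Optional[str]:
    """Name the rule that forbids a break between classes[i-1] and classes[i].

    Returns None if none of the modelled rules prohibits the break; the
    remaining rules of UAX#14 would then decide.
    """
    before, after = classes[i - 1], classes[i]
    if before == "WJ" or after == "WJ":                 # LB11 (assumed): × WJ, WJ ×
        return "LB11"
    if before == "GL":                                  # LB12a: GL ×
        return "LB12a"
    if after == "GL" and before not in ("SP", "BA"):    # LB12b: [^SP, BA] × GL
        return "LB12b"
    return None


if __name__ == "__main__":
    print(prohibiting_rule(["BA", "GL"], 1))        # not prohibited by these rules
    print(prohibiting_rule(["BA", "WJ", "GL"], 1))  # prohibited before WJ
    print(prohibiting_rule(["BA", "WJ", "GL"], 2))  # prohibited after WJ
```

In the sequence <BA, WJ, GL>, both positions are covered by the WJ rule, so the BA exception in LB12b never comes into play.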

No changes in assigned properties are proposed.

Background

There are line breaking conventions that modify the appearance of a line break when the line break opportunity is based on an explicit hyphen. In Polish, explicit hyphens are always promoted to the next line if a line break occurs at that location in the text. For example, given the sentence "Tam wisi czerwono-niebieska flaga" ("There hangs a red-blue flag"), if the optimal line break occurs at the location of the explicit hyphen, an additional hyphen is displayed at the beginning of the next line, like this:

Tam wisi czerwono-
-niebieska flaga.

The same convention is used in Portuguese, where the use of hyphens is common because it is mandatory for verb forms that include a pronoun. There are examples where homographs or ambiguity may arise if hyphens are treated incorrectly: "disparate" means "folly" while "dispara-te" means "fire yourself" (or "fires onto you"). Therefore the former needs to be line broken as

dispara-
te

and the latter as

dispara-
-te.

The practice of typing <SHY, NBHY> instead of <HYPHEN> to achieve promotion of the hyphen to the next line is reportedly common and is supported by several major text layout applications and at least one major browser.

However, this is not supported by the algorithm as specified in version 5.0.0 of UAX#14, and tailoring of the properties of NBHY is not permitted.

The software investigated also supports breaking in the case of <HYPHEN, NBSP>, <HYPHEN, NBHY>, etc. This behavior is therefore not limited to contexts involving SHY and cannot be addressed by a more narrowly tailored proposal.