L2/08-023

Source: Asmus Freytag
Date: 2008-01-16
Subject: Re: line breaking and SEAsia

Martin has suggested that the Unicode Line Break algorithm could be 
improved by using a multi-level line-break opportunity instead of a 
binary scheme (present/absent). Specifically, he has suggested that 
there are three levels.

Actually, the situation is rather more complex:

1) For truly high-end text layout one needs to calculate some sort of 
penalty function on line breaks that weights many factors, not only 
their immediate context (e.g. hyphenation vs. word-break) but also their 
location in the line (will they leave the line too tight or too loose) 
and finally their location in the paragraph (will they lead to rivers or 
orphaned words on the last line). In extreme cases, even the location on 
the page matters, as choice of line break can and does affect pagination.

A fixed, three-level scheme does little to elevate the state of the art 
towards such high-end layout because the gradation remains based on each 
individual line break opportunity, instead of being a score that 
measures the overall quality of the line, paragraph or page (depending 
on the desired sophistication). Even relatively simple layout systems 
would want to consider the whole line when deciding whether to take or 
ignore a line break opportunity. They may decide to ignore a break based 
on automatic hyphenation, or even based on a hyphen if the resulting 
line remains within acceptable parameters, but prefer such a break over 
a space based break if the line would become too short (or would require 
unacceptable amounts of compression).

The key thing here is that even the simplest such systems would 
evaluate adjacent candidates and that the distance (in layout 
coordinates, not character offsets) between candidates relative to the 
length of the line factors into the evaluation. A fixed multilevel 
scheme cannot substitute for this kind of analysis.
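
To make the point concrete, here is a minimal sketch of how even a simple line fitter might weigh adjacent candidates. The `Candidate` type, the kind labels, and the penalty numbers are all invented for illustration; they are not part of UAX#14 or of any particular layout system.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    x: float    # layout position of the break, same units as line_width
    kind: str   # "space", "hyphen" or "auto" (hypothetical labels)

# Hypothetical fixed cost per kind of break; a real system would tune these.
KIND_PENALTY = {"space": 0.0, "hyphen": 30.0, "auto": 50.0}

def choose_break(candidates, line_width):
    """Return the cheapest candidate that fits, or None if none fits."""
    best, best_cost = None, float("inf")
    for c in candidates:
        if c.x > line_width:
            continue  # taking this break would leave the line overfull
        # Looseness is measured in layout coordinates relative to the line.
        slack = (line_width - c.x) / line_width
        cost = 1000.0 * slack ** 2 + KIND_PENALTY[c.kind]
        if cost < best_cost:
            best, best_cost = c, cost
    return best
```

With a line width of 100, a space break at 90 beats an automatic hyphenation break at 95, but the same hyphenation break wins over a space break at 60. The decision depends on the distance between candidates, which no fixed per-break level can express.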

2) Because of this, the Unicode line break algorithm is designed to 
report *all* the legal line-break opportunities. It is specifically 
left to the layout algorithm to decide *which* of these line break 
opportunities are to be prioritized and how to decide among alternatives 
to arrive at the *actual* line break location for a given layout. (See 
the overview in UAX#14.)

Once an application has access to a list of legal line break 
opportunities in terms of offsets it is a simple matter to go back and 
look for the presence of SHY, or other such characters to help further 
(re-)prioritize such line breaks. If an automatic hyphenator were to 
supply not just SHY codes but special character codes (for example, 
non-character code points), it would be a simple thing to tailor the 
linebreak implementation to support more detailed 'strength' hints for 
such automatically inserted line break opportunities. Such schemes would 
be limited to work inside a given implementation, but since they would 
require an effective common design of both the hyphenator and the final 
layout logic, that's perhaps not as much of an issue.
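
As a rough sketch of the "go back and look" step, assume the line break implementation has already produced a list of offsets, each meaning "a break is allowed before the character at this index". The function name and the kind labels below are hypothetical.

```python
SHY = "\u00AD"  # SOFT HYPHEN

def classify_breaks(text, offsets):
    """Label each legal break offset by the character that precedes it."""
    result = []
    for off in offsets:
        prev = text[off - 1] if 0 < off <= len(text) else ""
        if prev == SHY:
            kind = "soft-hyphen"
        elif prev in ("-", "\u2010"):
            kind = "hyphen"
        elif prev.isspace():
            kind = "space"
        else:
            kind = "other"
        result.append((off, kind))
    return result
```

A layout algorithm could then attach whatever priority it likes to each label, without the line break algorithm itself knowing anything about priorities.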

3) The role of the Unicode standard is in defining characters. The most 
important purpose of standardizing certain character properties and 
behaviors is in unambiguously identifying the character in question 
through its behavior, and to establish firm guidelines to users and 
implementers alike as to which characters to use in which context (and 
allowing for overriding requirements by orthographic rules).

In that sense, a statement like "a SHY is an invisible character which 
defines a word-break opportunity, the rendering of which is language 
dependent" is just what's needed (paraphrased here, not quoted). Such a 
statement makes clear what kind of animal the SHY is (and by exclusion, 
what it isn't) and when it's expected to be used.

The precise rendering of texts including a SHY would depend on applying 
additional rules: those of the orthography in question (usually defined 
by the language) but also those required to achieve the desired 
typographical quality of the output. A basic system might honor each SHY 
that happens to result in a maximal, but not overfull line. A more 
sophisticated system would consider the desirability of an intra-word 
break compared to the resulting looseness of a line if a line break 
opportunity based on SHY was not used, etc. For example, hyphenated word 
breaks at the end of a line are a necessity in narrow columns, but in 
long lines text looks better without them.

Unicode has no business in regulating either orthographies or 
typographical support, other than to define a basic level where 
necessary. Because of that, the Unicode line break algorithm was 
carefully designed to not require more than very basic typographic 
behavior, but at the same time, its rules are designed to (more or less) 
define the common basic functionality, which in turn defines the basic 
nature of various characters (esp. where it might not be immediately 
obvious to the naive user whether a given character might have been 
conceived as suitable in a given context).

4) Some characters are solely (or primarily) defined via their line 
breaking behavior. For example, ZWSP and WJ exist primarily to 
unequivocally enable or disable breaks. The Unicode line break 
specification for such characters should be understood as providing the 
complete (and essentially mandatory) specification of their behavior, 
including their relative precedence (ZWSP overrides WJ and not the other 
way around).

By adhering to these core specifications, implementations can ensure 
that users creating texts on one system can correctly predict the effect 
of such line breaking overrides on all other systems.
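
The precedence can be illustrated with a toy function that looks only at ZWSP and WJ. It is a deliberate simplification: it ignores spaces and every other line break class, and the rule order is paraphrased from UAX#14 rather than quoted. The function name is hypothetical.

```python
ZWSP = "\u200B"  # ZERO WIDTH SPACE
WJ = "\u2060"    # WORD JOINER

def break_before(text, i):
    """True/False if ZWSP or WJ decides the break before text[i], else None.

    Only meaningful for 0 < i < len(text).
    """
    if text[i] == ZWSP:
        return False  # never break before a ZWSP
    if text[i - 1] == ZWSP:
        return True   # break after a ZWSP, checked before WJ is considered
    if text[i] == WJ or text[i - 1] == WJ:
        return False  # WJ forbids a break on either side
    return None       # left to the rest of the algorithm
```

For the sequence ZWSP followed by WJ, the position between them is reported as a break opportunity, because the ZWSP check comes first. For WJ followed by ZWSP, the break still appears after the ZWSP.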

In conclusion:

Layout mechanisms that treat line-breaks as all-or-nothing are very 
basic indeed. It is easy to derive many schemes to improve on their 
output. Truly high-quality layout is achieved only by considering large 
contexts, such as the whole line, the whole paragraph, the whole page or 
whole text, and working out where the 'ink' will end up on the page. The 
Unicode linebreak algorithm is not designed to solve that problem. It is 
deliberately tuned to give all legal linebreak opportunities, requiring 
the line-fitting algorithm that logically follows it in the layout 
process to make a selection based on factors that are external to the 
line break algorithm as such. In the actual implementation of a high-end 
layout system, it may be desirable to provide a way for a hyphenator to 
communicate some inherent desirability metric for each automatically 
generated line break opportunity to the layout algorithm. The details of 
that are by nature implementation specific and don't require (or even 
benefit from) standardization by the Unicode Consortium. However, for 
such private notification, one implementation technique would be to 
temporarily insert non-character code points into the text stream. That 
was one of the possible usage scenarios foreseen for such codes when 
they were added to the Unicode Standard.
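
One way such a private channel might look is sketched below. The choice of U+FDD0 as a base, the three strength levels, and both function names are assumptions made purely for illustration. The markers must be stripped again before the text leaves the implementation.

```python
HINT_BASE = 0xFDD0   # U+FDD0..U+FDEF are non-character code points
HINT_LEVELS = 3      # hypothetical number of strength levels

def insert_hints(text, hints):
    """Insert a marker before text[offset] for each offset -> strength pair.

    Strengths are expected to lie in range(HINT_LEVELS).
    """
    out = []
    for i, ch in enumerate(text):
        if i in hints:
            out.append(chr(HINT_BASE + hints[i]))
        out.append(ch)
    return "".join(out)

def extract_hints(marked):
    """Strip the markers; return the clean text and offset -> strength map."""
    clean, hints = [], {}
    for ch in marked:
        level = ord(ch) - HINT_BASE
        if 0 <= level < HINT_LEVELS:
            hints[len(clean)] = level  # offset in the cleaned text
        else:
            clean.append(ch)
    return "".join(clean), hints
```

The hyphenator would call `insert_hints`, and the layout logic would call `extract_hints` before measuring or rendering, so the two sides only need to agree on the private marker range.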

A./