L2/08-023
Source: Asmus Freytag
Date: 2008-01-16
Subject: Re: line breaking and SEAsia

Martin has suggested that the Unicode Line Break algorithm could be improved by using multi-level line-break opportunities instead of a binary scheme (present/absent). Specifically, he has suggested that there are three levels. Actually, the situation is rather more complex:

1) For truly high-end text layout one needs to calculate some sort of penalty function on line breaks that weighs many factors: not only their immediate context (e.g. hyphenation vs. word-break) but also their location in the line (will they leave the line too tight or too loose) and finally their location in the paragraph (will they lead to rivers or orphaned words on the last line). In extreme cases, even the location on the page matters, as the choice of line break can and does affect pagination.

A fixed, three-level scheme does little to elevate the state of the art towards such high-end layout, because the gradation remains based on each individual line break opportunity instead of being a score that measures the overall quality of the line, paragraph or page (depending on the desired sophistication).

Even relatively simple layout systems would want to consider the whole line when deciding whether to take or ignore a line break opportunity. They may decide to ignore a break based on automatic hyphenation, or even one based on a hyphen, if the resulting line remains within acceptable parameters, but prefer such a break over a space-based break if the line would otherwise become too short (or would require unacceptable amounts of compression). The key thing here is that even the simplest such systems would evaluate adjacent candidates, and that the distance between candidates (in layout coordinates, not character offsets) relative to the length of the line factors into the evaluation. A fixed multilevel scheme cannot substitute for this kind of analysis.

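
To make point 1 concrete, here is a minimal Python sketch of how even a simple layout system might weigh adjacent candidates for a single line. Everything in it is an assumption made for illustration: the `Candidate` type, the `choose_break` function, the kind labels, the quadratic looseness term and the penalty values are hypothetical and are not taken from UAX#14 or from any particular layout engine.

```python
from dataclasses import dataclass

# Hypothetical fixed costs per kind of break; the values are illustrative only.
KIND_PENALTY = {"space": 0.0, "hyphen": 50.0, "auto_hyphen": 100.0}


@dataclass
class Candidate:
    x: float   # layout position where the line would end if broken here
    kind: str  # key into KIND_PENALTY


def choose_break(candidates, line_width):
    """Return the cheapest candidate that fits within line_width, or None."""
    best, best_cost = None, float("inf")
    for cand in candidates:
        if cand.x > line_width:
            continue  # breaking here would leave the line overfull
        # Looseness depends on the distance in layout coordinates,
        # relative to the length of the line.
        slack = (line_width - cand.x) / line_width
        cost = 10000.0 * slack ** 2 + KIND_PENALTY[cand.kind]
        if cost < best_cost:
            best, best_cost = cand, cost
    return best
```

With a line width of 100, a space break at 60 and an automatic hyphenation break at 95, this sketch picks the hyphenation break, because the space break would leave the line far too loose. With a space break at 92 and a hyphenation break at 97 it picks the space break. The same two kinds of break are ranked differently depending on where they fall in the line, which is the information a fixed per-opportunity level cannot carry.
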
2) Because of this, the Unicode line break algorithm is designed to report *all* the legal line break opportunities. It is specifically left to the layout algorithm to decide *which* of these line break opportunities are to be prioritized and how to decide among alternatives to arrive at the *actual* line break location for a given layout. (See the overview in UAX#14.)

Once an application has access to a list of legal line break opportunities in terms of offsets, it is a simple matter to go back and look for the presence of SHY or other such characters to help further (re-)prioritize such line breaks.

If an automatic hyphenator were to supply not just SHY codes but special character codes (for example, non-character code points), it would be a simple thing to tailor the line break implementation to support more detailed 'strength' hints for such automatically inserted line break opportunities. Such schemes would be limited to working inside a given implementation, but since they would in any case require an effective common design of both the hyphenator and the final layout logic, that's perhaps not much of an issue.

3) The role of the Unicode Standard is in defining characters. The most important purpose of standardizing certain character properties and behaviors is to identify the character in question unambiguously through its behavior, and to establish firm guidelines for users and implementers alike as to which characters to use in which context (while allowing for overriding requirements by orthographic rules).

In that sense, a statement like "a SHY is an invisible character which defines a word-break opportunity, the rendering of which is language dependent" is just what's needed (paraphrased here, not quoted). Such a statement makes clear what kind of animal the SHY is (and by exclusion, what it isn't) and when it's expected to be used.

The precise rendering of texts including a SHY would depend on applying additional rules: those of the orthography in question (usually defined by the language) but also those required to achieve the desired typographical quality of the output. A basic system might honor each SHY that happens to result in a maximal, but not overfull, line. A more sophisticated system would weigh the undesirability of an intra-word break against the looseness of the line that would result if the line break opportunity based on SHY were not used, etc. For example, hyphenated word breaks at the end of a line are a necessity in narrow columns, but for long lines, text looks better without them.

Unicode has no business in regulating either orthographies or typographical support, other than to define a basic level where necessary. Because of that, the Unicode line break algorithm was carefully designed to not require more than very basic typographic behavior, but at the same time, its rules are designed to (more or less) define the common basic functionality, which in turn defines the basic nature of various characters (esp. where it might not be immediately obvious to the naive user whether a given character might have been conceived as suitable in a given context).

4) Some characters are solely (or primarily) defined via their line breaking behavior. For example, ZWSP and WJ exist primarily to unequivocally enable or disable breaks. The Unicode line break specification for such characters should be understood as providing the complete (and essentially mandatory) specification of their behavior, including their relative precedence (ZWSP overrides WJ and not the other way around). By adhering to these core specifications, implementations can ensure that users creating texts on one system can correctly predict the effect of such line breaking overrides on all other systems.

In conclusion:

Layout mechanisms that treat line breaks as all-or-nothing are very basic indeed.
It is easy to devise many schemes to improve on their output. Truly high-quality layout is achieved only by considering large contexts, such as the whole line, the whole paragraph, the whole page or the whole text, and working out where the 'ink' will end up on the page.

The Unicode line break algorithm is not designed to solve that problem. It is deliberately tuned to give all legal line break opportunities, requiring the line-fitting algorithm that logically follows it in the layout process to make a selection based on factors that are external to the line break algorithm as such.

In the actual implementation of a high-end layout system, it may be desirable to provide a way for a hyphenator to communicate some inherent desirability metric for each automatically generated line break opportunity to the layout algorithm. The details of that are by nature implementation specific and don't require (or even benefit from) standardization by the Unicode Consortium. However, for such private notification, one implementation technique would be to temporarily insert non-character code points into the text stream. That was one of the possible usage scenarios foreseen for such codes when they were added to the Unicode Standard.

A./
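
As an illustration of the private notification technique mentioned above, and of going back over a list of break offsets to look for SHY, here is a minimal Python sketch. It assumes that the list of legal break offsets has already been produced by some line break implementation, tailored so that the hint characters allow a break after them; that implementation is not shown. The choice of U+FDD0..U+FDD2 as hint codes, the strength values and the function names `classify_breaks` and `strip_hints` are hypothetical conventions invented for this example, not anything defined by the Unicode Standard.

```python
SHY = "\u00ad"  # SOFT HYPHEN

# Hypothetical private convention: three noncharacters inserted by the
# hyphenator, each mapped to a 'strength' value (higher = more desirable).
HINTS = {"\ufdd0": 1, "\ufdd1": 2, "\ufdd2": 3}


def classify_breaks(text, break_offsets):
    """Label each legal break offset by looking at the character before it.

    An offset n means 'a break is allowed between text[n-1] and text[n]'.
    Returns a list of (offset, kind, strength) tuples.
    """
    labelled = []
    for off in break_offsets:
        prev = text[off - 1] if off > 0 else ""
        if prev in HINTS:
            labelled.append((off, "auto_hyphen", HINTS[prev]))
        elif prev == SHY:
            labelled.append((off, "soft_hyphen", None))
        else:
            labelled.append((off, "other", None))
    return labelled


def strip_hints(text):
    """Remove the private hint codes again before the text leaves the system."""
    return "".join(ch for ch in text if ch not in HINTS)
```

The layout logic would consume the labelled list when choosing among candidates, and `strip_hints` would be applied before the text is stored or interchanged, since noncharacters are meant for internal use only.
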