L2/08-022
Source: Martin Hosken
Date: 2008-01-15
Subject: Line breaking and SEAsia

Dear All,

If this is worth discussion, can someone give it an L2 number and add it to the agenda.

South East Asian languages are interesting from the perspective of line breaking in that many of the scripts used to write them have no inter-word spaces. We can include Chinese and even Lisu (which has spaces between syllables but does not mark word boundaries beyond that) along with Khmer, Thai and Burmese in the list of scripts and languages of interest.

The general approach to line breaking taken in such languages is that there are three levels of line break opportunity: punctuation break (including space in some languages), word break and, at the weakest, syllable break. In Latin script languages the first two are conflated, since (in summary) a space marks a word boundary. But Latin script does have the weakest level, in the form of a hyphenation break.

The Unicode Line Breaking Properties (UAX #14) takes the view that the issue of line breaking is a binary one: can we break here or not? It waves a little at the issue of hyphenation but then treats soft hyphen (U+00AD) identically to, say, a pause or sentence break (U+104B).

I would like to propose the three-level line breaking approach, such that UAX #14 categorises its properties according to these levels. If we consider each level in turn and which characters are involved in line breaking opportunities at that level, I would suggest the following:

    Punctuation break: all punctuation marks and space
    Word break:        ZWSP (U+200B)
    Hyphen break:      hyphen, soft hyphen
    Inhibit break:     WJ (U+2060)

Examining this list, we see that there is no character that marks an invisible hyphen-level break. The soft hyphen is closest, but it causes a hyphen to be rendered when it occurs at a line boundary. What is needed is an invisible soft hyphen, or something equivalent.

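
To make the proposed categories concrete, here is a minimal sketch in Python of how a process might classify break opportunities by level. It is an illustration only, not part of UAX #14: the names BreakLevel, break_level_after, opportunities and best_break are hypothetical, "punctuation" is approximated by the Unicode general categories beginning with P, and the rule that a WJ cancels the opportunity immediately before it is an assumption about how "inhibit break" would behave.

```python
import unicodedata
from enum import IntEnum

ZWSP = "\u200b"  # ZERO WIDTH SPACE
SHY = "\u00ad"   # SOFT HYPHEN
WJ = "\u2060"    # WORD JOINER


class BreakLevel(IntEnum):
    """Hypothetical levels, ordered from weakest to strongest."""
    INHIBIT = 0
    HYPHEN = 1
    WORD = 2
    PUNCTUATION = 3


def break_level_after(ch):
    """Level of the break opportunity after ch, or None if ch marks none."""
    if ch == WJ:
        return BreakLevel.INHIBIT
    if ch == ZWSP:
        return BreakLevel.WORD
    # Hyphen is tested before the general punctuation check because
    # HYPHEN-MINUS itself has general category Pd.
    if ch == SHY or ch == "-":
        return BreakLevel.HYPHEN
    if ch == " " or unicodedata.category(ch).startswith("P"):
        return BreakLevel.PUNCTUATION
    return None


def opportunities(text):
    """Return (offset, level) pairs; offset is the index just after the marker."""
    found = []
    for i, ch in enumerate(text):
        level = break_level_after(ch)
        if level is None or level == BreakLevel.INHIBIT:
            continue
        # Assumption: a WJ directly after the marker suppresses the break.
        if text[i + 1:i + 2] == WJ:
            continue
        found.append((i + 1, level))
    return found


def best_break(text, limit, min_level):
    """Latest offset <= limit whose level is at least min_level, else None."""
    offsets = [off for off, level in opportunities(text)
               if off <= limit and level >= min_level]
    return max(offsets) if offsets else None


sample = "ab cd" + ZWSP + "ef" + SHY + "gh"
print(opportunities(sample))
print(best_break(sample, 10, BreakLevel.WORD))
```

A caller that only accepts punctuation breaks passes BreakLevel.PUNCTUATION as min_level; one that is willing to hyphenate passes BreakLevel.HYPHEN and so sees every opportunity.
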
The advantage of going with this three-level model is that it allows language-specific processing to insert appropriately weighted breaks algorithmically, which is what happens already. Thus in English a hyphenation algorithm can insert a hyphen-weight break (with the corresponding hyphen output on breaking). Other languages can often identify syllable breaks (or intra-word breaks) algorithmically with relative ease, but have trouble identifying inter-word breaks that are not punctuation breaks.

Processes that decide where to break a line may use whichever level of break they deem appropriate, and may even use the information to help balance paragraphs and line breaks nicely, or not, as they see fit.

All in all, I suggest that a three-level approach fits the data better than a binary "break here or not?" model.

Yours,
Martin Hosken