L2/08-022

Source: Martin Hosken
Date: 2008-01-15
Subject: Line breaking and SEAsia

Dear All,

If this is worth discussion, could someone give it an L2 number and add it to the agenda?

South East Asian languages are interesting from the perspective of line breaking in
that many of the scripts used to write them have no inter-word spaces. We can include
Chinese and even Lisu (which has spaces between syllables but does not mark word
boundaries beyond that) along with Khmer, Thai and Burmese in the list of scripts
and languages of interest.

The general approach to line breaking taken in such languages is that there are
three levels of line break opportunity: punctuation break (including space in some
languages), word break and, weakest of all, syllable break. In Latin script languages,
the first two are conflated since, broadly speaking, a space marks a word boundary. But
Latin script does have the weakest level, in the form of a hyphenation break.

The Unicode Line Breaking Properties (UAX #14) takes the view that the issue of
line breaking is a binary one: can we break here or not? It waves a little at the
issue of hyphenation but then treats soft hyphen (U+00AD) identically to,
say, a pause or sentence break (U+104B).

I would like to propose the three-level line breaking approach, such that UAX #14
categorises its properties according to these levels.

If we consider each level in turn and what characters are involved in line breaking
opportunities at that level, I would suggest the following:

Punctuation break: all punctuation marks and space
Word break: ZWSP (U+200B)
Hyphen break: hyphen, soft hyphen (U+00AD)
Inhibit break: WJ (U+2060)
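
As a rough illustration of the list above, here is a minimal Python sketch that
assigns a break level to the opportunity following a character. The names
BreakLevel, LEVEL_BY_CHAR and break_level_after are hypothetical, the numeric
weights are arbitrary, and treating every character in a Unicode "P" general
category as a punctuation break is a simplifying assumption, not something
UAX #14 specifies.

```python
import unicodedata
from enum import IntEnum


class BreakLevel(IntEnum):
    """Hypothetical break strengths; a larger value is a stronger break."""
    INHIBIT = 0
    HYPHEN = 1
    WORD = 2
    PUNCTUATION = 3


# Explicit assignments taken from the list in the text (table name is illustrative).
LEVEL_BY_CHAR = {
    "\u2060": BreakLevel.INHIBIT,      # WORD JOINER
    "\u00AD": BreakLevel.HYPHEN,       # SOFT HYPHEN
    "\u2010": BreakLevel.HYPHEN,       # HYPHEN
    "-": BreakLevel.HYPHEN,            # HYPHEN-MINUS
    "\u200B": BreakLevel.WORD,         # ZERO WIDTH SPACE
    " ": BreakLevel.PUNCTUATION,       # SPACE
}


def break_level_after(ch):
    """Return the break level after ch, or None if ch marks no opportunity."""
    if ch in LEVEL_BY_CHAR:
        return LEVEL_BY_CHAR[ch]
    # Assumption: any other punctuation character offers a punctuation break.
    if unicodedata.category(ch).startswith("P"):
        return BreakLevel.PUNCTUATION
    return None
```

With this table, U+104B falls into the punctuation level while soft hyphen falls
into the hyphen level, which is the distinction the binary model does not make.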

Examining this list, we see that there is no invisible character that marks a
hyphen-level break. The soft hyphen is closest, but it causes a hyphen to be rendered
when it occurs at a line boundary. What is needed is an invisible soft hyphen, or
something equivalent.

The advantage of going with this three-level model is that it allows language-specific
processing to insert appropriately weighted breaks algorithmically, which is what
happens already. Thus in English a hyphenation algorithm can insert a hyphen-weight
break (with the corresponding hyphen output on breaking). Other languages can often
identify syllable breaks (or intra-word breaks) algorithmically relatively simply,
but have trouble identifying inter-word breaks that are not punctuation breaks.
Processes that are deciding where to break a line may use whichever level of break
they deem appropriate, and may even use the information to help balance paragraphs
and line breaks nicely, or not, as they see fit.
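
To make the last point concrete, the following Python sketch shows one possible
policy a line breaker might apply to weighted opportunities. It is only an
assumption about how such a process could work: the function name choose_break,
the (position, level) representation and the rule "strongest level first, then
the latest position" are all illustrative, and a real implementation might
trade off break strength against line fullness quite differently.

```python
def choose_break(opportunities, min_pos, max_pos):
    """Pick a break position from weighted opportunities.

    opportunities: iterable of (position, level) pairs, where a larger level
    means a stronger break and level 0 means breaking is inhibited.
    Only positions within [min_pos, max_pos] are considered.
    Returns the chosen position, or None if no break is allowed in range.
    """
    candidates = [
        (level, pos)
        for pos, level in opportunities
        if min_pos <= pos <= max_pos and level > 0
    ]
    if not candidates:
        return None
    # Tuples compare by level first, then by position, so max() picks the
    # strongest level and, among equals, the break furthest along the line.
    _level, pos = max(candidates)
    return pos


# Levels here: 1 = hyphen/syllable, 2 = word, 3 = punctuation.
opportunities = [(5, 1), (8, 2), (12, 3), (18, 1), (19, 2)]
best = choose_break(opportunities, 10, 20)   # punctuation break at 12
fallback = choose_break(opportunities, 16, 18)  # only a hyphen break fits, at 18
```

The min_pos bound stands in for whatever tolerance the process has for short
lines; a narrow window forces it down to weaker break levels.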

All in all, I suggest that a three-level approach fits the data better than a binary
"break here or not?" model.

Yours,
Martin Hosken