L2/05-062

Proposed Updates To 4.1.0 Linebreak Properties

Author: Asmus Freytag
Date: 2005-02-06
Revision: 2

This document contains my summary of how to resolve beta feedback on
line break properties, including information from the discussion with various
proposal submitters and others with expert knowledge.

Lines prefixed with "LineBreak-4.1.0d7.txt:" are quoted from the beta data
file. A ** in place of a property value means that no property had been
assigned up to that point (the ** serves as a placeholder in the data file).
All other characters mentioned here have had tentative beta assignments
or are existing characters (pre 4.1.0).
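
The record notation used throughout this document can be read mechanically.
The following is a minimal Python sketch, not part of any Unicode tooling.
The names Record and parse_record are hypothetical, and the pattern assumes
that every record looks like the ones quoted in this document (optional file
prefix, code point, value or "old->new" pair, then "# NAME").

```python
import re
from typing import NamedTuple, Optional


class Record(NamedTuple):
    """One property record as written in this document (hypothetical type)."""
    code_point: int
    before: Optional[str]  # None when the data file shows the ** placeholder
    after: Optional[str]   # equals `before` when the line shows no "->" change
    name: str


# Assumed shape: [file prefix:]CODE;VALUE[->VALUE] # NAME
_LINE = re.compile(
    r"^(?:LineBreak-4\.1\.0d7\.txt:)?"
    r"(?P<cp>[0-9A-F]{4,6});"
    r"(?P<old>\*\*|[A-Z][A-Z0-9])"
    r"(?:->(?P<new>[A-Z][A-Z0-9]))?"
    r"\s*#\s*(?P<name>.+)$"
)


def parse_record(line: str) -> Record:
    """Parse a single record line; raise ValueError on anything else."""
    m = _LINE.match(line.strip())
    if m is None:
        raise ValueError(f"not a property record: {line!r}")
    before = None if m["old"] == "**" else m["old"]
    after = m["new"] if m["new"] is not None else before
    return Record(int(m["cp"], 16), before, after, m["name"].strip())


if __name__ == "__main__":
    print(parse_record("060C;AL->EX # ARABIC COMMA"))
    print(parse_record(
        "LineBreak-4.1.0d7.txt:05C6;** # HEBREW PUNCTUATION NUN HAFUKHA"))
```

A record with the ** placeholder comes back with both values set to None,
which keeps "no assignment yet" distinct from "assignment unchanged".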

Compared to the original revision of this document which was circulated
on the unicore list, the HEBREW, KHAROSHTHI and OTHER sections have
been substantially revised, based on feedback received.

HEBREW

LineBreak-4.1.0d7.txt:05C6;** # HEBREW PUNCTUATION NUN HAFUKHA

According to Peter Kirk, this comes after a word and should not allow a break
before it, even if separated from the word by a space (which would require a
line break value of EX).
Mark Shoulson thinks it can occur in a bracketing way (i.e. start a line).
If that is true, BA would be more appropriate.

Change ** -> BA

ARABIC

As a result of a review of these properties on the bidi list, I propose to
change these existing characters from AL to EX for 4.1.0:

060C;AL->EX # ARABIC COMMA
061B;AL->EX # ARABIC SEMICOLON
061F;AL->EX # ARABIC QUESTION MARK
066A;AL->EX # ARABIC PERCENT SIGN
06D4;AL->EX # ARABIC FULL STOP

These are only used as sentence ending punctuation and are not used as part
of numbers, which makes them similar to ! and ?

Kamal Mansour writes: "In traditional Arabic typography, one often sees spaces
surrounding a punctuation mark such as comma or any of the others above. Over
the past decade, DTP has somewhat reduced the frequency of this practice, but
for the purpose of an algorithm, one couldn't count on the lack of white space
between a word and an adjoining punctuation mark. The situation for Arabic
would not be so different from French practice with regard to spacing around
punctuation."

This is precisely the reason for which the EX class was designed.

The effect of this would be to allow no linebreaks before these characters,
even if preceded by whitespace, and slightly more linebreaks after them,
in particular if directly followed by letters or numbers.

The same rationale holds for this newly assigned character:

LineBreak-4.1.0d7.txt:061E;EX # ARABIC TRIPLE DOT PUNCTUATION MARK
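
The effect of the AL to EX change described above can be illustrated with a
toy model. The Python sketch below is not the UAX#14 algorithm: it knows only
the classes SP, AL and EX, the function name break_opportunities is
hypothetical, and its "break everywhere else" default is a simplification.
It shows only where a break becomes disallowed (before the mark, even after
a space) and where one becomes allowed (after the mark, before a letter).

```python
from typing import Dict, List


def break_opportunities(text: str, classes: Dict[str, str]) -> List[int]:
    """Return every index i where a break is allowed before text[i].

    `classes` maps single characters to a line break class; anything not
    listed is treated as AL, and U+0020 is always SP (toy assumptions).
    """
    def cls(ch: str) -> str:
        return "SP" if ch == " " else classes.get(ch, "AL")

    result = []
    for i in range(1, len(text)):
        before, after = cls(text[i - 1]), cls(text[i])
        if after == "SP":
            continue            # never break before a space
        if after == "EX":
            continue            # never break before EX, even after spaces
        if before == "SP":
            result.append(i)    # otherwise a break is allowed after spaces
            continue
        if before == "AL" and after == "AL":
            continue            # keep runs of letters together
        result.append(i)        # toy default, e.g. EX directly before AL
    return result


if __name__ == "__main__":
    sample = "ab \u060Ccd"      # letters, space, ARABIC COMMA, letters
    print(break_opportunities(sample, {}))                 # [3]
    print(break_opportunities(sample, {"\u060C": "EX"}))   # [4]
```

With the comma treated as AL, the only opportunity is before the comma (after
the space). With EX, that opportunity disappears and a new one appears between
the comma and the following letter.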

ETHIOPIC

LineBreak-4.1.0d7.txt:1360;AL # ETHIOPIC SECTION MARK

Daniel Yacob writes: "this is used like a dingbat, therefore AL is appropriate:
Only white space or another section mark should appear on a line
with a section mark.  Simulating the section mark with an asterisk,
example usage would be:

 :    :    :    :    :   :    :
Abcd efgh ijkl mnop qrst uvwx yz.

         *  *  *  *  *

Zyxw vuts rqpo nmlk jihg fedc ba.
 :    :    :    :    :   :    :                "

MONGOLIAN (FYI)

The existing classification of Mongolian Punctuation is
unusual in that it classifies them all the same as letters.
This seems to be an oversight. However, there is not yet
conclusive evidence in favor of a better recommendation.

Current status:

1800;AL # MONGOLIAN BIRGA
1801;AL # MONGOLIAN ELLIPSIS
1802;AL # MONGOLIAN COMMA
1803;AL # MONGOLIAN FULL STOP
1804;AL # MONGOLIAN COLON
1805;AL # MONGOLIAN FOUR DOTS
1807;AL # MONGOLIAN SIBE SYLLABLE BOUNDARY MARKER
1808;AL # MONGOLIAN MANCHU COMMA
1809;AL # MONGOLIAN MANCHU FULL STOP

I would have expected to see EX, or even IS for most of these, or
potentially BA.

NEW TAI LUE

LineBreak-4.1.0d7.txt:19DE;AL # NEW TAI LUE SIGN LAE
LineBreak-4.1.0d7.txt:19DF;AL # NEW TAI LUE SIGN LAEV

Their general category is given as "Po" in the proposal,
but that may be incorrect, as the proposal author states
categorically: "These are letters"

Change GC from Po --> Lo

BUGINESE

LineBreak-4.1.0d7.txt:1A1E;BA # BUGINESE PALLAWA
LineBreak-4.1.0d7.txt:1A1F;AL # BUGINESE END OF SECTION

Based on data from the proposal, I suggest we treat the first
as BA, as the proposal states an analogy to period and comma,
and the second as AL, as it seems similar in use to the paragraph
mark in the only examples shown.

COPTIC

Detailed information on linebreak behavior is still lacking for these
characters, but the following presents my best 'guess' of the line
break property, based on suspected analogy (mainly by name)
and on the fact that Coptic also uses recently added general
punctuation whose behavior is similar to what is proposed here.

LineBreak-4.1.0d7.txt:2CF9;BA # COPTIC OLD NUBIAN FULL STOP
LineBreak-4.1.0d7.txt:2CFA;BA # COPTIC OLD NUBIAN DIRECT QUESTION MARK
LineBreak-4.1.0d7.txt:2CFB;BA # COPTIC OLD NUBIAN INDIRECT QUESTION MARK
LineBreak-4.1.0d7.txt:2CFC;BA # COPTIC OLD NUBIAN VERSE DIVIDER
LineBreak-4.1.0d7.txt:2CFD;AL # COPTIC FRACTION ONE HALF
LineBreak-4.1.0d7.txt:2CFE;BA # COPTIC FULL STOP
LineBreak-4.1.0d7.txt:2CFF;BA # COPTIC MORPHOLOGICAL DIVIDER

Rationale: unless there's a need to support punctuation separated by
a space from the preceding letter, BA is a reasonable choice for
dividing or sentence ending punctuation. (Otherwise, EX might have
been preferable).

Using AL for the fraction keeps it together with numbers or words
without triggering special rules for numeric punctuation.

GREEK PUNCTUATION

Information is still lacking for this character:

LineBreak-4.1.0d7.txt:2E16;** # DOTTED RIGHT-POINTING ANGLE

If no other information comes forward during the UTC meeting I
suggest we treat this as AL. There seems to be no reason to let
it allow breaks after, and I would want confirmation before allowing
breaks before as a default.

(It's used as an editorial pointer or marker: "diple periestigmene").

KHAROSHTHI

Andrew suggests: "To summarize the ... script: line breaks may
occur in any position except before a dependent sign, that is to say not between
a sign and a combining vowel diacritic or other combining modifier
(this is probably the same as with Devanagari, only Kharosthi is Right to Left).
Breaks between consecutive numbers are avoided."

This would suggest that independent letters be treated as ID, not AL, however
in a second message Andrew suggested that for scholarly use, AL is the
better default, therefore no change from the beta.
Also, the numbers should remain AL. (not NU, as that is reserved for
decimal digits that interact with decimal punctuation).

For punctuation he writes: "All punctuation signs should break after the sign,
so that the sign should not occur at the beginning of a line. The exception
to this is 10A58 # KHAROSHTHI PUNCTUATION LINES, which only occurs at
the beginning of a line, but in this case may be set off by a hard return."

This would most easily be accomplished by using BA for the punctuation
signs in general and AL for the "LINES" sign:

LineBreak-4.1.0d7.txt:10A50;BA # KHAROSHTHI PUNCTUATION DOT
LineBreak-4.1.0d7.txt:10A51;BA # KHAROSHTHI PUNCTUATION SMALL CIRCLE
LineBreak-4.1.0d7.txt:10A52;BA # KHAROSHTHI PUNCTUATION CIRCLE
LineBreak-4.1.0d7.txt:10A53;BA # KHAROSHTHI PUNCTUATION CRESCENT BAR
LineBreak-4.1.0d7.txt:10A54;BA # KHAROSHTHI PUNCTUATION MANGALAM
LineBreak-4.1.0d7.txt:10A55;BA # KHAROSHTHI PUNCTUATION LOTUS
LineBreak-4.1.0d7.txt:10A58;AL # KHAROSHTHI PUNCTUATION LINES

YI

As a result of changing the general category for this character, its LB property
was adjusted from ID to NS in analogy to U+3005. I'm noting this here to
make sure that this is covered by a UTC decision (as I didn't see anything
for this in the minutes of the last meeting).

A015;ID->NS # YI SYLLABLE WU

TIBETAN

There are potentially some issues with the Tibetan line break properties
as currently assigned in the standard. The beta file makes some changes, and
the text of UAX#14 suggests some additional changes. These need to be reconciled.

[I plan to review this issue and provide a revision of this document].


OTHER

All other changes of Linebreak properties for 4.1.0 relative to 4.0.1 are
already documented both in the beta data file and in the proposed
update for UAX#14 that's been out for review.

By default, for newly encoded characters:
o all Letters and ordinary symbols are given AL
o all decimal digits are given NU
o all combining marks are given CM
o currency symbols are given PO or PR (postfix or prefix)
o all brackets are given OP or CL (open or close)
o most sentence or phrase-ending punctuation is given BA
o all ambiguous quotation marks are given QU
o all wide characters are given ID (ideographic)

Where a clear analog exists in another script, the default
assignment would be to match.
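
The defaults listed above can be approximated from the general category
and East Asian width alone. The following Python sketch is an illustration
only, not how the data file was actually produced. The function name
default_line_break is hypothetical, the mapping of Pi/Pf to QU and of Po to
BA is my assumption about how "ambiguous quotation marks" and "sentence or
phrase-ending punctuation" would be detected, and the order of the tests is
a guess. It does not attempt the "match the analog in another script" step.

```python
import unicodedata


def default_line_break(ch: str) -> str:
    """Guess a default line break class for a single character (sketch)."""
    gc = unicodedata.category(ch)
    if gc.startswith("M"):
        return "CM"        # combining marks
    if gc == "Ps":
        return "OP"        # opening brackets
    if gc == "Pe":
        return "CL"        # closing brackets
    if gc in ("Pi", "Pf"):
        return "QU"        # assumption: initial/final quotes are "ambiguous"
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return "ID"        # wide characters
    if gc == "Nd":
        return "NU"        # decimal digits
    if gc == "Sc":
        return "PR|PO"     # prefix or postfix; needs knowledge of usage
    if gc == "Po":
        return "BA"        # assumption: "most" other punctuation
    return "AL"            # letters, ordinary symbols, everything else


if __name__ == "__main__":
    for ch in ("a", "7", "\u0301", "(", ")", "\u201C", "\u4E00", "$"):
        print(f"U+{ord(ch):04X} {default_line_break(ch)}")
```

The currency case deliberately returns both candidates, since the choice
between PR and PO depends on how the symbol is used and cannot be derived
from the general category.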

The current document discusses only those cases where a different
choice was made or the exact behavior of a character was
not clear from the outset.