PROPOSED UPDATES TO 4.1.0 LINEBREAK PROPERTIES

Author: Asmus Freytag
Date: 2005-02-06
Revision: 2					L2/05-062
Date: 2005-02-08					
Revision: 3					L2/05-062R

This document contains my summary of how to resolve beta feedback on
line break properties, including information from the discussion with various
proposal submitters and others with expert knowledge.

It also includes some proposed changes to GC for consistency.

A designation "LineBreak-4.1.0d7.txt:" means that no property had been
assigned up to that point (using a ** in the data file as placeholder).
All other characters mentioned here have had tentative beta assignments 
or are existing characters (pre 4.1.0).
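
For readers who want to process the entries in this document mechanically,
the following is a minimal sketch of a parser for the notation used here.
It is not part of any Unicode tooling; the function name parse_entry and
the returned field names are assumptions made for illustration only.

```python
import re

# Matches entries as they are written in this document, for example
#   "060C;AL->EX # ARABIC COMMA"
#   "LineBreak-4.1.0d7.txt:05C6;** # HEBREW PUNCTUATION NUN HAFUKHA"
ENTRY = re.compile(
    r"^(?:(?P<file>LineBreak-[\w.]+\.txt):)?"       # optional beta-file prefix
    r"(?P<cp>[0-9A-F]{4,6});"                       # code point in hex
    r"(?P<old>\*\*|[A-Z][A-Z0-9])"                  # class, or ** placeholder
    r"(?:\s*-+>\s*(?P<new>[A-Z][A-Z0-9]))?"         # optional "->" new class
    r"\s*#\s*(?P<name>.+?)\s*$"                     # character name comment
)


def parse_entry(line):
    """Parse one entry; return a dict, or None if the line is not an entry."""
    m = ENTRY.match(line.strip())
    if m is None:
        return None
    old = m.group("old")
    return {
        "codepoint": int(m.group("cp"), 16),
        "old": None if old == "**" else old,   # ** means "no class yet"
        "new": m.group("new"),                 # None when no change is shown
        "name": m.group("name"),
        # The file-name prefix marks characters with no prior assignment.
        "previously_unassigned": m.group("file") is not None,
    }
```

For example, parse_entry("060C;AL->EX # ARABIC COMMA") yields old "AL" and
new "EX", while a line of ordinary prose yields None.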

CONTROL CODES
(This has been updated from Rev. 2)

Kent Karlsson writes: "the following should be listed as BK...since at 
least the bidi alg. considers them to be paragraph boundaries."

000B;CM->BK # <control> tab
000C;BK # <control> ff
001C;CM->BK # <control> information separator four
001D;CM->BK # <control> information separator three 
001E;CM->BK # <control> information separator two
0085;NL->BK # <control> next line

We should not make TAB a mandatory line break, since it doesn't break lines.
We should not make the IS2-4 mandatory line breaks as they are not part
of the set of characters used in text files which need to be line broken.

The classes NL and BK work the same, so it would be 
possible to remove NL, but it doesn't really matter.

Recommend: no change.

We should reject his related comment to delete name comments in 
UnicodeData.txt for control codes.

HEBREW
(This has been updated from Rev. 2)

LineBreak-4.1.0d7.txt:05C6;** # HEBREW PUNCTUATION NUN HAFUKHA

According to Peter Kirk, this comes after a word and should not allow a
break before it even if separated by a space (which would require a line
break value of EX).
Mark Shoulson thinks it can occur in a bracketing way (i.e. start a line).
If that is true, BA would be more appropriate, but Peter disagrees.

Change ** -> EX

ARABIC

As a result of a review of these properties on the bidi list, I propose to 
change these existing characters from AL to EX for 4.1.0:

060C;AL->EX # ARABIC COMMA
061B;AL->EX # ARABIC SEMICOLON
061F;AL->EX # ARABIC QUESTION MARK
066A;AL->EX # ARABIC PERCENT SIGN
06D4;AL->EX # ARABIC FULL STOP

These are only used as sentence ending punctuation and are not used as part 
of numbers, which makes them similar to ! and ?.

Kamal Mansour writes: "In traditional Arabic typography, one often sees spaces 
surrounding a punctuation mark such as comma or any of the others above. Over 
the past decade, DTP has somewhat reduced the frequency of this practice, but
for the purpose of an algorithm, one couldn't count on the lack of white space
between a word and an adjoining punctuation mark. The situation for Arabic
would not be so different from French practice with regard to spacing around
punctuation."

This is precisely the reason for which the EX class was designed.

The effect of this would be to allow no linebreaks before these characters,
even if preceded by whitespace, and slightly more linebreaks after them, 
in particular if directly followed by letters or numbers.

The same rationale holds for this newly assigned character:

LineBreak-4.1.0d7.txt:061E;EX # ARABIC TRIPLE DOT PUNCTUATION MARK
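
The effect of moving a punctuation mark from AL to EX, as described above,
can be illustrated with a toy model. This is only a sketch under stated
assumptions: it models just three rules (no break before EX even after
spaces, no break between directly adjacent AL characters, and a break
allowed otherwise), it is not the UAX#14 algorithm, and the function name
break_before is made up for this example.

```python
def break_before(classes, i):
    """Return True if a line break is allowed before classes[i].

    classes is a list of line break class names restricted to
    "AL", "EX" and "SP" (an assumption of this toy model).
    """
    if i <= 0 or i >= len(classes):
        return False              # no break at the very start or past the end
    cur, prev = classes[i], classes[i - 1]
    if cur == "SP":
        return False              # never break before a space
    if cur == "EX":
        return False              # never break before EX, even after spaces
    if prev == "SP":
        return True               # break opportunity after spaces
    if prev == "AL" and cur == "AL":
        return False              # keep adjacent letters together
    return True                   # e.g. EX directly followed by AL


# "word , word" with the comma classified as AL, then as EX
as_al = ["AL", "SP", "AL", "SP", "AL"]
as_ex = ["AL", "SP", "EX", "SP", "AL"]
print(break_before(as_al, 2), break_before(as_ex, 2))

# "word,word" with no spaces: break opportunity after the comma
print(break_before(["AL", "AL", "AL"], 2), break_before(["AL", "EX", "AL"], 2))
```

In this model the space-separated comma can start a line when it is AL but
not when it is EX, and the EX comma gains a break opportunity after it when
a letter follows directly.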

THAI
(This has been updated from Rev. 2)

Kent Karlsson writes: "The following should have line break property BA (compare other dandas)"

0E2F;SA # THAI CHARACTER PAIYANNOI
0E5A;NS->BA # THAI CHARACTER ANGKHANKHU
0E5B;NS->BA # THAI CHARACTER KHOMUT

Currently two of these are NS, which acts similarly to BA, except that there
is no break between a CL and an NS even if spaces intervene.

Change 0E5A and 0E5B from NS to BA.
Defer on 0E2F since the case is not clear and the requestor did not supply
documentation.
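
The one difference between NS and BA noted above can be sketched as
follows. This is an illustrative toy, not the UAX#14 algorithm: it only
answers whether a break is allowed before a mark of class BA or NS, and
the function name break_before_mark is an assumption of this example.

```python
def break_before_mark(classes, i):
    """Return True if a break is allowed before classes[i] (BA or NS).

    Modelled rules (a simplification):
      - no break directly before BA or NS
      - after spaces a break is allowed, except before NS when the
        spaces follow a CL
    """
    cur = classes[i]
    if cur not in ("BA", "NS"):
        raise ValueError("this sketch only handles BA and NS")
    j = i - 1
    while j >= 0 and classes[j] == "SP":
        j -= 1                     # skip back over intervening spaces
    if j == i - 1:
        return False               # nothing, or a non-space, directly before
    if cur == "NS" and j >= 0 and classes[j] == "CL":
        return False               # CL SP* NS stays together
    return True


print(break_before_mark(["CL", "SP", "NS"], 2))   # closing bracket, space, NS
print(break_before_mark(["CL", "SP", "BA"], 2))   # same context with BA
```

Only the first case differs between the two classes; in every other
context this sketch treats NS and BA alike.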

ETHIOPIC
(This has been updated from Rev. 2)

LineBreak-4.1.0d7.txt:1360;AL # ETHIOPIC SECTION MARK

Daniel Yacob writes: "this is used like a dingbat, therefore AL is appropriate:
Only white space or another section mark should appear on a line
with a section mark.  Simulating the section mark with an asterisk,
example usage would be:

 :    :    :    :    :   :    :
Abcd efgh ijkl mnop qrst uvwx yz.

         *  *  *  *  *

Zyxw vuts rqpo nmlk jihg fedc ba.
 :    :    :    :    :   :    :                "

Change the LB category from ** to AL, but also change the
GC from Po to So to reflect that this is not used as a regular
punctuation character.

RUNIC
(This has been updated from Rev. 2)

Based on a suggestion from Mattias Ellert,
change three characters as follows:

16EB;AL->BA # RUNIC SINGLE PUNCTUATION
16EC;AL->BA # RUNIC MULTIPLE PUNCTUATION
16ED;AL->BA # RUNIC CROSS PUNCTUATION

These characters are used as word separators
like similar punctuation we have assigned BA.

KHMER
(This has been updated from Rev. 2)

Kent Karlsson writes: "The following should have line break property BA (compare other dandas)"
17D4;NS->BA # KHMER SIGN KHAN
17D5;BA # KHMER SIGN BARIYOOSAN
17D8;NS->BA # KHMER SIGN BEYYAL
17DA;NS->BA # KHMER SIGN KOOMUUT

Currently one of these is already BA and three of these are NS, which acts
similarly to BA, except that there is no break between a CL and an NS even
if spaces intervene.

Change 17D4, 17D8 and 17DA from NS to BA. 

MONGOLIAN (FYI)
(This has been updated from Rev. 2)

The existing classification of Mongolian Punctuation is
unusual in that it classifies them all the same as letters.
This seems to be an oversight. However, there is not yet
conclusive evidence in favor of a better recommendation.

Current status:

1800;AL # MONGOLIAN BIRGA
1801;AL # MONGOLIAN ELLIPSIS
1802;AL # MONGOLIAN COMMA
1803;AL # MONGOLIAN FULL STOP
1804;AL # MONGOLIAN COLON
1805;AL # MONGOLIAN FOUR DOTS
1807;AL # MONGOLIAN SIBE SYLLABLE BOUNDARY MARKER
1808;AL # MONGOLIAN MANCHU COMMA
1809;AL # MONGOLIAN MANCHU FULL STOP

Andrew West writes: "For 1802, 1803, 1808 and 1809 (Mongolian and Manchu 
commas/full stops) AL is definitely wrong. 

To my mind, which of IS, EX or BA is appropriate depends on
whether these punctuation marks must be separated from preceding and/or
following Mongolian text by a space character or not. I don't know enough about
Mongolian typography to answer that question, and line-breaking issues are not
addressed in Professor Choijinzhab's book, but my feeling is that these
punctuation marks need not be separated from preceding or following Mongolian
text by space characters, in which case neither IS nor EX would be appropriate
as they would inhibit line-breaking ... in certain circumstances. Thus I would
guess that BA is the most appropriate line-breaking class for these four
punctuation marks, as that would ensure that there is always a line-break
opportunity after them.

BA is probably also appropriate for 1805 (Mongolian four dots) and 1804
(Mongolian colon). Probably 1800 (birga) and 1801 (ellipsis) are OK as AL."

On 1807 (Sibe syllable boundary marker) there was a question whether it
should also become a BA, but Martin Hejdra was able to answer:
"just an explanation on the Sibe syllable marker: I think I finally
understand it's use, which is merely as any other letter in a few words
where a separate stroke is needed between syllables, probably vowels only
(the loanword zhuyi into Sibe comes to mind; but I also found necessity of
its use in a few Manchu cases.)

Therefore, it should normally not break before or after, just as any letter."

As a result the following are proposed:

1800;AL # MONGOLIAN BIRGA (unchanged)
1801;AL # MONGOLIAN ELLIPSIS (unchanged)
1802;AL->BA # MONGOLIAN COMMA
1803;AL->BA # MONGOLIAN FULL STOP
1804;AL->BA # MONGOLIAN COLON
1805;AL->BA # MONGOLIAN FOUR DOTS
1807;AL # MONGOLIAN SIBE SYLLABLE BOUNDARY MARKER (unchanged)
1808;AL->BA # MONGOLIAN MANCHU COMMA
1809;AL->BA # MONGOLIAN MANCHU FULL STOP


NEW TAI LUE

LineBreak-4.1.0d7.txt:19DE;AL # NEW TAI LUE SIGN LAE
LineBreak-4.1.0d7.txt:19DF;AL # NEW TAI LUE SIGN LAEV

Their general category is given as "Po" in the proposal,
but that may be incorrect, as the proposal author states
categorically: "These are letters"

Change GC from Po --> Lo for 19DE and 19DF

BUGINESE

LineBreak-4.1.0d7.txt:1A1E;BA # BUGINESE PALLAWA
LineBreak-4.1.0d7.txt:1A1F;AL # BUGINESE END OF SECTION

Based on data from the proposal, I suggest we treat the first
as BA as the proposal states an analogy to period and comma
and the second as AL as it seems similar in use to the Paragraph
mark in the only examples shown.

Change 1A1E from ** to BA and 1A1F from ** to AL

SUPER/SUBSCRIPTS
(Updated from revision 2)

The digits in this block have line break property AL, not
NU, since we took a deliberate action to not recognize
these as Nd.

Kent Karlsson suggests that the super/subscript digits be changed
back to Nd (decimal digit). He also suggests creating two pseudo
scripts so that digit strings of the same kind can still be
processed together.

This seems a lot of effort. We should simply make it clear that
parsers are allowed to parse numerical expressions involving
characters that are not Nd.

Recommend: no change.
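
To illustrate the point that a parser need not restrict itself to Nd, here
is a minimal sketch that reads a run of superscript digits using the digit
value recorded in the Unicode Character Database. It relies on Python's
standard unicodedata module; the function name parse_digit_string is an
assumption made for this example, and no attempt is made to check that all
digits in a run are of the same kind.

```python
import unicodedata


def parse_digit_string(text):
    """Parse a run of digit characters that need not be General Category Nd."""
    if not text:
        raise ValueError("empty digit string")
    value = 0
    for ch in text:
        # unicodedata.digit() covers characters with a digit value,
        # including superscript and subscript digits (gc=No).
        d = unicodedata.digit(ch, None)
        if d is None:
            raise ValueError(f"not a digit: {ch!r}")
        value = value * 10 + d
    return value


superscript = "\u00b9\u2070\u00b2"            # SUPERSCRIPT ONE, ZERO, TWO
print(unicodedata.category("\u00b2"))         # general category of U+00B2
print(parse_digit_string(superscript))        # numeric value of the run
```

Python's built-in int() rejects the same string, which shows that a parser
has to opt in to this behavior explicitly.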

COPTIC

Detailed information on linebreak behavior is still lacking for these 
characters, but the following presents my best 'guess' of line
break property based on suspected analogy (mainly by name)
and the fact that Coptic also uses recently added General 
punctuation with similar behavior as proposed here.

LineBreak-4.1.0d7.txt:2CF9;BA # COPTIC OLD NUBIAN FULL STOP
LineBreak-4.1.0d7.txt:2CFA;BA # COPTIC OLD NUBIAN DIRECT QUESTION MARK
LineBreak-4.1.0d7.txt:2CFB;BA # COPTIC OLD NUBIAN INDIRECT QUESTION MARK
LineBreak-4.1.0d7.txt:2CFC;BA # COPTIC OLD NUBIAN VERSE DIVIDER
LineBreak-4.1.0d7.txt:2CFD;AL # COPTIC FRACTION ONE HALF
LineBreak-4.1.0d7.txt:2CFE;BA # COPTIC FULL STOP
LineBreak-4.1.0d7.txt:2CFF;BA # COPTIC MORPHOLOGICAL DIVIDER

Rationale: unless there's a need to support punctuation separated by
a space from the preceding letter, BA is a reasonable choice for
dividing or sentence ending punctuation. (Otherwise, EX might have
been preferable).

Using AL for the fraction keeps it together with numbers or words
without triggering special rules for numeric punctuation.

GREEK PUNCTUATION
(Updated from Rev 2)

Information is still lacking for this character:

LineBreak-4.1.0d7.txt:2E16;** # DOTTED RIGHT-POINTING ANGLE

If no other information comes forward during the UTC meeting I
suggest we treat this as AL. There seems to be no reason to let
it allow breaks after, and I would want confirmation before allowing
breaks before as a default.

(It's used as an editorial pointer or marker: "diple periestigmene").

This has now been confirmed; in addition, the character tends
to be used at the beginning of a line.

KHAROSHTHI
(This has been updated in Rev. 2)

Andrew suggests: " To summarize the ... script: line breaks may 
occur in any position except before a dependent sign, that is to say not between 
a sign and a combining vowel diacritic or other combining modifier 
(this is probably the same as with Devanagari, only Kharosthi is Right to Left). 
Breaks between consecutive numbers are avoided."

This would suggest that independent letters be treated as ID, not AL. However,
in a second message Andrew suggested that for scholarly use, AL is the
better default; therefore there is no change from the beta.
Also, the numbers should remain AL (not NU, as that is reserved for
decimal digits that interact with decimal punctuation).

For punctuation he writes: "All punctuation signs should  break after the sign, 
so that the sign should not occur at the beginning of a line. The exception 
to this is 10A58 # KHAROSHTHI PUNCTUATION LINES, which only occurs at 
the beginning of a line, but in this case may be set off by a hard return."

This would most easily be accomplished by using BA for the punctuation
signs and AL for the "LINES":

LineBreak-4.1.0d7.txt:10A50;BA # KHAROSHTHI PUNCTUATION DOT
LineBreak-4.1.0d7.txt:10A51;BA # KHAROSHTHI PUNCTUATION SMALL CIRCLE
LineBreak-4.1.0d7.txt:10A52;BA # KHAROSHTHI PUNCTUATION CIRCLE
LineBreak-4.1.0d7.txt:10A53;BA # KHAROSHTHI PUNCTUATION CRESCENT BAR
LineBreak-4.1.0d7.txt:10A54;BA # KHAROSHTHI PUNCTUATION MANGALAM
LineBreak-4.1.0d7.txt:10A55;BA # KHAROSHTHI PUNCTUATION LOTUS
LineBreak-4.1.0d7.txt:10A58;AL # KHAROSHTHI PUNCTUATION LINES

The Kharoshthi digits
10A40;AL # KHAROSHTHI DIGIT ONE
10A41;AL # KHAROSHTHI DIGIT TWO
10A42;AL # KHAROSHTHI DIGIT THREE
10A43;AL # KHAROSHTHI DIGIT FOUR

are not a complete set of decimal digits, therefore they are correctly
given LB property AL (general letter and symbol).
As a result their general category should be No, not Nd as suggested
by Kent Karlsson.

Change GC from Nd to No.

YI

As a result of changing general category for this character, its LB property
was adjusted from ID to NS in analogy to U+3005. I'm noting this here to 
make sure that this is covered by a UTC decision (as I didn't see anything
for this in the minutes of the last meeting).

A015;ID->NS # YI SYLLABLE WU

TIBETAN

There are potentially some issues with the Tibetan line break properties 
as currently assigned in the standard. The beta file makes some changes, 
the text of UAX#14 suggests some additional changes. These need to be reconciled.

[I plan to review this issue and provide a revision of this document].


OTHER

All other changes of Linebreak properties for 4.1.0 relative to 4.0.1 are 
already documented both in the beta data file and in the proposed 
update for UAX#14 that's been out for review.

By default, for newly encoded characters:
o all Letters and ordinary symbols are given AL
o all decimal digits are given NU
o all combining marks are given CM
o currency symbols are given PO or PR (postfix or prefix)
o all brackets are given OP or CL (open or close)
o most sentence or phrase-ending punctuation is given BA
o all ambiguous quotation marks are given QU
o all wide characters are given ID (ideographic)

Where a clear analog exists in another script, the default
assignment would be to match.
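
The defaults listed above can be approximated from the General Category
and East Asian Width of a character. The following is a rough sketch, not
the procedure actually used to generate the data file: the function name
default_line_break, the order of the checks, and the choice of PR for all
currency symbols are assumptions made for illustration.

```python
import unicodedata


def default_line_break(ch):
    """Guess a default line break class for a newly encoded character.

    Returns None when none of the defaults listed above applies, in
    which case the class has to be assigned by hand.
    """
    gc = unicodedata.category(ch)
    if gc.startswith("M"):
        return "CM"                      # combining marks
    if gc == "Nd":
        return "NU"                      # decimal digits
    if gc == "Ps":
        return "OP"                      # opening brackets
    if gc == "Pe":
        return "CL"                      # closing brackets
    if gc in ("Pi", "Pf"):
        return "QU"                      # ambiguous quotation marks
    if gc == "Sc":
        return "PR"                      # assumption: PO vs PR needs review
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return "ID"                      # wide characters
    if gc.startswith("L") or gc in ("Sm", "Sk", "So"):
        return "AL"                      # letters and ordinary symbols
    if gc == "Po":
        return "BA"                      # most phrase-ending punctuation
    return None


for ch in ["a", "5", "\u0301", "(", ")", "\u201c", "\u4e00", "\u0964"]:
    print(f"U+{ord(ch):04X}", default_line_break(ch))
```

Such a guess is only a starting point; where a clear analog exists in
another script, matching that analog takes precedence.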

The current document discusses only those cases where a different
choice was made or the exact behavior of a character was 
not clear from the outset.

Note: Some currency symbols can be either postfix or prefix for the
same character code. This is currently not handled in the default
algorithm.

[END]