L2/05-062

Proposed Updates To 4.1.0 Linebreak Properties

Author: Asmus Freytag
Date: 2005-02-06
Revision: 2

This document contains my summary of how to resolve beta feedback on line break properties, including information from the discussion with various proposal submitters and others with expert knowledge.

A designation "LineBreak-4.1.0d7.txt:" means that no property had been assigned up to that point (using a ** in the data file as placeholder). All other characters mentioned here have had tentative beta assignments or are existing characters (pre 4.1.0).

Compared to the original revision of this document which was circulated on the unicore list, the HEBREW, KHAROSHTHI and OTHER sections have been substantially revised, based on feedback received.

HEBREW

LineBreak-4.1.0d7.txt:05C6;** # HEBREW PUNCTUATION NUN HAFUKHA

This comes after a word and should not allow a break even if separated by a space according to Peter Kirk (which would require a line break value of EX). Mark Shoulson thinks it can occur in a bracketing way (i.e. start a line). If that is true, BA would be more appropriate.

Change ** -> BA

ARABIC

As a result of a review of these properties on the bidi list, I propose to change these existing characters from AL to EX for 4.1.0:

060C;AL->EX # ARABIC COMMA
061B;AL->EX # ARABIC SEMICOLON
061F;AL->EX # ARABIC QUESTION MARK
066A;AL->EX # ARABIC PERCENT SIGN
06D4;AL->EX # ARABIC FULL STOP

These are only used as sentence ending punctuation and are not used as part of numbers, which makes them similar to ! and ?

Kamal Mansour writes: "In traditional Arabic typography, one often sees spaces surrounding a punctuation mark such as comma or any of the others above. Over the past decade, DTP has somewhat reduced the frequency of this practice, but for the purpose of an algorithm, one couldn't count on the lack of white space between a word and an adjoining punctuation mark.
The situation for Arabic would not be so different from French practice with regard to spacing around punctuation."

This is precisely the reason for which the EX class was designed. The effect of this would be to allow no linebreaks before these characters, even if preceded by whitespace, and slightly more linebreaks after them, in particular if directly followed by letters or numbers.

The same rationale holds for this newly assigned character:

LineBreak-4.1.0d7.txt:061E;EX # ARABIC TRIPLE DOT PUNCTUATION MARK

ETHIOPIC

LineBreak-4.1.0d7.txt:1360;AL # ETHIOPIC SECTION MARK

Daniel Yacob writes: "this is used like a dingbat, therefore AL is appropriate: Only white space or another section mark should appear on a line with a section mark. Simulating the section mark with an asterisk, example usage would be:

    : : : : : : :
    Abcd efgh ijkl mnop qrst uvwx yz.

                * * * * *

    Zyxw vuts rqpo nmlk jihg fedc ba.
    : : : : : : :
"

MONGOLIAN (FYI)

The existing classification of Mongolian Punctuation is unusual in that it classifies them all the same as letters. This seems to be an oversight. However, there is not yet conclusive evidence in favor of a better recommendation.

Current status:

1800;AL # MONGOLIAN BIRGA
1801;AL # MONGOLIAN ELLIPSIS
1802;AL # MONGOLIAN COMMA
1803;AL # MONGOLIAN FULL STOP
1804;AL # MONGOLIAN COLON
1805;AL # MONGOLIAN FOUR DOTS
1807;AL # MONGOLIAN SIBE SYLLABLE BOUNDARY MARKER
1808;AL # MONGOLIAN MANCHU COMMA
1809;AL # MONGOLIAN MANCHU FULL STOP

I would have expected to see EX, or even IS for most of these, or potentially BA.
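
To make the practical difference between the candidate classes concrete (AL as currently assigned, versus EX or BA), here is a toy model in Python. It is only a sketch under stated assumptions: it knows four classes (AL, EX, BA, SP) and a handful of simplified rules paraphrased from the behavior described in this document, not the full UAX#14 algorithm. The function names are made up for this illustration.

```python
from typing import List


def break_allowed(classes: List[str], i: int) -> bool:
    """Return True if a line break is allowed before position i (i >= 1).

    Toy rules (assumptions, heavily simplified):
      - never break before a space
      - never break before EX, even after spaces
      - never break directly before BA (after a space it is allowed)
      - a break is allowed after spaces otherwise
      - never break between two adjacent AL
      - everything else is a break opportunity
    """
    cur = classes[i]
    if cur == "SP":
        return False
    if cur == "EX":
        return False  # holds even if spaces precede the EX character
    # Look back over any run of spaces.
    j = i - 1
    while j >= 0 and classes[j] == "SP":
        j -= 1
    after_space = j != i - 1
    if after_space:
        return True
    if cur == "BA":
        return False
    if classes[i - 1] == "AL" and cur == "AL":
        return False
    return True


def opportunities(classes: List[str]) -> List[int]:
    """List every index before which a break is allowed."""
    return [i for i in range(1, len(classes)) if break_allowed(classes, i)]


# "word <space> punctuation <space> word", with the punctuation mark
# classified three different ways:
print(opportunities(["AL", "SP", "AL", "SP", "AL"]))  # [2, 4]
print(opportunities(["AL", "SP", "EX", "SP", "AL"]))  # [4]
print(opportunities(["AL", "SP", "BA", "SP", "AL"]))  # [2, 4]

# Punctuation directly between two words, no spaces:
print(opportunities(["AL", "AL", "AL"]))  # []
print(opportunities(["AL", "EX", "AL"]))  # [2]
print(opportunities(["AL", "BA", "AL"]))  # [2]
```

In this model only EX keeps a space-separated punctuation mark from starting a line, which is the property at stake for the Arabic characters above; BA and EX both add a break opportunity after the mark when a letter follows directly.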

NEW TAI LUE

LineBreak-4.1.0d7.txt:19DE;AL # NEW TAI LUE SIGN LAE
LineBreak-4.1.0d7.txt:19DF;AL # NEW TAI LUE SIGN LAEV

Their general category is given as "Po" in the proposal, but that may be incorrect, as the proposal author states categorically: "These are letters"

Change GC from Po --> Lo

BUGINESE

LineBreak-4.1.0d7.txt:1A1E;BA # BUGINESE PALLAWA
LineBreak-4.1.0d7.txt:1A1F;AL # BUGINESE END OF SECTION

Based on data from the proposal, I suggest we treat the first as BA, as the proposal states an analogy to period and comma, and the second as AL, as it seems similar in use to the Paragraph mark in the only examples shown.

COPTIC

Detailed information on linebreak behavior is still lacking for these characters, but the following presents my best 'guess' of line break property based on suspected analogy (mainly by name) and the fact that Coptic also uses recently added General punctuation with similar behavior as proposed here.

LineBreak-4.1.0d7.txt:2CF9;BA # COPTIC OLD NUBIAN FULL STOP
LineBreak-4.1.0d7.txt:2CFA;BA # COPTIC OLD NUBIAN DIRECT QUESTION MARK
LineBreak-4.1.0d7.txt:2CFB;BA # COPTIC OLD NUBIAN INDIRECT QUESTION MARK
LineBreak-4.1.0d7.txt:2CFC;BA # COPTIC OLD NUBIAN VERSE DIVIDER
LineBreak-4.1.0d7.txt:2CFD;AL # COPTIC FRACTION ONE HALF
LineBreak-4.1.0d7.txt:2CFE;BA # COPTIC FULL STOP
LineBreak-4.1.0d7.txt:2CFF;BA # COPTIC MORPHOLOGICAL DIVIDER

Rationale: unless there's a need to support punctuation separated by a space from the preceding letter, BA is a reasonable choice for dividing or sentence ending punctuation. (Otherwise, EX might have been preferable). Using AL for the fraction keeps it together with numbers or words without triggering special rules for numeric punctuation.

GREEK PUNCTUATION

Information is still lacking for this character:

LineBreak-4.1.0d7.txt:2E16;** # DOTTED RIGHT-POINTING ANGLE

If no other information comes forward during the UTC meeting I suggest we treat this as AL.
There seems to be no reason to let it allow breaks after, and I would want confirmation before allowing breaks before as a default. (It's used as an editorial pointer or marker: "diple periestigmene").

KHAROSHTHI

Andrew suggests: "To summarize the ... script: line breaks may occur in any position except before a dependent sign, that is to say not between a sign and a combining vowel diacritic or other combining modifier (this is probably the same as with Devanagari, only Kharosthi is Right to Left). Breaks between consecutive numbers are avoided."

This would suggest that independent letters be treated as ID, not AL. However, in a second message Andrew suggested that for scholarly use AL is the better default; therefore there is no change from the beta. Also, the numbers should remain AL (not NU, as that is reserved for decimal digits that interact with decimal punctuation).

For punctuation he writes: "All punctuation signs should break after the sign, so that the sign should not occur at the beginning of a line. The exception to this is 10A58 # KHAROSHTHI PUNCTUATION LINES, which only occurs at the beginning of a line, but in this case may be set off by a hard return."

This would most easily be accomplished by using BA for the punctuation signs in general and AL for the "LINES" sign:

LineBreak-4.1.0d7.txt:10A50;BA # KHAROSHTHI PUNCTUATION DOT
LineBreak-4.1.0d7.txt:10A51;BA # KHAROSHTHI PUNCTUATION SMALL CIRCLE
LineBreak-4.1.0d7.txt:10A52;BA # KHAROSHTHI PUNCTUATION CIRCLE
LineBreak-4.1.0d7.txt:10A53;BA # KHAROSHTHI PUNCTUATION CRESCENT BAR
LineBreak-4.1.0d7.txt:10A54;BA # KHAROSHTHI PUNCTUATION MANGALAM
LineBreak-4.1.0d7.txt:10A55;BA # KHAROSHTHI PUNCTUATION LOTUS
LineBreak-4.1.0d7.txt:10A58;AL # KHAROSHTHI PUNCTUATION LINES

YI

As a result of changing general category for this character, its LB property was adjusted from ID to NS in analogy to U+3005.
I'm noting this here to make sure that this is covered by a UTC decision (as I didn't see anything for this in the minutes of the last meeting).

A015;ID->NS # YI SYLLABLE WU

TIBETAN

There are potentially some issues with the Tibetan line break properties as currently assigned in the standard. The beta file makes some changes; the text of UAX#14 suggests some additional changes. These need to be reconciled. [I plan to review this issue and provide a revision of this document].

OTHER

All other changes of Linebreak properties for 4.1.0 relative to 4.0.1 are already documented both in the beta data file and in the proposed update for UAX#14 that's been out for review.

By default, for newly encoded characters:

o all Letters and ordinary symbols are given AL
o all decimal digits are given NU
o all combining marks are given CM
o currency symbols are given PO or PR (postfix or prefix)
o all brackets are given OP or CL (open or close)
o most sentence or phrase-ending punctuation is given BA
o all ambiguous quotation marks are given QU
o all wide characters are given ID (ideographic)

Where a clear analog exists in another script, the default assignment would be to match. The current document discusses only those cases where a different choice was made or the exact behavior of a character was not clear from the outset.
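
The default rules listed above can be approximated mechanically from other character properties. The Python sketch below is one rough way to do so and is not the procedure actually used to produce the data file. It rests on several assumptions: General_Category and East_Asian_Width (via the standard `unicodedata` module) stand in for "letters", "brackets", "wide" and so on; the order in which the rules are tested is a guess; currency symbols are arbitrarily given PR, because the choice between PO and PR needs per-character review; and "most" punctuation is simplified to "all other punctuation (Po)". The function name `default_line_break` is hypothetical.

```python
import unicodedata


def default_line_break(ch: str) -> str:
    """Guess a default line break class for a single character.

    A rough approximation of the listed defaults, under the assumptions
    stated in the accompanying text.
    """
    gc = unicodedata.category(ch)
    if gc in ("Mn", "Mc", "Me"):
        return "CM"  # combining marks
    if gc == "Ps":
        return "OP"  # opening brackets
    if gc == "Pe":
        return "CL"  # closing brackets
    if gc in ("Pi", "Pf"):
        return "QU"  # initial/final quotation marks
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return "ID"  # wide and fullwidth characters
    if gc == "Nd":
        return "NU"  # decimal digits
    if gc == "Sc":
        return "PR"  # assumption: PO vs. PR must be decided by review
    if gc == "Po":
        return "BA"  # simplification of "most ... punctuation"
    return "AL"      # letters, ordinary symbols, everything else


for ch in ["a", "5", "\u0301", "(", ")", "\u00ab", "\u4e00", "\u1a1e"]:
    print(f"U+{ord(ch):04X} {default_line_break(ch)}")
```

Such a rule only produces a starting point. For example, it yields BA for U+1A1F BUGINESE END OF SECTION, which has General_Category Po, whereas this document proposes AL for it; cases like that are the ones discussed in the sections above.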