PROPOSED UPDATES TO 4.1.0 LINEBREAK PROPERTIES

Author: Asmus Freytag
Date: 2005-02-06  Revision: 2  L2/05-062
Date: 2005-02-08  Revision: 3  L2/05-062R

This document contains my summary of how to resolve beta feedback on
line break properties, including information from the discussion with
various proposal submitters and others with expert knowledge. It also
includes some proposed changes to GC for consistency.

A designation "LineBreak-4.1.0d7.txt:" means that no property had been
assigned up to that point (using a ** in the data file as placeholder).
All other characters mentioned here have had tentative beta assignments
or are existing characters (pre 4.1.0).

CONTROL CODES (This has been updated from Rev. 2)

Kent Karlsson writes: "the following should be listed as BK...since at
least the bidi alg. considers them to be paragraph boundaries."

000B;CM->BK # tab
000C;BK # ff
001C;CM->BK # information separator four
001D;CM->BK # information separator three
001E;CM->BK # information separator two
0085;NL->BK # next line

We should not make TAB a mandatory line break, since it doesn't break
lines. We should not make IS2-4 mandatory line breaks, as they are not
part of the set of characters used in text files that need to be line
broken. The classes NL and BK work the same, so it would be possible to
remove NL, but it doesn't really matter.

Recommend: no change.

We should reject his related comment to delete name comments in
UnicodeData.txt for control codes.

HEBREW (This has been updated from Rev. 2)

LineBreak-4.1.0d7.txt:05C6;** # HEBREW PUNCTUATION NUN HAFUKHA

This comes after a word and, according to Peter Kirk, should not allow
a break even if separated by a space (which would require a line break
value of EX). Mark Shoulson thinks it can occur in a bracketing way
(i.e. start a line).
If that is true, BA would be more appropriate, but Peter disagrees.

Change ** -> EX

ARABIC

As a result of a review of these properties on the bidi list, I propose
to change these existing characters from AL to EX for 4.1.0:

060C;AL->EX # ARABIC COMMA
061B;AL->EX # ARABIC SEMICOLON
061F;AL->EX # ARABIC QUESTION MARK
066A;AL->EX # ARABIC PERCENT SIGN
06D4;AL->EX # ARABIC FULL STOP

These are only used as sentence ending punctuation and are not used as
part of numbers, which makes them similar to ! and ?

Kamal Mansour writes: "In traditional Arabic typography, one often sees
spaces surrounding a punctuation mark such as comma or any of the
others above. Over the past decade, DTP has somewhat reduced the
frequency of this practice, but for the purpose of an algorithm, one
couldn't count on the lack of white space between a word and an
adjoining punctuation mark. The situation for Arabic would not be so
different from French practice with regard to spacing around
punctuation."

This is precisely the reason for which the EX class was designed. The
effect of this would be to allow no linebreaks before these characters,
even if preceded by whitespace, and slightly more linebreaks after
them, in particular if directly followed by letters or numbers.

The same rationale holds for this newly assigned character:

LineBreak-4.1.0d7.txt:061E;EX # ARABIC TRIPLE DOT PUNCTUATION MARK

THAI (This has been updated from Rev. 2)

Kent Karlsson writes: "The following should have line break property BA
(compare other dandas)"

0E2F;SA # THAI CHARACTER PAIYANNOI
0E5A;NS->BA # THAI CHARACTER ANGKHANKHU
0E5B;NS->BA # THAI CHARACTER KHOMUT

Currently two of these are NS, which acts similarly to BA, except that
there is no break between a CL and an NS even if spaces intervene.

Change: 0E5A and 0E5B from NS to BA. Defer on 0E2F since the case is
not clear and the requestor did not supply documentation.

ETHIOPIC (This has been updated from Rev. 2)

LineBreak-4.1.0d7.txt:1360;AL # ETHIOPIC SECTION MARK

Daniel Yacob writes: "this is used like a dingbat, therefore AL is
appropriate: Only white space or another section mark should appear on
a line with a section mark. Simulating the section mark with an
asterisk, example usage would be:

   : : : : : : :
   Abcd efgh ijkl mnop qrst uvwx yz.
   * * * * *
   Zyxw vuts rqpo nmlk jihg fedc ba.
   : : : : : : :
"

Change the LB category from ** to AL, but also change the GC from Po to
So to reflect that this is not used as a regular punctuation character.

RUNIC (This has been updated from Rev. 2)

Based on a suggestion from Mattias Ellert, change three characters as
follows:

16EB;AL->BA # RUNIC SINGLE PUNCTUATION
16EC;AL->BA # RUNIC MULTIPLE PUNCTUATION
16ED;AL->BA # RUNIC CROSS PUNCTUATION

These characters are used as word separators, like similar punctuation
to which we have assigned BA.

KHMER (This has been updated from Rev. 2)

Kent Karlsson writes: "The following should have line break property BA
(compare other dandas)"

17D4;NS->BA # KHMER SIGN KHAN
17D5;BA # KHMER SIGN BARIYOOSAN
17D8;NS->BA # KHMER SIGN BEYYAL
17DA;NS->BA # KHMER SIGN KOOMUUT

Currently one of these is already BA and three of these are NS, which
acts similarly to BA, except that there is no break between a CL and an
NS even if spaces intervene.

Change 17D4, 17D8 and 17DA from NS to BA.

MONGOLIAN (FYI) (This has been updated from Rev. 2)

The existing classification of Mongolian punctuation is unusual in that
it classifies all of these characters the same as letters. This seems
to be an oversight. However, there is not yet conclusive evidence in
favor of a better recommendation.
Current status:

1800;AL # MONGOLIAN BIRGA
1801;AL # MONGOLIAN ELLIPSIS
1802;AL # MONGOLIAN COMMA
1803;AL # MONGOLIAN FULL STOP
1804;AL # MONGOLIAN COLON
1805;AL # MONGOLIAN FOUR DOTS
1807;AL # MONGOLIAN SIBE SYLLABLE BOUNDARY MARKER
1808;AL # MONGOLIAN MANCHU COMMA
1809;AL # MONGOLIAN MANCHU FULL STOP

Andrew West writes: "For 1802, 1803, 1808 and 1809 (Mongolian and
Manchu commas/full stops) AL is definitely wrong. To my mind, which of
IS, EX or BA is appropriate depends on whether these punctuation marks
must be separated from preceding and/or following Mongolian text by a
space character or not. I don't know enough about Mongolian typography
to answer that question, and line-breaking issues are not addressed in
Professor Choijinzhab's book, but my feeling is that these punctuation
marks need not be separated from preceding or following Mongolian text
by space characters, in which case neither IS nor EX would be
appropriate as they would inhibit line-breaking ... in certain
circumstances. Thus I would guess that BA is the most appropriate
line-breaking class for these four punctuation marks, as that would
ensure that there is always a line-break opportunity after them. BA is
probably also appropriate for 1805 (Mongolian four dots) and 1804
(Mongolian colon). Probably 1800 (birga) and 1801 (ellipsis) are OK as
AL."

On 1807 (Sibe syllable boundary marker) there was a question whether it
should also become a BA, but Martin Heijdra was able to answer: "just
an explanation on the Sibe syllable marker: I think I finally
understand it's use, which is merely as any other letter in a few words
where a separate stroke is needed between syllables, probably vowels
only (the loanword zhuyi into Sibe comes to mind; but I also found
necessity of its use in a few Manchu cases.) Therefore, it should
normally not break before or after, just as any letter."
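
The practical difference between the classes weighed in the comments
above can be illustrated with a small sketch. It encodes a deliberately
reduced reading of the UAX #14 pair rules for only four classes (AL,
BA, EX, NS), assuming no intervening spaces and no other classes. The
function name may_break and the table are illustrative assumptions, not
part of the proposal or of any library.

```python
# Reduced sketch of UAX #14 pair behaviour. Assumptions: only the four
# classes below occur, the two characters are directly adjacent (no
# spaces), and no other rule of the full algorithm applies.
KNOWN_CLASSES = {"AL", "BA", "EX", "NS"}

# Classes that do not allow a break directly in front of them.
NO_BREAK_BEFORE = {"BA", "EX", "NS"}


def may_break(before: str, after: str) -> bool:
    """Return True if a break opportunity lies between two adjacent classes."""
    if before not in KNOWN_CLASSES or after not in KNOWN_CLASSES:
        raise ValueError("this sketch only covers AL, BA, EX and NS")
    if after in NO_BREAK_BEFORE:
        return False
    # Here the following character is AL: two letters stay together,
    # while a letter after BA, EX or NS may start a new line.
    return before != "AL"


if __name__ == "__main__":
    # A comma classed as AL followed by a letter: no opportunity.
    print(may_break("AL", "AL"))
    # The same comma classed as BA followed by a letter: opportunity.
    print(may_break("BA", "AL"))
```

Under these assumptions, moving a mark such as 1802 from AL to BA is
what turns the position between the mark and a directly following
letter into a break opportunity, while the position before the mark
stays closed.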
As a result the following are proposed:

1800;AL # MONGOLIAN BIRGA (unchanged)
1801;AL # MONGOLIAN ELLIPSIS (unchanged)
1802;AL->BA # MONGOLIAN COMMA
1803;AL->BA # MONGOLIAN FULL STOP
1804;AL->BA # MONGOLIAN COLON
1805;AL->BA # MONGOLIAN FOUR DOTS
1807;AL # MONGOLIAN SIBE SYLLABLE BOUNDARY MARKER (unchanged)
1808;AL->BA # MONGOLIAN MANCHU COMMA
1809;AL->BA # MONGOLIAN MANCHU FULL STOP

NEW TAI LUE

LineBreak-4.1.0d7.txt:19DE;AL # NEW TAI LUE SIGN LAE
LineBreak-4.1.0d7.txt:19DF;AL # NEW TAI LUE SIGN LAEV

Their general category is given as "Po" in the proposal, but that may
be incorrect, as the proposal author states categorically: "These are
letters"

Change GC from Po -> Lo for 19DE and 19DF.

BUGINESE

LineBreak-4.1.0d7.txt:1A1E;BA # BUGINESE PALLAWA
LineBreak-4.1.0d7.txt:1A1F;AL # BUGINESE END OF SECTION

Based on data from the proposal, I suggest we treat the first as BA, as
the proposal states an analogy to period and comma, and the second as
AL, as it seems similar in use to the paragraph mark in the only
examples shown.

Change 1A1E from ** to BA and 1A1F from ** to AL.

SUPER/SUBSCRIPTS (Updated from Rev. 2)

The digits in this block have line break property AL, not NU, since we
took a deliberate action not to recognize these as Nd. Kent Karlsson
suggests that the super/subscript digits be changed back to Nd (decimal
digit). He also suggests creating two pseudo scripts so that digit
strings of the same kind can still be processed together. This seems a
lot of effort. We should simply make it clear that parsers are allowed
to parse numerical expressions involving characters that are not Nd.

Recommend: no change.

COPTIC

Detailed information on linebreak behavior is still lacking for these
characters, but the following presents my best 'guess' of line break
property, based on suspected analogy (mainly by name) and on the fact
that Coptic also uses recently added General Punctuation characters
with behavior similar to that proposed here.

LineBreak-4.1.0d7.txt:2CF9;BA # COPTIC OLD NUBIAN FULL STOP
LineBreak-4.1.0d7.txt:2CFA;BA # COPTIC OLD NUBIAN DIRECT QUESTION MARK
LineBreak-4.1.0d7.txt:2CFB;BA # COPTIC OLD NUBIAN INDIRECT QUESTION MARK
LineBreak-4.1.0d7.txt:2CFC;BA # COPTIC OLD NUBIAN VERSE DIVIDER
LineBreak-4.1.0d7.txt:2CFD;AL # COPTIC FRACTION ONE HALF
LineBreak-4.1.0d7.txt:2CFE;BA # COPTIC FULL STOP
LineBreak-4.1.0d7.txt:2CFF;BA # COPTIC MORPHOLOGICAL DIVIDER

Rationale: unless there's a need to support punctuation separated by a
space from the preceding letter, BA is a reasonable choice for dividing
or sentence ending punctuation. (Otherwise, EX might have been
preferable.) Using AL for the fraction keeps it together with numbers
or words without triggering special rules for numeric punctuation.

GREEK PUNCTUATION (Updated from Rev. 2)

Information is still lacking for this character:

LineBreak-4.1.0d7.txt:2E16;** # DOTTED RIGHT-POINTING ANGLE

If no other information comes forward during the UTC meeting I suggest
we treat this as AL. There seems to be no reason to let it allow breaks
after, and I would want confirmation before allowing breaks before as a
default. (It's used as an editorial pointer or marker: "diple
periestigmene".) This has now been confirmed; in addition, the
character tends to be used at the beginning of a line.

KHAROSHTHI (This has been updated in Rev. 2)

Andrew suggests: " To summarize the ... script: line breaks may occur
in any position except before a dependent sign, that is to say not
between a sign and a combining vowel diacritic or other combining
modifier (this is probably the same as with Devanagari, only Kharosthi
is Right to Left). Breaks between consecutive numbers are avoided."

This would suggest that independent letters be treated as ID, not AL.
However, in a second message Andrew suggested that for scholarly use AL
is the better default; therefore there is no change from the beta.
Also, the numbers should remain AL (not NU, as that is reserved for
decimal digits that interact with decimal punctuation).

For punctuation he writes: "All punctuation signs should break after
the sign, so that the sign should not occur at the beginning of a line.
The exception to this is 10A58 # KHAROSHTHI PUNCTUATION LINES, which
only occurs at the beginning of a line, but in this case may be set off
by a hard return."

This would most easily be accomplished by using BA for the punctuation
signs and AL for the "LINES" sign:

LineBreak-4.1.0d7.txt:10A50;BA # KHAROSHTHI PUNCTUATION DOT
LineBreak-4.1.0d7.txt:10A51;BA # KHAROSHTHI PUNCTUATION SMALL CIRCLE
LineBreak-4.1.0d7.txt:10A52;BA # KHAROSHTHI PUNCTUATION CIRCLE
LineBreak-4.1.0d7.txt:10A53;BA # KHAROSHTHI PUNCTUATION CRESCENT BAR
LineBreak-4.1.0d7.txt:10A54;BA # KHAROSHTHI PUNCTUATION MANGALAM
LineBreak-4.1.0d7.txt:10A55;BA # KHAROSHTHI PUNCTUATION LOTUS
LineBreak-4.1.0d7.txt:10A58;AL # KHAROSHTHI PUNCTUATION LINES

The Kharoshthi digits

10A40;AL # KHAROSHTHI DIGIT ONE
10A41;AL # KHAROSHTHI DIGIT TWO
10A42;AL # KHAROSHTHI DIGIT THREE
10A43;AL # KHAROSHTHI DIGIT FOUR

are not a complete set of decimal digits, therefore they are correctly
given LB property AL (general letter and symbol). As a result their
general category should be No, not Nd as suggested by Kent Karlsson.

Change GC from Nd to No.

YI

As a result of changing the general category for this character, its LB
property was adjusted from ID to NS in analogy to U+3005. I'm noting
this here to make sure that this is covered by a UTC decision (as I
didn't see anything for this in the minutes of the last meeting).

A015;ID->NS # YI SYLLABLE WU

TIBETAN

There are potentially some issues with the Tibetan line break
properties as currently assigned in the standard. The beta file makes
some changes; the text of UAX#14 suggests some additional changes.
These need to be reconciled. [I plan to review this issue and provide a
revision of this document].
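
The data entries in the sections above share a common shape: an
optional "LineBreak-4.1.0d7.txt:" prefix, a code point, a line break
class (or the ** placeholder), an optional "->" change, and a character
name after "#". For readers who want to tabulate the proposed changes,
the following is a minimal sketch of a parser for that shape. The names
Entry and parse_entry, and the field names, are hypothetical and chosen
only for this illustration; the pattern is inferred from the entries in
this document, not from any official file format.

```python
import re
from typing import NamedTuple, Optional


class Entry(NamedTuple):
    code_point: int
    listed: str                # class as listed, or "**" for the placeholder
    changed_to: Optional[str]  # class after "->", or None if no change is shown
    name: str
    is_new: bool               # True if the line carries the d7 file prefix


# Pattern inferred from the entries in this document (an assumption).
# It tolerates "->" and "-->" and optional spaces around the arrow.
_ENTRY = re.compile(
    r"^(?P<prefix>LineBreak-4\.1\.0d7\.txt:)?"
    r"(?P<cp>[0-9A-F]{4,6});"
    r"(?P<listed>\*\*|[A-Z0-9]{2})"
    r"(?:\s*-+>\s*(?P<new>[A-Z0-9]{2}))?"
    r"\s*#\s*(?P<name>.+?)\s*$"
)


def parse_entry(line: str) -> Optional[Entry]:
    """Parse one data entry; return None if the line is not an entry."""
    match = _ENTRY.match(line.strip())
    if match is None:
        return None
    return Entry(
        code_point=int(match.group("cp"), 16),
        listed=match.group("listed"),
        changed_to=match.group("new"),
        name=match.group("name"),
        is_new=match.group("prefix") is not None,
    )


if __name__ == "__main__":
    for sample in (
        "A015;ID->NS # YI SYLLABLE WU",
        "LineBreak-4.1.0d7.txt:10A58;AL # KHAROSHTHI PUNCTUATION LINES",
    ):
        print(parse_entry(sample))
```

Trailing remarks such as "(unchanged)" are kept as part of the name in
this sketch; a fuller tool would need to decide how to treat them.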

OTHER

All other changes of Linebreak properties for 4.1.0 relative to 4.0.1
are already documented both in the beta data file and in the proposed
update for UAX#14 that's been out for review.

By default, for newly encoded characters:

o all Letters and ordinary symbols are given AL
o all decimal digits are given NU
o all combining marks are given CM
o currency symbols are given PO or PR (postfix or prefix)
o all brackets are given OP or CL (open or close)
o most sentence or phrase-ending punctuation is given BA
o all ambiguous quotation marks are given QU
o all wide characters are given ID (ideographic)

Where a clear analog exists in another script, the default assignment
would be to match. The current document discusses only those cases
where a different choice was made or the exact behavior of a character
was not clear from the outset.

Note: Some currency symbols can be either postfix or prefix for the
same character code. This is currently not handled in the default
algorithm.

[END]