L2/04-215

Date: June 7, 2004
Subject: Proposed changes to UAX 29 Text Boundaries
Authors: Deborah Goldsmith (Apple Computer), Mark Davis (IBM)

Feedback from end users has led us to propose some alterations to the word break rules detailed in section 4 of UAX 29, "Text Boundaries." In particular, the following issues have been raised frequently:

1. Non-inclusion of U+005F LOW LINE as a word extension character. This affects editing of programming language identifiers, which often use LOW LINE as a way of including multiple words in a single identifier. LOW LINE is not used in natural language, and there are no other issues we are aware of that would argue against changing the behavior.

2. Inclusion of U+002E FULL STOP as a word extension character. This affects editing of file names containing extensions, e.g. "filename.txt". FULL STOP is treated as a word extension character to deal with abbreviations, such as Ph.D. or T.G.I.F. At first we considered changes to UAX 29 that would distinguish between FULL STOP when used as a word separator and when used in abbreviations, but we discovered some issues with the existing behavior. For example, the current behavior would break Ph.D. as Ph.D|.| This is difficult to correct as in general there is no way to determine whether the final period in an abbreviation also ends a sentence. While it might be possible to come up with a set of heuristics that worked well in most cases and that distinguished between use of FULL STOP in abbreviations vs. usage as a word separator and sentence terminator, our conclusion was that the abbreviation support is not working well enough to be worth keeping, and thus we propose treating FULL STOP as a word separator.

3. Inclusion of U+003A COLON as a word extension character. This also affects editing in programming language contexts. COLON is currently treated as a word extension character to handle the case of abbreviations in Swedish. Swedish can be handled via language-specific tailoring, or the rules proposed below could be extended to use a heuristic to differentiate between usage in abbreviations and in other contexts.

Therefore, we propose the following changes to UAX 29:

A. Remove MidNumLet as a class.

B. Change rules (6), (7), (11), and (12) to remove MidNumLet:

   ALetter x MidLetter ALetter    (6)
   ALetter MidLetter x ALetter    (7)
   Numeric MidNum x Numeric       (11)
   Numeric x MidNum Numeric       (12)

C. Add ExtendNumLet, with contents to be: General_Category=Connector_Punctuation, excluding U+30FB KATAKANA MIDDLE DOT and U+FF65 HALFWIDTH KATAKANA MIDDLE DOT, i.e.:

   005F       ; Pc #     LOW LINE
   203F..2040 ; Pc # [2] UNDERTIE..CHARACTER TIE
   2054       ; Pc #     INVERTED UNDERTIE
   FE33..FE34 ; Pc # [2] PRESENTATION FORM FOR VERTICAL LOW LINE..PRESENTATION FORM FOR VERTICAL WAVY LOW LINE
   FE4D..FE4F ; Pc # [3] DASHED LOW LINE..WAVY LOW LINE
   FF3F       ; Pc #     FULLWIDTH LOW LINE

D. Add two new rules:

   (ALetter | Numeric | Katakana) x ExtendNumLet    (X)
   ExtendNumLet x (ALetter | Numeric | Katakana)    (Y)
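
As an illustration only, the effect of changes A through D can be sketched in Python. The sketch below is not the UAX 29 algorithm. Its character classes are deliberately crude assumptions: ALetter is approximated by `str.isalpha()`, Numeric by `str.isdigit()`, Katakana by a rough code point range, MidLetter by the apostrophe alone, and MidNum by the comma alone. Only the ExtendNumLet set is taken from item C. Rules (5), (8), (9), (10), and (13) come from UAX 29 section 4 rather than from this proposal, and the handling of Format and Extend characters and of CR LF is omitted. The names `word_class`, `no_break`, and `split_words` are hypothetical.

```python
# Simplified sketch of the proposed word break rules. Class membership is an
# assumption made for illustration, not the real UAX 29 property data.

# ExtendNumLet: the ten code points listed in item C of the proposal.
EXTEND_NUM_LET = {0x005F, 0x203F, 0x2040, 0x2054, 0xFE33, 0xFE34,
                  0xFE4D, 0xFE4E, 0xFE4F, 0xFF3F}
MID_LETTER = {0x0027}  # assumption: apostrophe only
MID_NUM = {0x002C}     # assumption: comma only

JOINABLE = ("ALetter", "Numeric", "Katakana")


def word_class(ch):
    """Map one character to a (simplified) word break class name."""
    cp = ord(ch)
    if cp in EXTEND_NUM_LET:
        return "ExtendNumLet"
    if cp in MID_LETTER:
        return "MidLetter"
    if cp in MID_NUM:
        return "MidNum"
    if 0x30A1 <= cp <= 0x30FA:  # rough Katakana letter range (assumption)
        return "Katakana"
    if ch.isdigit():
        return "Numeric"
    if ch.isalpha():
        return "ALetter"
    # With MidNumLet removed, FULL STOP and COLON fall through to here.
    return "Other"


def no_break(classes, i):
    """Return True if no word boundary falls between positions i-1 and i."""
    prev, cur = classes[i - 1], classes[i]
    prev2 = classes[i - 2] if i >= 2 else None
    nxt = classes[i + 1] if i + 1 < len(classes) else None
    return (
        (prev == "ALetter" and cur == "ALetter")                              # (5)
        or (prev == "ALetter" and cur == "MidLetter" and nxt == "ALetter")    # (6)
        or (prev2 == "ALetter" and prev == "MidLetter" and cur == "ALetter")  # (7)
        or (prev == "Numeric" and cur == "Numeric")                           # (8)
        or (prev == "ALetter" and cur == "Numeric")                           # (9)
        or (prev == "Numeric" and cur == "ALetter")                           # (10)
        or (prev2 == "Numeric" and prev == "MidNum" and cur == "Numeric")     # (11)
        or (prev == "Numeric" and cur == "MidNum" and nxt == "Numeric")       # (12)
        or (prev == "Katakana" and cur == "Katakana")                         # (13)
        or (prev in JOINABLE and cur == "ExtendNumLet")                       # (X)
        or (prev == "ExtendNumLet" and cur in JOINABLE)                       # (Y)
    )


def split_words(text):
    """Split text at every position where none of the rules forbids a break."""
    if not text:
        return []
    classes = [word_class(ch) for ch in text]
    pieces, start = [], 0
    for i in range(1, len(text)):
        if not no_break(classes, i):
            pieces.append(text[start:i])
            start = i
    pieces.append(text[start:])
    return pieces


print(split_words("max_value"))     # LOW LINE joins the identifier
print(split_words("filename.txt"))  # FULL STOP now separates
print(split_words("a:b"))           # COLON now separates
```

Under these assumptions, "max_value" stays a single segment through rules (X) and (Y), while "filename.txt" and "a:b" each split into three segments because FULL STOP and COLON no longer belong to any joining class. A Swedish tailoring, as mentioned in item 3, could be modeled in this sketch by adding U+003A to `MID_LETTER`.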