L2/07-370

Date: October 13, 2007
Source: Mark Davis
Subject: Word/Sentence punctuation properties

----------------------

UAX#29 defines word boundaries in terms of certain properties. For comparison, I ran a program that closed the property values under compatibility equivalence. While there is no formal requirement that we obey compatibility equivalence in word boundaries, it was worth reviewing these values to ensure that we weren't missing something -- I also compared to CLDR data.

In doing so, I came up with some recommendations for enhancements for two property values

1. [:Word_Break=MidLetter:]
These are characters that continue an alphabetic word: for example, "can't" counts as a single word.
 

0027 ( ' ) APOSTROPHE
003A ( : ) COLON
00B7 ( · ) MIDDLE DOT
05F4 ( ״ ) HEBREW PUNCTUATION GERSHAYIM
2018 ( ' ) LEFT SINGLE QUOTATION MARK
2019 ( ' ) RIGHT SINGLE QUOTATION MARK
2027 ( ‧ ) HYPHENATION POINT
 

Compatibility closure would add [\u0387\uFE13\uFE55\uFF07\uFF1A]

 
0387 ( · ) GREEK ANO TELEIA
FE13 ( ︓ ) PRESENTATION FORM FOR VERTICAL COLON
FE55 ( ﹕ ) SMALL COLON
FF07 ( ' ) FULLWIDTH APOSTROPHE
FF1A ( : ) FULLWIDTH COLON
 

We should consider adding these to MidLetter. Addition of these shouldn't hurt anything, and makes the format more neutral regarding compatibility equivalence. Even the fullwidth variants should be safe, since these characters only bridge non-alphabetic characters in words, so if either of the surrounding characters are ideographs, this will have no effect.

-------------------------

2. [:Word_Break=MidNum:]
These are characters that continue a numeric word: for example, "99.99" counts as a single word.
This is actually derived from Line_Break = Infix_Numeric and not U+003A (:) COLON

 
002C ( , ) COMMA
002E ( . ) FULL STOP
003B ( ; ) SEMICOLON
037E ( ; ) GREEK QUESTION MARK
0589 ( ։ ) ARMENIAN FULL STOP
060D ( ؍ ) ARABIC DATE SEPARATOR
07F8 ( ߸ ) NKO COMMA
2044 ( ⁄ ) FRACTION SLASH
FE10 ( ︐ ) PRESENTATION FORM FOR VERTICAL COMMA
FE13 ( ︓ ) PRESENTATION FORM FOR VERTICAL COLON
FE14 ( ︔ ) PRESENTATION FORM FOR VERTICAL SEMICOLON
 

Compatibility closure would add [\:\u2024\uFE50\uFE52\uFE54\uFE55\uFF0C\uFF0E\uFF1A\uFF1B]

 
003A ( : ) COLON
2024 ( ․ ) ONE DOT LEADER
FE50 ( ﹐ ) SMALL COMMA
FE52 ( ﹒ ) SMALL FULL STOP
FE54 ( ﹔ ) SMALL SEMICOLON
FE55 ( ﹕ ) SMALL COLON
FF0C ( , ) FULLWIDTH COMMA
FF0E ( . ) FULLWIDTH FULL STOP
FF1A ( : ) FULLWIDTH COLON
FF1B ( ; ) FULLWIDTH SEMICOLON
 

What this reveals is that the PRESENTATION FORM FOR VERTICAL COLON is a mistake; I propose that this be removed from the derivation.

If we remove that from MidNum, we get the more reasonable set of additions, which look safe to add. So we should consider adding them.

 
2024 ( ․ ) ONE DOT LEADER
FE50 ( ﹐ ) SMALL COMMA
FE52 ( ﹒ ) SMALL FULL STOP
FE54 ( ﹔ ) SMALL SEMICOLON
FF0C ( , ) FULLWIDTH COMMA
FF0E ( . ) FULLWIDTH FULL STOP
FF1B ( ; ) FULLWIDTH SEMICOLON
 

-------------------------

Additional issues with MidNum

3. MidNum is missing the following characters that CLDR has as numeric separators:

 
0027 ( ' ) APOSTROPHE
066C ( ٬ ) ARABIC THOUSANDS SEPARATOR
 

Adding the second one is unproblematical, and we should simply do it.

Adding the first one requires a bit more of a change, but one that would be good for other reasons. Right now, we don't recognize interior periods, so we break " U.S.A." into separate words when it shouldn't be. So we should consider the following change:

A. Adding a property MidNumLet, where we move:

 
0027 ( ' ) APOSTROPHE
002E ( . ) FULL STOP
2018 ( ' ) LEFT SINGLE QUOTATION MARK
2019 ( ' ) RIGHT SINGLE QUOTATION MARK
2024 ( ․ ) ONE DOT LEADER
FE52 ( ﹒ ) SMALL FULL STOP
FF07 ( ' ) FULLWIDTH APOSTROPHE
FF0E ( . ) FULLWIDTH FULL STOP

 
B. In the rules, allow these characters to bridge both alphabetic and numeric words, with:
 
-------------------------

4. In addition, the following are also sometimes used, or could be used, as numeric separators (we don't give much guidance as to the best choice in the standard):

 
0020 ( ) SPACE
00A0 (   ) NO-BREAK SPACE
2007 (   ) FIGURE SPACE
2008 (   ) PUNCTUATION SPACE
2009 (   ) THIN SPACE
202F (   ) NARROW NO-BREAK SPACE
 

If we had good reason to believe that if one of these only really occurred between digits in a single number, then we could add it. I don't have enough information to feel like a proposal for that is warranted, but others may. Short of that, we should at least document in the notes that some implementations may want to tailor MidNum to add some of these.

-------------------------

5. We should also consider adding the alternative forms of full stop to SentenceBreak=Aterm, ending up with:

 
002E ( . ) FULL STOP
2024 ( ․ ) ONE DOT LEADER
FE52 ( ﹒ ) SMALL FULL STOP
FF0E ( . ) FULLWIDTH FULL STOP