L2/04-405

Re TR29 Corrections
From: Mark Davis
Date: 2004-11-11

Action 99-A57 was erroneously marked as done, and thus the changes that it encompassed did not make it into the posted proposed update of UAX 29. The following are the extracted portions of the UAX that need to be changed so as to make the changes in 99-A57. In addition, generating the property files as per the UTC decision revealed cases where the properties were not orthogonal as defined, so their definitions needed to be adjusted.

Note that 99-A57 was created before the Katakana_or_hiragana script value was withdrawn, so the action had to be reinterpreted in that light.

This needs to be incorporated into a new public review of the UAX for Unicode 4.1.


Table 2. Default Word Boundaries

Boundary Property Values
Format General_Category = Format (Cf)
and not U+200C ZERO WIDTH NON-JOINER (ZWNJ)
and not U+200D ZERO WIDTH JOINER (ZWJ)
Katakana Script = KATAKANA, or
Any of the following:
U+3031 (〱) VERTICAL KANA REPEAT MARK
U+3032 (〲) VERTICAL KANA REPEAT WITH VOICED SOUND MARK
U+3033 (〳) VERTICAL KANA REPEAT MARK UPPER HALF
U+3034 (〴) VERTICAL KANA REPEAT WITH VOICED SOUND MARK UPPER HALF
U+3035 (〵) VERTICAL KANA REPEAT MARK LOWER HALF
U+309B (゛) KATAKANA-HIRAGANA VOICED SOUND MARK
U+309C (゜) KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK
U+30A0 (゠) KATAKANA-HIRAGANA DOUBLE HYPHEN

U+30FC (ー) KATAKANA-HIRAGANA PROLONGED SOUND MARK
U+FF70 (ｰ) HALFWIDTH KATAKANA-HIRAGANA PROLONGED SOUND MARK
U+FF9E (ﾞ) HALFWIDTH KATAKANA VOICED SOUND MARK
U+FF9F (ﾟ) HALFWIDTH KATAKANA SEMI-VOICED SOUND MARK
ALetter Alphabetic = true, or
U+00A0 ( ) NO-BREAK SPACE (NBSP) or
U+05F3 (׳) HEBREW PUNCTUATION GERESH
and not Ideographic = true
and not Katakana = true
and not Script = Thai
and not Script = Lao
and not Script = Hiragana
and not Grapheme_Extend = true
MidLetter Any of the following:
U+0027 (') APOSTROPHE
U+00B7 (·) MIDDLE DOT
U+05F4 (״) HEBREW PUNCTUATION GERSHAYIM
U+2019 (’) RIGHT SINGLE QUOTATION MARK (curly apostrophe)
U+2027 (‧) HYPHENATION POINT
U+003A (:) COLON (used in Swedish)
MidNumLet Any of the following:
U+002E (.) FULL STOP (period)
U+003A (:) COLON (used in Swedish)
MidNum Line_Break = Infix_Numeric
and not MidNumLet = true
and not U+003A (:) COLON
Numeric Line_Break = Numeric
ExtendNumLet General_Category=Connector_Punctuation
and not U+30FB KATAKANA MIDDLE DOT
and not U+FF65 HALFWIDTH KATAKANA MIDDLE DOT
Any Any character (includes all of the above)
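
The parts of the table above that depend only on General_Category or on explicitly listed code points can be sketched with the Python standard library. This is a partial illustration only: the function name `partial_word_class` is hypothetical, and rows that need Script, Alphabetic, or Line_Break data (which `unicodedata` does not expose) are deliberately left unclassified and returned as `None`.

```python
import unicodedata

# Code points the table lists explicitly under Katakana
# (in addition to Script = KATAKANA, which is not handled here).
EXPLICIT_KATAKANA = {
    0x3031, 0x3032, 0x3033, 0x3034, 0x3035,
    0x309B, 0x309C, 0x30A0, 0x30FC,
    0xFF70, 0xFF9E, 0xFF9F,
}
ZWNJ, ZWJ = 0x200C, 0x200D
KATAKANA_MIDDLE_DOTS = {0x30FB, 0xFF65}


def partial_word_class(ch):
    """Return 'Katakana', 'Format', 'ExtendNumLet', or None.

    None means "not decided by this sketch", not "Any".
    """
    cp = ord(ch)
    if cp in EXPLICIT_KATAKANA:
        return "Katakana"
    gc = unicodedata.category(ch)
    # Format: General_Category = Cf, excluding ZWNJ and ZWJ.
    if gc == "Cf" and cp not in (ZWNJ, ZWJ):
        return "Format"
    # ExtendNumLet: Connector_Punctuation, excluding the Katakana middle dots.
    if gc == "Pc" and cp not in KATAKANA_MIDDLE_DOTS:
        return "ExtendNumLet"
    return None
```

The explicit exclusions are checked by code point rather than by category, so the result does not depend on which category a given Unicode version assigns to the excluded characters.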

 

Boundary Rules

Assign each code point with line break property values of CB, SA, SG, and XX to one of the above boundary property values depending on criteria outside the scope of this algorithm. Characters with other line break properties are assigned values directly according to the above table.

(0)

Break at the start and end of text.

sot ÷ (1)
÷ eot (2)

Treat a grapheme cluster as if it were a single character: the first character of the cluster.

GC → FC (3)

Ignore trailing Format characters. That is, ignore Format characters in all subsequent rules (except the last rule).

X Format* → X (4)

Do not break between most letters.

ALetter × ALetter (5)

Do not break letters across certain punctuation.

ALetter × (MidLetter | MidNumLet) ALetter (6)
ALetter (MidLetter | MidNumLet) × ALetter (7)

Do not break within sequences of digits, or digits adjacent to letters ('3a', or 'A3').

Numeric × Numeric (8)
ALetter × Numeric (9)
Numeric × ALetter (10)

Do not break within sequences like: '3.2' or '3,456.789'.

Numeric (MidNum | MidNumLet) × Numeric (11)
Numeric × (MidNum | MidNumLet) Numeric (12)

Do not break between Katakana.

Katakana × Katakana (13)

Do not break from extenders.

(ALetter | Numeric | Katakana | ExtendNumLet) × ExtendNumLet (13a)
ExtendNumLet × (ALetter | Numeric | Katakana) (13b)

Otherwise, break everywhere (including around ideographs).

Any ÷ Any (14)
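
As an illustration of how rules 4 through 14 interact, the following is a minimal sketch that operates on a list of already-assigned boundary property values rather than on characters. It assumes grapheme clusters have been collapsed to one entry each (rule 3) and that every entry is one of the value names from Table 2; the function names `no_break` and `word_boundaries` are hypothetical.

```python
MID_LETTER = {"MidLetter", "MidNumLet"}
MID_NUM = {"MidNum", "MidNumLet"}
EXTEND_LEFT = {"ALetter", "Numeric", "Katakana", "ExtendNumLet"}
EXTEND_RIGHT = {"ALetter", "Numeric", "Katakana"}


def no_break(prev2, prev, cur, nxt):
    """True if one of rules 5-13b forbids a break between prev and cur."""
    if prev == "ALetter" and cur == "ALetter":                        # (5)
        return True
    if prev == "ALetter" and cur in MID_LETTER and nxt == "ALetter":  # (6)
        return True
    if prev2 == "ALetter" and prev in MID_LETTER and cur == "ALetter":  # (7)
        return True
    if prev == "Numeric" and cur == "Numeric":                        # (8)
        return True
    if prev == "ALetter" and cur == "Numeric":                        # (9)
        return True
    if prev == "Numeric" and cur == "ALetter":                        # (10)
        return True
    if prev2 == "Numeric" and prev in MID_NUM and cur == "Numeric":   # (11)
        return True
    if prev == "Numeric" and cur in MID_NUM and nxt == "Numeric":     # (12)
        return True
    if prev == "Katakana" and cur == "Katakana":                      # (13)
        return True
    if prev in EXTEND_LEFT and cur == "ExtendNumLet":                 # (13a)
        return True
    if prev == "ExtendNumLet" and cur in EXTEND_RIGHT:                # (13b)
        return True
    return False                                                      # (14)


def word_boundaries(props):
    """Return the boundary offsets (0..len(props)) for a list of values."""
    if not props:
        return [0]
    # Rule 4: a Format entry after any entry is absorbed into it, so it
    # neither starts a new segment nor takes part in the later rules.
    kept = [i for i, p in enumerate(props) if i == 0 or p != "Format"]
    vals = [props[i] for i in kept]
    bounds = [0]                                   # rule 1: sot ÷
    for k in range(1, len(kept)):
        prev2 = vals[k - 2] if k >= 2 else None
        nxt = vals[k + 1] if k + 1 < len(vals) else None
        if not no_break(prev2, vals[k - 1], vals[k], nxt):
            bounds.append(kept[k])
    bounds.append(len(props))                      # rule 2: ÷ eot
    return bounds
```

For example, the values for "can't" (ALetter ALetter ALetter MidLetter ALetter) yield only the start and end boundaries, while a trailing MidLetter with no letter after it is split off by rule 14.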

 


Table 3. Default Sentence Boundaries

Boundary Property Values
Sep Any of the following characters:
U+000A LINE FEED (LF)
U+000D CARRIAGE RETURN (CR)
U+0085 NEXT LINE (NEL)
U+2028 LINE SEPARATOR (LS)
U+2029 PARAGRAPH SEPARATOR (PS)
Format General_Category = Format (Cf)
and not U+200C ZERO WIDTH NON-JOINER (ZWNJ)
and not U+200D ZERO WIDTH JOINER (ZWJ)
Sp Whitespace = true
and not Sep = true
and not U+00A0 ( ) NO-BREAK SPACE (NBSP)
Lower Lowercase = true
and not Grapheme_Extend = true
Upper General_Category = Titlecase_Letter (Lt), or
Uppercase = true
OLetter Alphabetic = true, or
U+00A0 ( ) NO-BREAK SPACE (NBSP), or
U+05F3 (׳) HEBREW PUNCTUATION GERESH
and not Lower = true
and not Upper = true
and not Grapheme_Extend = true
Numeric Line_Break = Numeric (NU)
ATerm Any of the following characters:
U+002E (.) FULL STOP
STerm STerm = true
and not ATerm = true
Close General_Category = Open_Punctuation (Ps), or
General_Category = Close_Punctuation (Pe), or
Line_Break = Quotation (QU)
and not U+05F3 (׳) HEBREW PUNCTUATION GERESH
and not ATerm=true
and not STerm = true
Any Any character (includes all of the above)

Note: the extra condition in STerm should really be repaired by removing ATerm from the definition of STerm in the PropList file.
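
The orthogonality problem behind the "and not" clauses can be checked mechanically. The sketch below assumes each boundary value is represented as a set of code points; the helper name `overlaps` is hypothetical, and the sample sets are a small illustrative subset rather than the real property data.

```python
from itertools import combinations


def overlaps(classes):
    """Map each overlapping pair of class names to the code points they share."""
    found = {}
    for (name_a, set_a), (name_b, set_b) in combinations(classes.items(), 2):
        shared = set_a & set_b
        if shared:
            found[(name_a, name_b)] = shared
    return found


# Illustrative subset only: ATerm as defined in Table 3, and a few
# characters assumed here to carry the STerm property.
ATERM = {0x002E}
STERM_PROPERTY = {0x0021, 0x002E, 0x003F}

# Without the exclusion the two values collide on U+002E FULL STOP.
raw = overlaps({"ATerm": ATERM, "STerm": STERM_PROPERTY})

# Applying "and not ATerm = true" makes the values disjoint.
fixed = overlaps({"ATerm": ATERM, "STerm": STERM_PROPERTY - ATERM})
```

An empty result means the values partition cleanly, which is the condition the generated property files need to satisfy.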