L2/07-384

Date: Mon, 15 Oct 2007 13:57:59 -0700
From: Andy Heninger
Subject: UAX 14, shorten lists of characters

A late UTC agenda item.  (If it's too late, postpone it until next time.)

UAX-14 includes complete lists of characters for many of the line breaking classes.  I propose that, where these lists contain more than a few characters, they be replaced by a few representative characters from the class, together with text referring to the data file for the complete list.

The issue is that maintaining the data in parallel in UAX-14 and the LineBreak.txt data file is an error-prone process that does not seem to add much value, and it can cause confusion about which lists are normative.  The data file is, and remains, normative.
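As a rough illustration of why the data file can serve as the single source, the sketch below derives both the complete membership of a class and a few representative characters from text in the LineBreak.txt layout (code point or range, a semicolon, the class, then a "#" comment).  The function names parse_line_break and representatives are hypothetical, and the inline SAMPLE is a small excerpt built from the lists in this message, not the real file.

```python
import re
from collections import defaultdict

# Illustrative excerpt in the LineBreak.txt layout; NOT the real data file.
# Class assignments are copied from the lists in this message.
SAMPLE = """\
# sample data for illustration only
0021;EX # EXCLAMATION MARK
003F;EX # QUESTION MARK
3001..3002;CL # IDEOGRAPHIC COMMA..IDEOGRAPHIC FULL STOP
FE50;CL # SMALL COMMA
FE56..FE57;EX # SMALL QUESTION MARK..SMALL EXCLAMATION MARK
"""

# One code point, or a first..last range, followed by ';' and the class name.
LINE_RE = re.compile(r"^([0-9A-F]{4,6})(?:\.\.([0-9A-F]{4,6}))?\s*;\s*(\w+)")


def parse_line_break(text):
    """Return {class name: [code points]} for LineBreak.txt-style text."""
    classes = defaultdict(list)
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        m = LINE_RE.match(line)
        if m is None:
            raise ValueError(f"unrecognized line: {raw!r}")
        first = int(m.group(1), 16)
        last = int(m.group(2) or m.group(1), 16)  # single code point if no range
        classes[m.group(3)].extend(range(first, last + 1))
    return dict(classes)


def representatives(classes, cls, count=3):
    """Pick the first few code points of a class, formatted as hex strings."""
    return [f"{cp:04X}" for cp in sorted(classes[cls])[:count]]


classes = parse_line_break(SAMPLE)
print(representatives(classes, "CL"))
print(len(classes["EX"]), "characters in EX")
```

With the real LineBreak.txt read from disk in place of SAMPLE, the same two calls would give the complete list for a class and a short list suitable for quoting in the text.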

Any individual characters that are specifically discussed in the text should remain listed.

Here are the lists, copied out of UAX-14, that could reasonably be shortened.

Breaking Spaces

1680     OGHAM SPACE MARK
2000    EN QUAD
2001    EM QUAD
2002    EN SPACE
2003    EM SPACE
2004    THREE-PER-EM SPACE
2005    FOUR-PER-EM SPACE
2006    SIX-PER-EM SPACE
2008    PUNCTUATION SPACE
2009    THIN SPACE
200A    HAIR SPACE
205F     MEDIUM MATHEMATICAL SPACE


Historic Word Separators

16EB     RUNIC SINGLE DOT PUNCTUATION
16EC     RUNIC MULTIPLE DOT PUNCTUATION
16ED     RUNIC CROSS PUNCTUATION
2056     THREE DOT PUNCTUATION
2058     FOUR DOT PUNCTUATION
2059     FIVE DOT PUNCTUATION
205A     TWO DOT PUNCTUATION
205B     FOUR DOT MARK
205D     TRICOLON
205E     VERTICAL FOUR DOTS
10100     AEGEAN WORD SEPARATOR LINE
10101     AEGEAN WORD SEPARATOR DOT
10102     AEGEAN CHECK MARK
1039F     UGARITIC WORD DIVIDER
103D0     OLD PERSIAN WORD DIVIDER
1091F     PHOENICIAN WORD DIVIDER
12470     CUNEIFORM PUNCTUATION SIGN OLD ASSYRIAN WORD DIVIDER

Dandas

0964     DEVANAGARI DANDA
0965     DEVANAGARI DOUBLE DANDA
0E5A     THAI CHARACTER ANGKHANKHU
0E5B     THAI CHARACTER KHOMUT
104A     MYANMAR SIGN LITTLE SECTION
104B     MYANMAR SIGN SECTION
1735     PHILIPPINE SINGLE PUNCTUATION
1736     PHILIPPINE DOUBLE PUNCTUATION
17D4     KHMER SIGN KHAN
17D5     KHMER SIGN BARIYOOSAN
1B5E     BALINESE CARIK SIKI
1B5F     BALINESE CARIK PAREREN
A8CE     SAURASHTRA DANDA
A8CF     SAURASHTRA DOUBLE DANDA
10A56     KHAROSHTHI PUNCTUATION DANDA
10A57     KHAROSHTHI PUNCTUATION DOUBLE DANDA

Tibetan

0F34     TIBETAN MARK BSDUS RTAGS
0F7F     TIBETAN SIGN RNAM BCAD
0F85     TIBETAN MARK PALUTA
0FBE     TIBETAN KU RU KHA
0FBF     TIBETAN KU RU KHA BZHI MIG CAN
0FD2     TIBETAN MARK NYIS TSHEG

Other Terminating Punctuation

1804     MONGOLIAN COLON
1805     MONGOLIAN FOUR DOTS
1808     MONGOLIAN MANCHU COMMA
1809     MONGOLIAN MANCHU FULL STOP
1B5A     BALINESE PANTI
1B5B     BALINESE PAMADA
1B5C     BALINESE WINDU
1B5D     BALINESE CARIK PAMUNGKAH
1B60     BALINESE PAMENENG
1C3B     LEPCHA PUNCTUATION TA-ROL
1C3C     LEPCHA PUNCTUATION NYET THYOOM TA-ROL
1C3D     LEPCHA PUNCTUATION CER-WA
1C3E     LEPCHA PUNCTUATION TSHOOK CER-WA
1C3F     LEPCHA PUNCTUATION TSHOOK
1C7E     OL CHIKI PUNCTUATION MUCAAD
1C7F     OL CHIKI PUNCTUATION DOUBLE MUCAAD
2CFA     COPTIC OLD NUBIAN DIRECT QUESTION MARK
2CFB     COPTIC OLD NUBIAN INDIRECT QUESTION MARK
2CFC     COPTIC OLD NUBIAN VERSE DIVIDER
2CFF     COPTIC MORPHOLOGICAL DIVIDER
2E0E..2E15     EDITORIAL CORONIS..UPWARDS ANCORA
2E17     OBLIQUE DOUBLE HYPHEN
A60D     VAI COMMA
A60F     VAI QUESTION MARK
A92E     KAYAH LI SIGN CWI
A92F     KAYAH LI SIGN SHYA
10A50     KHAROSHTHI PUNCTUATION DOT
10A51     KHAROSHTHI PUNCTUATION SMALL CIRCLE
10A52     KHAROSHTHI PUNCTUATION CIRCLE
10A53     KHAROSHTHI PUNCTUATION CRESCENT BAR
10A54     KHAROSHTHI PUNCTUATION MANGALAM
10A55     KHAROSHTHI PUNCTUATION LOTUS

Tibetan and Phags-Pa Head Letters

0F01     TIBETAN MARK GTER YIG MGO TRUNCATED A
0F02     TIBETAN MARK GTER YIG MGO -UM RNAM BCAD MA
0F03     TIBETAN MARK GTER YIG MGO -UM GTER TSHEG MA
0F04     TIBETAN MARK INITIAL YIG MGO MDUN MA
0F06     TIBETAN MARK CARET YIG MGO PHUR SHAD MA
0F07     TIBETAN MARK YIG MGO TSHEG SHAD MA
0F09     TIBETAN MARK BSKUR YIG MGO
0F0A     TIBETAN MARK BKA- SHOG YIG MGO
0FD0     TIBETAN MARK BSKA- SHOG GI MGO RGYAN
0FD1     TIBETAN MARK MNYAM YIG GI MGO RGYAN
0FD3     TIBETAN MARK INITIAL BRDA RNYING YIG MGO MDUN MA
A874     PHAGS-PA SINGLE HEAD MARK
A875     PHAGS-PA DOUBLE HEAD MARK

CL: Closing Punctuation (XB)

3001..3002     IDEOGRAPHIC COMMA..IDEOGRAPHIC FULL STOP
FE11     PRESENTATION FORM FOR VERTICAL IDEOGRAPHIC COMMA
FE12     PRESENTATION FORM FOR VERTICAL IDEOGRAPHIC FULL STOP
FE50    SMALL COMMA
FE52    SMALL FULL STOP
FF0C    FULLWIDTH COMMA
FF0E    FULLWIDTH FULL STOP
FF61    HALFWIDTH IDEOGRAPHIC FULL STOP
FF64    HALFWIDTH IDEOGRAPHIC COMMA


EX: Exclamation/Interrogation (XB)

0021    EXCLAMATION MARK
003F    QUESTION MARK
05C6     HEBREW PUNCTUATION NUN HAFUKHA
061B     ARABIC SEMICOLON
061E     ARABIC TRIPLE DOT PUNCTUATION MARK
061F     ARABIC QUESTION MARK
06D4     ARABIC FULL STOP
07F9     NKO EXCLAMATION MARK
0F0D     TIBETAN MARK SHAD
0F0E     TIBETAN MARK NYIS SHAD
0F0F     TIBETAN MARK TSHEG SHAD
0F10     TIBETAN MARK NYIS TSHEG SHAD
0F11     TIBETAN MARK RIN CHEN SPUNGS SHAD
0F14     TIBETAN MARK GTER TSHEG
1802     MONGOLIAN COMMA [was BA]
1803     MONGOLIAN FULL STOP [was BA]
1808     MONGOLIAN MANCHU COMMA [was BA]
1809     MONGOLIAN MANCHU FULL STOP [was BA]
1944     LIMBU EXCLAMATION MARK
1945     LIMBU QUESTION MARK
2762     HEAVY EXCLAMATION MARK ORNAMENT
2763     HEAVY HEART EXCLAMATION MARK ORNAMENT
2CF9     COPTIC OLD NUBIAN FULL STOP [was BA]
2CFE     COPTIC FULL STOP [was BA]
A60C     VAI SYLLABLE LENGTHENER
A60E     VAI FULL STOP
A876     PHAGS-PA MARK SHAD
A877     PHAGS-PA MARK DOUBLE SHAD
FE15     PRESENTATION FORM FOR VERTICAL EXCLAMATION MARK
FE16     PRESENTATION FORM FOR VERTICAL QUESTION MARK
FE56..FE57 SMALL QUESTION MARK..SMALL EXCLAMATION MARK
FF01    FULLWIDTH EXCLAMATION MARK
FF1F   FULLWIDTH QUESTION MARK

IS: Numeric Separator (Infix) (XB)

002C     COMMA
002E     FULL STOP
003A     COLON
003B     SEMICOLON
037E     GREEK QUESTION MARK (canonically equivalent to 003B)
0589     ARMENIAN FULL STOP
060C     ARABIC COMMA [moved from EX]
060D     ARABIC DATE SEPARATOR
07F8     NKO COMMA
2044     FRACTION SLASH
FE10     PRESENTATION FORM FOR VERTICAL COMMA
FE13     PRESENTATION FORM FOR VERTICAL COLON
FE14     PRESENTATION FORM FOR VERTICAL SEMICOLON

NS: Nonstarters (XB)

17D6    KHMER SIGN CAMNUC PII KUUH
203C    DOUBLE EXCLAMATION MARK
203D     INTERROBANG
2047     DOUBLE QUESTION MARK
2048     QUESTION EXCLAMATION MARK
2049     EXCLAMATION QUESTION MARK
3005    IDEOGRAPHIC ITERATION MARK
301C    WAVE DASH
303C     MASU MARK
303B     VERTICAL IDEOGRAPHIC ITERATION MARK
309B..309E KATAKANA-HIRAGANA VOICED SOUND MARK..HIRAGANA VOICED ITERATION MARK
30A0     KATAKANA-HIRAGANA DOUBLE HYPHEN
30FB..30FE KATAKANA MIDDLE DOT..KATAKANA VOICED ITERATION MARK
A015     YI SYLLABLE WU (misnomer for YI SYLLABLE ITERATION MARK)
FE54..FE55 SMALL SEMICOLON..SMALL COLON
FF1A..FF1B FULLWIDTH COLON..FULLWIDTH SEMICOLON
FF65 HALFWIDTH KATAKANA MIDDLE DOT
FF70 HALFWIDTH KATAKANA-HIRAGANA PROLONGED SOUND MARK
FF9E..FF9F     HALFWIDTH KATAKANA VOICED SOUND MARK..HALFWIDTH KATAKANA SEMI-VOICED SOUND MARK


PO: Postfix (Numeric) (XB)

0025    PERCENT SIGN
00A2    CENT SIGN
00B0    DEGREE SIGN
060B    AFGHANI SIGN
066A     ARABIC PERCENT SIGN [moved from EX]
2030    PER MILLE SIGN
2031    PER TEN THOUSAND SIGN
2032..2037 PRIME..REVERSED TRIPLE PRIME
20A7    PESETA SIGN
2103    DEGREE CELSIUS
2109    DEGREE FAHRENHEIT
FDFC    RIAL SIGN
FE6A    SMALL PERCENT SIGN
FF05    FULLWIDTH PERCENT SIGN
FFE0    FULLWIDTH CENT SIGN

PR: Prefix (Numeric) (XA)

002B    PLUS SIGN
005C    REVERSE SOLIDUS
00B1    PLUS-MINUS SIGN
2116    NUMERO SIGN
2212    MINUS SIGN
2213    MINUS-OR-PLUS SIGN

QU: Ambiguous Quotation (XB/XA)

0022    QUOTATION MARK
0027    APOSTROPHE
275B     HEAVY SINGLE TURNED COMMA QUOTATION MARK ORNAMENT
275C     HEAVY SINGLE COMMA QUOTATION MARK ORNAMENT
275D     HEAVY DOUBLE TURNED COMMA QUOTATION MARK ORNAMENT
275E     HEAVY DOUBLE COMMA QUOTATION MARK ORNAMENT