L2/00-387

Proposed changes to PropList.txt

From: Ken Whistler, Mark Davis
Date: 2000-10-27

We propose that the UTC adopt a new version of PropList.txt for inclusion in the next version of the standard. Please look this over so that you are prepared for the discussion at the meeting.

Up until now, PropList.txt has been simply a dump of Sybase's implementation data. From a broader perspective, there are a number of problems with the current file:

1. The file is not in a standard semicolon-delimited format. Nor is it in a convenient machine-readable format: not all of the comment lines are marked as comments, and information has to be dug out of the text.
2. The data has not been vetted by the UTC.
3. There are also cases where properties that should have been (at least partially) aligned with the general category have gotten out of sync.
4. There are redundant properties (such as copies of the BIDI properties).

Yet there is a lot of valuable data in the file, and the file provides a mechanism for adding more information to the standard about the usage of characters.

To address this, we have analysed the old data and produced a new version of PropList.txt, which is placed on the alpha site for Unicode 3.1 for discussion:

http://www.unicode.org/Public/3.1-Update/PropList-3.1.0d1.alpha.txt

[Note: there is one known error in the file: the property Format_Control is redundant and should have been deleted. It will be removed in the next rev.]

The new file has these characteristics:

0. The format is standardized, and now machine-readable.
1. Almost all redundant properties have been dropped. (There are three remaining redundant properties: Identifier_Part_Not_Cf, Private_Use, Unassigned_Code_Point. Composite may or may not be, depending on its exact composition.)
2. Some properties are described as Other_X, where the X property is readily derived from other properties. This reduces the size of the file, and makes the data more maintainable.
3. Data has been fixed where synchronization errors crept in.
4. In the format, names have been added, and ranges that span different general categories have been broken apart, making it easier to see and understand the contents.

What remains to be done is to provide documentation of the meaning of the different properties. There are some open issues, and we leave them to discussion in the UTC. Please look over the data files, and consider the following issues for the meeting:

a. Should ZERO-WIDTH SPACE be in Whitespace?
b. Should the Hangul Jamo from 2.0 that we "Uncomposed" in 3.0 be added to Composite?
c. Should letters with strokes, hooks, or descenders be added to Composite?
d. Should all circled items be Composite?
e. Should hex digits include full-width variants?
f. Should the remaining 4 redundant properties be removed? If not, should we add Identifier_Start?
g. Should we add the following separate properties for maintainability? The larger categories would then be computed (if we leave them in). The proposed contents are listed at the end of this document.
     Other_Math (current Math - Sm)
     Other_Alphabetic (current Alphabetic - Ll - Lu - Lo - Nl)
h. Should we flag the derived properties in the following way?
   h1. Add an extra field to each line, which is "D;" for derived and "I;" for independent.
   h2. Set all lines to have "I" except for the following sections (assuming they are not removed in (f)). One example line is given for each. Add a comment at the top of each section, saying what it is derived from.

### Math ###
# Derived from Other_Math + Sm
0028; ; Math; D; # Ps; 1; LEFT PARENTHESIS
...

### Composite ###
# Derived from Other_Composite + set of characters x such that:
# {decomposition length of x > 1 && canon_decomp(x) != SPACE + Cm + Cm*}
00BC; 00BE; Composite; D; # No; 3; VULGAR FRACTION ONE QUARTER..
...

### Alphabetic ###
# Derived from Other_Alphabetic + Ll + Lu + Lo + Nl
0041; 005A; Alphabetic; D; # L&; 26; LATIN CAPITAL LETTER A..
...

### Identifier_Part_Not_Cf ###
# Derived from Ll + Lu + Lo + Lt + Lm + Nl + Mn + Mc + Nd + Pc
0030; 0039; Identifier_Part_Not_Cf; D; # Nd; 10; DIGIT ZERO..
...

### Private_Use ###
# Derived from Co
E000; F8FF; Private_Use; D; # Co; 6400; ..
...

### Unassigned_Code_Point ###
# Derived from Cn
0220; 0221; Unassigned_Code_Point; D; # Cn; 2;
...

===============

The following are the Other_Math and Other_Alphabetic, created by subtraction.

0028; 002A; Other_Math; # Ps; 3; LEFT PARENTHESIS...
002D; ; Other_Math; # Pd; 1; HYPHEN-MINUS
002F; ; Other_Math; # Po; 1; SOLIDUS
005B; 005E; Other_Math; # Ps; 4; LEFT SQUARE BRACKET...
007B; ; Other_Math; # Ps; 1; LEFT CURLY BRACKET
007D; ; Other_Math; # Pe; 1; RIGHT CURLY BRACKET
2016; ; Other_Math; # Po; 1; DOUBLE VERTICAL LINE
2032; 2034; Other_Math; # Po; 3; PRIME...
207D; 207E; Other_Math; # Ps; 2; SUPERSCRIPT LEFT PARENTHESIS...
208D; 208E; Other_Math; # Ps; 2; SUBSCRIPT LEFT PARENTHESIS...
20D0; 20DC; Other_Math; # Mn; 13; COMBINING LEFT HARPOON ABOVE...
20E1; ; Other_Math; # Mn; 1; COMBINING LEFT RIGHT ARROW ABOVE
2329; 232A; Other_Math; # Ps; 2; LEFT-POINTING ANGLE BRACKET...
300A; 300B; Other_Math; # Ps; 2; LEFT DOUBLE ANGLE BRACKET...
301A; 301B; Other_Math; # Ps; 2; LEFT WHITE SQUARE BRACKET...
FE35; FE38; Other_Math; # Ps; 4; PRESENTATION FORM FOR VERTICAL LEFT PARENTHESIS...
FE59; FE5C; Other_Math; # Ps; 4; SMALL LEFT PARENTHESIS...
FE61; ; Other_Math; # Po; 1; SMALL ASTERISK
FE63; ; Other_Math; # Pd; 1; SMALL HYPHEN-MINUS
FE68; ; Other_Math; # Po; 1; SMALL REVERSE SOLIDUS
FF08; FF0A; Other_Math; # Ps; 3; FULLWIDTH LEFT PARENTHESIS...
FF0D; ; Other_Math; # Pd; 1; FULLWIDTH HYPHEN-MINUS
FF0F; ; Other_Math; # Po; 1; FULLWIDTH SOLIDUS
FF3B; FF3E; Other_Math; # Ps; 4; FULLWIDTH LEFT SQUARE BRACKET...
FF5B; ; Other_Math; # Ps; 1; FULLWIDTH LEFT CURLY BRACKET
FF5D; ; Other_Math; # Pe; 1; FULLWIDTH RIGHT CURLY BRACKET

02B0; 02B8; Other_Alphabetic; # Lm; 9; MODIFIER LETTER SMALL H...
02BB; 02C1; Other_Alphabetic; # Lm; 7; MODIFIER LETTER TURNED COMMA...
02E0; 02E4; Other_Alphabetic; # Lm; 5; MODIFIER LETTER SMALL GAMMA...
02EE; ; Other_Alphabetic; # Lm; 1; MODIFIER LETTER DOUBLE APOSTROPHE
0345; ; Other_Alphabetic; # Mn; 1; COMBINING GREEK YPOGEGRAMMENI
037A; ; Other_Alphabetic; # Lm; 1; GREEK YPOGEGRAMMENI
0559; ; Other_Alphabetic; # Lm; 1; ARMENIAN MODIFIER LETTER LEFT HALF RING
05B0; 05B9; Other_Alphabetic; # Mn; 10; HEBREW POINT SHEVA...
05BB; 05BD; Other_Alphabetic; # Mn; 3; HEBREW POINT QUBUTS...
05BF; ; Other_Alphabetic; # Mn; 1; HEBREW POINT RAFE
05C1; 05C2; Other_Alphabetic; # Mn; 2; HEBREW POINT SHIN DOT...
05C4; ; Other_Alphabetic; # Mn; 1; HEBREW MARK UPPER DOT
064B; 0655; Other_Alphabetic; # Mn; 11; ARABIC FATHATAN...
0670; ; Other_Alphabetic; # Mn; 1; ARABIC LETTER SUPERSCRIPT ALEF
06D6; 06DC; Other_Alphabetic; # Mn; 7; ARABIC SMALL HIGH LIGATURE SAD WITH LAM WITH ALEF MAKSURA...
06E1; 06E8; Other_Alphabetic; # Mn; 8; ARABIC SMALL HIGH DOTLESS HEAD OF KHAH...
06ED; ; Other_Alphabetic; # Mn; 1; ARABIC SMALL LOW MEEM
0711; ; Other_Alphabetic; # Mn; 1; SYRIAC LETTER SUPERSCRIPT ALAPH
0730; 073F; Other_Alphabetic; # Mn; 16; SYRIAC PTHAHA ABOVE...
07A6; 07B0; Other_Alphabetic; # Mn; 11; THAANA ABAFILI...
0901; 0903; Other_Alphabetic; # Mn; 3; DEVANAGARI SIGN CANDRABINDU...
093E; 094C; Other_Alphabetic; # Mc; 15; DEVANAGARI VOWEL SIGN AA...
0962; 0963; Other_Alphabetic; # Mn; 2; DEVANAGARI VOWEL SIGN VOCALIC L...
0981; 0983; Other_Alphabetic; # Mn; 3; BENGALI SIGN CANDRABINDU...
09BE; 09C4; Other_Alphabetic; # Mc; 7; BENGALI VOWEL SIGN AA...
09C7; 09C8; Other_Alphabetic; # Mc; 2; BENGALI VOWEL SIGN E...
09CB; 09CC; Other_Alphabetic; # Mc; 2; BENGALI VOWEL SIGN O...
09D7; ; Other_Alphabetic; # Mc; 1; BENGALI AU LENGTH MARK
09E2; 09E3; Other_Alphabetic; # Mn; 2; BENGALI VOWEL SIGN VOCALIC L...
0A02; ; Other_Alphabetic; # Mn; 1; GURMUKHI SIGN BINDI
0A3E; 0A42; Other_Alphabetic; # Mc; 5; GURMUKHI VOWEL SIGN AA...
0A47; 0A48; Other_Alphabetic; # Mn; 2; GURMUKHI VOWEL SIGN EE...
0A4B; 0A4C; Other_Alphabetic; # Mn; 2; GURMUKHI VOWEL SIGN OO...
0A70; 0A71; Other_Alphabetic; # Mn; 2; GURMUKHI TIPPI...
0A81; 0A83; Other_Alphabetic; # Mn; 3; GUJARATI SIGN CANDRABINDU...
0ABE; 0AC5; Other_Alphabetic; # Mc; 8; GUJARATI VOWEL SIGN AA...
0AC7; 0AC9; Other_Alphabetic; # Mn; 3; GUJARATI VOWEL SIGN E...
0ACB; 0ACC; Other_Alphabetic; # Mc; 2; GUJARATI VOWEL SIGN O...
0B01; 0B03; Other_Alphabetic; # Mn; 3; ORIYA SIGN CANDRABINDU...
0B3E; 0B43; Other_Alphabetic; # Mc; 6; ORIYA VOWEL SIGN AA...
0B47; 0B48; Other_Alphabetic; # Mc; 2; ORIYA VOWEL SIGN E...
0B4B; 0B4C; Other_Alphabetic; # Mc; 2; ORIYA VOWEL SIGN O...
0B56; 0B57; Other_Alphabetic; # Mn; 2; ORIYA AI LENGTH MARK...
0B82; 0B83; Other_Alphabetic; # Mn; 2; TAMIL SIGN ANUSVARA...
0BBE; 0BC2; Other_Alphabetic; # Mc; 5; TAMIL VOWEL SIGN AA...
0BC6; 0BC8; Other_Alphabetic; # Mc; 3; TAMIL VOWEL SIGN E...
0BCA; 0BCC; Other_Alphabetic; # Mc; 3; TAMIL VOWEL SIGN O...
0BD7; ; Other_Alphabetic; # Mc; 1; TAMIL AU LENGTH MARK
0C01; 0C03; Other_Alphabetic; # Mc; 3; TELUGU SIGN CANDRABINDU...
0C3E; 0C44; Other_Alphabetic; # Mn; 7; TELUGU VOWEL SIGN AA...
0C46; 0C48; Other_Alphabetic; # Mn; 3; TELUGU VOWEL SIGN E...
0C4A; 0C4C; Other_Alphabetic; # Mn; 3; TELUGU VOWEL SIGN O...
0C55; 0C56; Other_Alphabetic; # Mn; 2; TELUGU LENGTH MARK...
0C82; 0C83; Other_Alphabetic; # Mc; 2; KANNADA SIGN ANUSVARA...
0CBE; 0CC4; Other_Alphabetic; # Mc; 7; KANNADA VOWEL SIGN AA...
0CC6; 0CC8; Other_Alphabetic; # Mn; 3; KANNADA VOWEL SIGN E...
0CCA; 0CCC; Other_Alphabetic; # Mc; 3; KANNADA VOWEL SIGN O...
0CD5; 0CD6; Other_Alphabetic; # Mc; 2; KANNADA LENGTH MARK...
0D02; 0D03; Other_Alphabetic; # Mc; 2; MALAYALAM SIGN ANUSVARA...
0D3E; 0D43; Other_Alphabetic; # Mc; 6; MALAYALAM VOWEL SIGN AA...
0D46; 0D48; Other_Alphabetic; # Mc; 3; MALAYALAM VOWEL SIGN E...
0D4A; 0D4C; Other_Alphabetic; # Mc; 3; MALAYALAM VOWEL SIGN O...
0D57; ; Other_Alphabetic; # Mc; 1; MALAYALAM AU LENGTH MARK
0D82; 0D83; Other_Alphabetic; # Mc; 2; SINHALA SIGN ANUSVARAYA...
0DCF; 0DD4; Other_Alphabetic; # Mc; 6; SINHALA VOWEL SIGN AELA-PILLA...
0DD6; ; Other_Alphabetic; # Mn; 1; SINHALA VOWEL SIGN DIGA PAA-PILLA
0DD8; 0DDF; Other_Alphabetic; # Mc; 8; SINHALA VOWEL SIGN GAETTA-PILLA...
0DF2; 0DF3; Other_Alphabetic; # Mc; 2; SINHALA VOWEL SIGN DIGA GAETTA-PILLA...
0E31; ; Other_Alphabetic; # Mn; 1; THAI CHARACTER MAI HAN-AKAT
0E34; 0E3A; Other_Alphabetic; # Mn; 7; THAI CHARACTER SARA I...
0E4D; ; Other_Alphabetic; # Mn; 1; THAI CHARACTER NIKHAHIT
0EB1; ; Other_Alphabetic; # Mn; 1; LAO VOWEL SIGN MAI KAN
0EB4; 0EB9; Other_Alphabetic; # Mn; 6; LAO VOWEL SIGN I...
0EBB; 0EBC; Other_Alphabetic; # Mn; 2; LAO VOWEL SIGN MAI KON...
0ECD; ; Other_Alphabetic; # Mn; 1; LAO NIGGAHITA
0F71; 0F81; Other_Alphabetic; # Mn; 17; TIBETAN VOWEL SIGN AA...
0F90; 0F97; Other_Alphabetic; # Mn; 8; TIBETAN SUBJOINED LETTER KA...
0F99; 0FBC; Other_Alphabetic; # Mn; 36; TIBETAN SUBJOINED LETTER NYA...
102C; 1032; Other_Alphabetic; # Mc; 7; MYANMAR VOWEL SIGN AA...
1036; ; Other_Alphabetic; # Mn; 1; MYANMAR SIGN ANUSVARA
1038; ; Other_Alphabetic; # Mc; 1; MYANMAR SIGN VISARGA
1056; 1059; Other_Alphabetic; # Mc; 4; MYANMAR VOWEL SIGN VOCALIC R...
17B4; 17C8; Other_Alphabetic; # Mc; 21; KHMER VOWEL INHERENT AQ...
18A9; ; Other_Alphabetic; # Mn; 1; MONGOLIAN LETTER ALI GALI DAGALGA
FB1E; ; Other_Alphabetic; # Mn; 1; HEBREW POINT JUDEO-SPANISH VARIKA
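
For illustration only, the line format used in this document and the derivation rule "Math = Other_Math + Sm" from issue (h) can be sketched in Python. This is a minimal sketch and not part of the proposal: the names PropRange, parse_line, expand and derive_math are hypothetical, the sample lines are copied from the Other_Math list, and general categories are looked up with Python's unicodedata module, which reflects whatever Unicode version ships with the interpreter rather than Unicode 3.1.

```python
import unicodedata
from typing import Iterable, NamedTuple, Optional, Set


class PropRange(NamedTuple):
    start: int           # first code point of the range
    end: int             # last code point (equal to start for a single character)
    prop: str            # property name, e.g. "Other_Math"
    flag: Optional[str]  # "D" or "I" if the extra field from (h1) is present


def parse_line(line: str) -> Optional[PropRange]:
    """Parse one data line; return None for comments, headers, and blank lines."""
    data = line.split("#", 1)[0]  # drop the trailing comment fields
    fields = [f.strip() for f in data.split(";")]
    if len(fields) < 3 or not fields[2]:
        return None
    try:
        start = int(fields[0], 16)
        end = int(fields[1], 16) if fields[1] else start
    except ValueError:
        return None  # first fields are not hex code points: not a data line
    flag = fields[3] if len(fields) > 3 and fields[3] else None
    return PropRange(start, end, fields[2], flag)


def expand(ranges: Iterable[PropRange]) -> Set[int]:
    """Turn inclusive ranges into a flat set of code points."""
    return {cp for r in ranges for cp in range(r.start, r.end + 1)}


def derive_math(other_math: Set[int], code_points: Iterable[int]) -> Set[int]:
    """Math = Other_Math + Sm, restricted to the given code points for Sm."""
    sm = {cp for cp in code_points if unicodedata.category(chr(cp)) == "Sm"}
    return other_math | sm


# Sample lines copied from the Other_Math list above.
SAMPLE = """\
0028; 002A; Other_Math; # Ps; 3; LEFT PARENTHESIS...
002D; ; Other_Math; # Pd; 1; HYPHEN-MINUS
002F; ; Other_Math; # Po; 1; SOLIDUS
"""

parsed = [r for r in map(parse_line, SAMPLE.splitlines()) if r is not None]
other_math = expand(r for r in parsed if r.prop == "Other_Math")
math_ascii = derive_math(other_math, range(0x80))  # ASCII only, to keep it small
print(" ".join(f"{cp:04X}" for cp in sorted(math_ascii)))
```

Storing only the Other_Math residue and recomputing Math from the general category is the maintainability argument behind issue (g): when a new Sm character is added, the derived set picks it up without any edit to the property list.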