L2/03-302

Feedback on UTR #31 draft

-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
Date/Time: Tue Aug 26 05:36:34 EDT 2003
Contact: peterkirk@qaya.org
Report Type: Technical Report or Tech Note issues

A follow-up to my earlier feedback on proposed draft UTR #31 in the light of discussions on the Unicode list:

I have serious problems with the concept of defining now an unchangeable list of syntax characters. My preference is for a list of syntax characters which can be added to but not subtracted from. I would also suggest that all punctuation characters and all undefined characters be reserved, i.e. they should not be used unquoted in strings, as they may be defined as syntax characters in later versions. Implementations would not be obliged to check for misuse of these reserved characters; it is up to the user to avoid them. (This kind of loose syntax may not be ideal, but it is common practice, e.g. with HTML, which most browsers do not fully validate. An implementation would be free to check against the list of reserved characters in the current UCD if preferred.) But a guarantee could be made that characters currently defined in Unicode as non-punctuation will never be defined as syntax characters.

The reason why a change is needed is mainly to avoid the ethnocentric definition of only Latin punctuation characters as valid syntax characters. I have also seen the serious problems which have resulted from premature freezing of inappropriate properties, e.g. the combining classes of Hebrew points.

-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
Date/Time: Tue Aug 26 06:53:47 EDT 2003
Contact: marco.cimarosti@europe.com
Report Type: Technical Report or Tech Note issues

Feedback on UTR#31 (draft 1): Full/Half-Width Characters.

I suggest that all compatibility characters which are labelled <wide>, <narrow> and <small>, and whose compatibility decomposition is already in class Pattern_Syntax, be added to class Pattern_Syntax as well.
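The rule just stated could be applied mechanically. What follows is only a minimal sketch in Python, using the standard unicodedata module to read compatibility decomposition tags. BASE_SYNTAX and width_variant_candidates are hypothetical names introduced for illustration, and BASE_SYNTAX is an assumed stand-in (ASCII punctuation and symbols only), not the draft's actual Pattern_Syntax set. The results also depend on the Unicode version bundled with the Python interpreter.

```python
import unicodedata

# Assumption: a stand-in for the base syntax class, limited to ASCII
# punctuation and symbols. The draft's real Pattern_Syntax set may differ.
BASE_SYNTAX = {chr(cp) for cp in range(0x21, 0x7F) if not chr(cp).isalnum()}

# Compatibility decomposition tags for width and size variants.
WIDTH_TAGS = ("<wide>", "<narrow>", "<small>")


def width_variant_candidates(base=BASE_SYNTAX, start=0xFE50, end=0xFFEF):
    """Return code points in [start, end] whose decomposition is tagged
    <wide>, <narrow> or <small> and maps to a single character in `base`."""
    found = []
    for cp in range(start, end + 1):
        # decomposition() returns e.g. "<wide> 002C", or "" if there is none.
        parts = unicodedata.decomposition(chr(cp)).split()
        if len(parts) == 2 and parts[0] in WIDTH_TAGS:
            target = chr(int(parts[1], 16))
            if target in base:
                found.append(cp)
    return found


if __name__ == "__main__":
    for cp in width_variant_candidates():
        print(f"{cp:04X} ; Pattern_Syntax # {unicodedata.name(chr(cp))}")
```

Because the assumed base set is ASCII-only, this sketch finds only a subset of the characters proposed here; characters whose decomposition lies outside ASCII (for example the halfwidth ideographic punctuation) would need a fuller base set.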
In practice, I suggest adding the following lines to section "4.1 Proposed Pattern Properties":

FE50..FE52 ; Pattern_Syntax # SMALL COMMA..SMALL FULL STOP
FE54..FE57 ; Pattern_Syntax # SMALL SEMICOLON..SMALL EXCLAMATION MARK
FE59..FE66 ; Pattern_Syntax # SMALL LEFT PARENTHESIS..SMALL EQUALS SIGN
FE68..FE6B ; Pattern_Syntax # SMALL REVERSE SOLIDUS..SMALL COMMERCIAL AT
FF01..FF0F ; Pattern_Syntax # FULLWIDTH EXCLAMATION MARK..FULLWIDTH SOLIDUS
FF1A..FF20 ; Pattern_Syntax # FULLWIDTH COLON..FULLWIDTH COMMERCIAL AT
FF3B..FF40 ; Pattern_Syntax # FULLWIDTH LEFT SQUARE BRACKET..FULLWIDTH GRAVE ACCENT
FF5B..FF5E ; Pattern_Syntax # FULLWIDTH LEFT CURLY BRACKET..FULLWIDTH TILDE
FF5F..FF61 ; Pattern_Syntax # FULLWIDTH LEFT WHITE PARENTHESIS..HALFWIDTH IDEOGRAPHIC FULL STOP
FF64       ; Pattern_Syntax # HALFWIDTH IDEOGRAPHIC COMMA
FFE0..FFE2 ; Pattern_Syntax # FULLWIDTH CENT SIGN..FULLWIDTH NOT SIGN
FFE4..FFE5 ; Pattern_Syntax # FULLWIDTH BROKEN BAR..FULLWIDTH YEN SIGN
FFE8..FFEE ; Pattern_Syntax # HALFWIDTH FORMS LIGHT VERTICAL..HALFWIDTH WHITE CIRCLE

Rationale. These characters are almost identical, visually and semantically, to their "normal width" counterparts. Allowing such characters in identifiers means allowing identifiers which look identical to expressions of a totally different kind. E.g., an identifier such as "foo,bar" (where "," is U+FF0C FULLWIDTH COMMA) would look identical to the expression "foo, bar" (identifier "foo" + comma + space + identifier "bar").

Regards.
Marco Cimarosti (marco.cimarosti@europe.com)

-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
Date/Time: Tue Aug 26 06:55:37 EDT 2003
Contact: marco.cimarosti@europe.com
Report Type: Technical Report or Tech Note issues

Feedback on UTR#31 (draft 1): Non-Latin Punctuation.

I suggest that a small set of non-Latin punctuation marks be added to class Pattern_Syntax.
Each one of the punctuation marks that I am suggesting to include complies with the following conditions:

1) It is very similar in shape to an ASCII-range character which is already in class Pattern_Syntax;
2) It is very similar in function to an ASCII-range character which is already in class Pattern_Syntax;
3) It is used in the modern orthography of modern languages and/or it is commonly available on national keyboards;
4) It is not commonly used to form words or phrases which may be used as identifiers.

In practice, I suggest adding the following lines to section "4.1 Proposed Pattern Properties":

037E       ; Pattern_Syntax # GREEK QUESTION MARK
0387       ; Pattern_Syntax # GREEK ANO TELEIA
055C..055E ; Pattern_Syntax # ARMENIAN EXCLAMATION MARK..ARMENIAN QUESTION MARK
0589       ; Pattern_Syntax # ARMENIAN FULL STOP
05C0       ; Pattern_Syntax # HEBREW PUNCTUATION PASEQ
05C3       ; Pattern_Syntax # HEBREW PUNCTUATION SOF PASUQ
060C..060D ; Pattern_Syntax # ARABIC COMMA..ARABIC DATE SEPARATOR
061B       ; Pattern_Syntax # ARABIC SEMICOLON
061F       ; Pattern_Syntax # ARABIC QUESTION MARK
066A..066C ; Pattern_Syntax # ARABIC PERCENT SIGN..ARABIC THOUSANDS SEPARATOR
06D4       ; Pattern_Syntax # ARABIC FULL STOP
066D       ; Pattern_Syntax # ARABIC FIVE POINTED STAR
0964..0965 ; Pattern_Syntax # DEVANAGARI DANDA..DEVANAGARI DOUBLE DANDA
10FB       ; Pattern_Syntax # GEORGIAN PARAGRAPH SEPARATOR
1362..1368 ; Pattern_Syntax # ETHIOPIC FULL STOP..ETHIOPIC PARAGRAPH SEPARATOR

Rationale. Punctuation marks complying with conditions #1 to #3 may easily be confused with ASCII-range characters which are normally used in the syntax of computer languages and notations. Allowing such characters in identifiers would mean allowing identifiers which look almost identical to expressions of a totally different kind. E.g., an identifier such as "return;" (where ";" is U+037E GREEK QUESTION MARK) looks identical to the expression "return;" (identifier or keyword "return" + semicolon).

However, punctuation marks which fail condition #4 (e.g. syllable separators, morpheme separators, abbreviation marks, diacritic marks, apostrophes) are excluded from my suggestion (i.e. I suggest allowing them in identifiers) because they are useful to form words or phrases which may act as identifiers.

Character-by-character rationale. In the following list, each suggested character is shown along with the ASCII-range character which looks similar to it (as per condition #1 above) and the ASCII-range character which has a similar function to it (as per condition #2).

Code  Cnd.#1  Cnd.#2  Character name
037E  ;       ?       GREEK QUESTION MARK
0387  .       ;       GREEK ANO TELEIA
055C  ~       !       ARMENIAN EXCLAMATION MARK
055D  `       ,       ARMENIAN COMMA
055E  ^       ?       ARMENIAN QUESTION MARK
0589  :       .       ARMENIAN FULL STOP
05C0  |       ;       HEBREW PUNCTUATION PASEQ
05C3  :       .       HEBREW PUNCTUATION SOF PASUQ
060C  ,       ,       ARABIC COMMA
060D  ,       ,       ARABIC DATE SEPARATOR
061B  ;       ;       ARABIC SEMICOLON
061F  ?       ?       ARABIC QUESTION MARK
066A  %       %       ARABIC PERCENT SIGN
066B  ,       .       ARABIC DECIMAL SEPARATOR
066C  ,       ,       ARABIC THOUSANDS SEPARATOR
06D4  _       .       ARABIC FULL STOP
066D  *       *       ARABIC FIVE POINTED STAR
0964  |       .       DEVANAGARI DANDA
0965  |       .       DEVANAGARI DOUBLE DANDA
10FB  :       :       GEORGIAN PARAGRAPH SEPARATOR
1362  :       .       ETHIOPIC FULL STOP
1363  :       ,       ETHIOPIC COMMA
1364  :       ;       ETHIOPIC SEMICOLON
1365  :       :       ETHIOPIC COLON
1366  :       :       ETHIOPIC PREFACE COLON
1367  |       ?       ETHIOPIC QUESTION MARK
1368  :       .       ETHIOPIC PARAGRAPH SEPARATOR

Regards.
Marco Cimarosti (marco.cimarosti@europe.com)
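
To illustrate the "return;" example, here is a minimal sketch in Python of how an implementation might flag the proposed characters inside a candidate identifier. The ranges are copied from the suggested list above; the names PROPOSED_RANGES, is_proposed_syntax and find_proposed_syntax are hypothetical and introduced only for this illustration. This is one possible check under those assumptions, not a description of any existing implementation.

```python
# Code point ranges copied from the suggested Pattern_Syntax additions above.
PROPOSED_RANGES = [
    (0x037E, 0x037E), (0x0387, 0x0387), (0x055C, 0x055E), (0x0589, 0x0589),
    (0x05C0, 0x05C0), (0x05C3, 0x05C3), (0x060C, 0x060D), (0x061B, 0x061B),
    (0x061F, 0x061F), (0x066A, 0x066C), (0x06D4, 0x06D4), (0x066D, 0x066D),
    (0x0964, 0x0965), (0x10FB, 0x10FB), (0x1362, 0x1368),
]


def is_proposed_syntax(ch):
    """True if the character falls in one of the proposed ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in PROPOSED_RANGES)


def find_proposed_syntax(identifier):
    """Return (index, 'U+XXXX') pairs for every proposed syntax
    character found in the candidate identifier."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(identifier)
        if is_proposed_syntax(ch)
    ]


if __name__ == "__main__":
    # "return" followed by U+037E GREEK QUESTION MARK, which looks like ";".
    print(find_proposed_syntax("return\u037e"))
    # Plain ASCII identifier: nothing is flagged.
    print(find_proposed_syntax("return"))
```

With a check of this kind, a parser could reject the first string as an identifier while still accepting identifiers that contain the word-forming marks excluded under condition #4.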