L2/03-302

Feedback on UTR #31 draft

-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --

Date/Time:    Tue Aug 26 05:36:34 EDT 2003
Contact:      peterkirk@qaya.org
Report Type:  Technical Report or Tech Note issues

A follow-up to my earlier feedback on proposed draft UTR #31 in the light  
of discussions on the Unicode list:

I have serious problems with the concept of defining now an unchangeable  
list of syntax characters. My preference is for a list of syntax characters  
which can be added to but not subtracted from.

I would also suggest that all punctuation characters and all undefined  
characters be reserved, i.e. they should not be used unquoted in strings, as  
they may be defined as syntax characters in later versions. Implementations  
would not be obliged to check for misuse of these reserved characters; it  
is up to the user to avoid them. (This kind of loose syntax may not be  
ideal but it is common practice e.g. with HTML which most browsers do not  
fully validate. An implementation would be free to check against the list  
of reserved characters in the current UCD if preferred.) But a guarantee  
could be made that characters currently defined in Unicode as  
non-punctuation will never be defined as syntax characters.
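
The reservation rule suggested above can be sketched roughly as follows. This is only an illustration under stated assumptions: "punctuation" is read as the General_Category values beginning with P, "undefined" as unassigned code points (Cn), and the function names are hypothetical. The result depends on the version of the UCD bundled with the Python interpreter.

```python
import unicodedata

def is_reserved(ch):
    """Hypothetical check: punctuation (P*) or unassigned (Cn) counts as reserved.

    Assumption: symbols such as '+' (category Sm) are not covered, because
    the feedback speaks only of punctuation and undefined characters.
    """
    category = unicodedata.category(ch)
    return category.startswith("P") or category == "Cn"

def reserved_characters(text):
    """Return the reserved characters found in an unquoted string, in order."""
    return [ch for ch in text if is_reserved(ch)]

# An optional check of the kind an implementation would be free to perform.
print(reserved_characters("abc,def"))
```

An implementation that skips this check remains conformant under the suggestion; the check is a convenience, not an obligation.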

The reason why a change is needed is mainly to avoid the ethnocentric  
definition of only Latin punctuation characters as valid syntax characters.  
I have also seen the serious problems which have resulted from  
premature freezing of inappropriate properties e.g. the combining classes  
of Hebrew points.

-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --

Date/Time:    Tue Aug 26 06:53:47 EDT 2003
Contact:      marco.cimarosti@europe.com
Report Type:  Technical Report or Tech Note issues

Feedback on UTR#31 (draft 1): Full/Half-Width Characters.

I suggest that all compatibility characters which are labelled <wide>,  
<narrow> or <small>, and whose compatibility decomposition is already in  
class <Pattern_Syntax>, be added to class <Pattern_Syntax> as well.

In practice, I am suggesting to add the following lines to section "4.1  
Proposed Pattern Properties":

	FE50..FE52 ; Pattern_Syntax # SMALL COMMA..SMALL FULL STOP
	FE54..FE57 ; Pattern_Syntax # SMALL SEMICOLON..SMALL EXCLAMATION MARK
	FE59..FE66 ; Pattern_Syntax # SMALL LEFT PARENTHESIS..SMALL EQUALS SIGN
	FE68..FE6B ; Pattern_Syntax # SMALL REVERSE SOLIDUS..SMALL COMMERCIAL AT
	FF01..FF0F ; Pattern_Syntax # FULLWIDTH EXCLAMATION MARK..FULLWIDTH SOLIDUS
	FF1A..FF20 ; Pattern_Syntax # FULLWIDTH COLON..FULLWIDTH COMMERCIAL AT
	FF3B..FF40 ; Pattern_Syntax # FULLWIDTH LEFT SQUARE BRACKET..FULLWIDTH GRAVE ACCENT
	FF5B..FF5E ; Pattern_Syntax # FULLWIDTH LEFT CURLY BRACKET..FULLWIDTH TILDE
	FF5F..FF61 ; Pattern_Syntax # FULLWIDTH LEFT WHITE PARENTHESIS..HALFWIDTH IDEOGRAPHIC FULL STOP
	FF64       ; Pattern_Syntax # HALFWIDTH IDEOGRAPHIC COMMA
	FFE0..FFE2 ; Pattern_Syntax # FULLWIDTH CENT SIGN..FULLWIDTH NOT SIGN
	FFE4..FFE5 ; Pattern_Syntax # FULLWIDTH BROKEN BAR..FULLWIDTH YEN SIGN
	FFE8..FFEE ; Pattern_Syntax # HALFWIDTH FORMS LIGHT VERTICAL..HALFWIDTH WHITE CIRCLE
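
The selection rule behind the lines above (a width or small variant whose compatibility decomposition is already a syntax character) can be approximated with Python's `unicodedata.decomposition`, which returns strings such as `<wide> 002C`. In this sketch, `BASE_SYNTAX` is a hypothetical stand-in restricted to ASCII punctuation and symbols, not the draft's actual Pattern_Syntax set, so it finds fewer characters than the list above (for example, it misses variants whose decomposition lies outside ASCII).

```python
import unicodedata

# Assumption: rough stand-in for the base Pattern_Syntax set (ASCII only).
BASE_SYNTAX = {chr(cp) for cp in range(0x21, 0x7F) if not chr(cp).isalnum()}

WIDTH_TAGS = ("<wide>", "<narrow>", "<small>")

def width_variant_of_syntax(ch):
    """True if ch decomposes, via <wide>/<narrow>/<small>, to one base syntax character."""
    parts = unicodedata.decomposition(ch).split()
    if len(parts) != 2 or parts[0] not in WIDTH_TAGS:
        return False
    return chr(int(parts[1], 16)) in BASE_SYNTAX

# Scan the two blocks the proposal draws from (Small Form Variants,
# Halfwidth and Fullwidth Forms) and collect the matching code points.
found = [cp for cp in range(0xFE50, 0xFFF0) if width_variant_of_syntax(chr(cp))]
print(" ".join(f"{cp:04X}" for cp in found))
```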

Rationale. These characters are almost identical, visually and  
semantically, to their "normal width" counterparts. Allowing such  
characters in identifiers means allowing identifiers which look identical  
to expressions of a totally different kind. E.g., an identifier such as  
"foo，bar" (where "，" is U+FF0C FULLWIDTH COMMA) would look identical  
to the expression "foo, bar" (identifier "foo" + comma + space + identifier  
"bar").
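
A small sketch of the problem described in the rationale, using a hypothetical toy tokenizer in which only the ASCII comma and whitespace separate tokens. The tokenizer is an assumption made for illustration and is not taken from the draft.

```python
import re
import unicodedata

def tokens(text):
    """Toy tokenizer (hypothetical): split on ASCII commas and whitespace only."""
    return [t for t in re.split(r"[,\s]+", text) if t]

fullwidth = "foo\uff0cbar"  # U+FF0C FULLWIDTH COMMA between "foo" and "bar"
ascii_form = "foo, bar"     # ASCII comma followed by a space

print(tokens(fullwidth))    # the fullwidth comma stays inside a single token
print(tokens(ascii_form))   # two separate tokens

# Compatibility normalization (NFKC) folds U+FF0C to U+002C, after which
# the toy tokenizer sees a separator again.
print(tokens(unicodedata.normalize("NFKC", fullwidth)))
```

The two inputs render almost identically, yet the toy tokenizer treats the first as one identifier and the second as two.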

Regards.
Marco Cimarosti (marco.cimarosti@europe.com)

-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --

Date/Time:    Tue Aug 26 06:55:37 EDT 2003
Contact:      marco.cimarosti@europe.com
Report Type:  Technical Report or Tech Note issues

Feedback on UTR#31 (draft 1): Non-Latin Punctuation.

I suggest that a small set of non-Latin punctuation marks be added in  
class <Pattern_Syntax>. Each one of the punctuation marks that I am  
suggesting to include complies with the following conditions:

1) It is very similar in shape to an ASCII-range character which is  
already in class <Pattern_Syntax>;

2) It is very similar in function to an ASCII-range character which is  
already in class <Pattern_Syntax>;

3) It is used in the modern orthography of modern languages and/or it is  
commonly available on national keyboards;

4) It is not commonly used to form words or phrases which may be used as  
identifiers.

In practice, I am suggesting to add the following lines to section "4.1  
Proposed Pattern Properties":

	037E       ; Pattern_Syntax # GREEK QUESTION MARK
	0387       ; Pattern_Syntax # GREEK ANO TELEIA
	055C..055E ; Pattern_Syntax # ARMENIAN EXCLAMATION MARK..ARMENIAN QUESTION MARK
	0589       ; Pattern_Syntax # ARMENIAN FULL STOP
	05C0       ; Pattern_Syntax # HEBREW PUNCTUATION PASEQ
	05C3       ; Pattern_Syntax # HEBREW PUNCTUATION SOF PASUQ
	060C..060D ; Pattern_Syntax # ARABIC COMMA..ARABIC DATE SEPARATOR
	061B       ; Pattern_Syntax # ARABIC SEMICOLON
	061F       ; Pattern_Syntax # ARABIC QUESTION MARK
	066A..066C ; Pattern_Syntax # ARABIC PERCENT SIGN..ARABIC THOUSANDS SEPARATOR
	06D4       ; Pattern_Syntax # ARABIC FULL STOP
	066D       ; Pattern_Syntax # ARABIC FIVE POINTED STAR
	0964..0965 ; Pattern_Syntax # DEVANAGARI DANDA..DEVANAGARI DOUBLE DANDA
	10FB       ; Pattern_Syntax # GEORGIAN PARAGRAPH SEPARATOR
	1362..1368 ; Pattern_Syntax # ETHIOPIC FULL STOP..ETHIOPIC PARAGRAPH SEPARATOR

Rationale. Punctuation marks complying with conditions #1 to #3 may easily  
be confused with ASCII-range characters which are normally used in the  
syntax of computer languages and notations. Allowing such characters in  
identifiers would mean allowing identifiers which look almost identical to  
expressions of a totally different kind. E.g., an identifier such as  
"return;" (where ";" is U+037E GREEK QUESTION MARK) looks identical to the  
expression "return;" (identifier or keyword "return" + semicolon). However,  
punctuation marks ruled out by condition #4 (e.g. syllable separators,  
morpheme separators, abbreviation marks, diacritic marks, apostrophes) are  
excluded from my suggestion (i.e. I suggest allowing them in identifiers)  
because they are useful to form words or phrases which may act as  
identifiers.
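
As a side observation rather than part of the suggestion itself, the sketch below checks how canonical normalization treats three of the listed characters. The helper name `nfc_codepoints` is hypothetical. U+037E and U+0387 have canonical singleton decompositions, so NFC replaces them; U+060C has none and survives normalization unchanged, which means normalization alone does not remove the look-alike.

```python
import unicodedata

def nfc_codepoints(cp):
    """Return the code points produced by NFC-normalizing the single code point cp."""
    return [ord(c) for c in unicodedata.normalize("NFC", chr(cp))]

for cp in (0x037E, 0x0387, 0x060C):
    result = " ".join(f"U+{r:04X}" for r in nfc_codepoints(cp))
    print(f"U+{cp:04X} -> {result}")
# prints:
# U+037E -> U+003B
# U+0387 -> U+00B7
# U+060C -> U+060C
```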

Character-by-character rationale. The following table lists each  
suggested character along with the ASCII-range character which looks  
similar to it (as per condition #1 above) and the ASCII-range  
character which has a similar function to it (as per condition #2).

	Code	Cnd.#1	Cnd.#2	Character name

	037E	;	?	GREEK QUESTION MARK
	0387	.	;	GREEK ANO TELEIA
	055C	~	!	ARMENIAN EXCLAMATION MARK
	055D	`	,	ARMENIAN COMMA
	055E	^	?	ARMENIAN QUESTION MARK
	0589	:	.	ARMENIAN FULL STOP
	05C0	|	;	HEBREW PUNCTUATION PASEQ
	05C3	:	.	HEBREW PUNCTUATION SOF PASUQ
	060C	,	,	ARABIC COMMA
	060D	,	,	ARABIC DATE SEPARATOR
	061B	;	;	ARABIC SEMICOLON
	061F	?	?	ARABIC QUESTION MARK
	066A	%	%	ARABIC PERCENT SIGN
	066B	,	.	ARABIC DECIMAL SEPARATOR
	066C	,	,	ARABIC THOUSANDS SEPARATOR
	06D4	_	.	ARABIC FULL STOP
	066D	*	*	ARABIC FIVE POINTED STAR
	0964	|	.	DEVANAGARI DANDA
	0965	|	.	DEVANAGARI DOUBLE DANDA
	10FB	:	:	GEORGIAN PARAGRAPH SEPARATOR
	1362	:	.	ETHIOPIC FULL STOP
	1363	:	,	ETHIOPIC COMMA
	1364	:	;	ETHIOPIC SEMICOLON
	1365	:	:	ETHIOPIC COLON
	1366	:	:	ETHIOPIC PREFACE COLON
	1367	|	?	ETHIOPIC QUESTION MARK
	1368	:	.	ETHIOPIC PARAGRAPH SEPARATOR
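
The table above can be transcribed into a lookup for experimenting with the look-alike problem. This is a minimal sketch: the dictionary values are copied from the Cnd.#1 and Cnd.#2 columns, while the names `LOOKALIKES`, `visual_skeleton` and `listed_characters` are hypothetical and not part of the proposal.

```python
# Code point -> (ASCII look-alike per condition #1, ASCII function-alike per condition #2)
LOOKALIKES = {
    0x037E: (";", "?"), 0x0387: (".", ";"),
    0x055C: ("~", "!"), 0x055D: ("`", ","), 0x055E: ("^", "?"),
    0x0589: (":", "."),
    0x05C0: ("|", ";"), 0x05C3: (":", "."),
    0x060C: (",", ","), 0x060D: (",", ","),
    0x061B: (";", ";"), 0x061F: ("?", "?"),
    0x066A: ("%", "%"), 0x066B: (",", "."), 0x066C: (",", ","),
    0x06D4: ("_", "."), 0x066D: ("*", "*"),
    0x0964: ("|", "."), 0x0965: ("|", "."),
    0x10FB: (":", ":"),
    0x1362: (":", "."), 0x1363: (":", ","), 0x1364: (":", ";"),
    0x1365: (":", ":"), 0x1366: (":", ":"), 0x1367: ("|", "?"),
    0x1368: (":", "."),
}

def visual_skeleton(text):
    """Replace each listed character with its ASCII look-alike (condition #1)."""
    return "".join(LOOKALIKES.get(ord(c), (c,))[0] for c in text)

def listed_characters(text):
    """Return the code points in text that appear in the table."""
    return [ord(c) for c in text if ord(c) in LOOKALIKES]

candidate = "return\u037e"  # ends in U+037E GREEK QUESTION MARK
print(visual_skeleton(candidate))
print([f"U+{cp:04X}" for cp in listed_characters(candidate)])
```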

Regards.
Marco Cimarosti (marco.cimarosti@europe.com)