From: Marco Cimarosti (marco.cimarosti@essetre.it)
Date: Fri Aug 22 2003 - 09:04:21 EDT
Rick McGowan wrote:
> the process as possible so that it can be considered
> The draft is found at http://www.unicode.org/reports/tr31/
> and feedback can be submitted as described there.
(Before submitting official feedback, I'd like to discuss my comments here.
BTW, which "Type of Message" should I use in the feedback form? Is it OK to
use "Technical Report or Tech Note issues"?)
My two cents are both about adding characters to the <Pattern_Syntax> class of
"4.1 Proposed Pattern Properties".
IMHO:
1. Full-width, half-width, and "small" punctuation characters should be
in class <Pattern_Syntax>, like their "normal width" counterparts.
2. Non-Latin punctuation characters should be in class
<Pattern_Syntax>, like their Latin counterparts.
The rationale for suggestion 1 is that <wide>, <narrow> and <small>
compatibility characters are substantially identical (in appearance and
function) to their "normal width" counterparts. A parser allowing an
unquoted full-width punctuation character in an identifier is guaranteed to
cause confusion to the user.
E.g., consider the following expression:
foo,bar
To me, it *definitely* looks like two identifiers separated by a comma, and
I expect my parser to agree with me on this, even if the "comma" is actually
a full-width comma. I am not saying that the parser must necessarily accept
a full-width comma in that position: it is perfectly OK if the above
expression causes a syntax error such as: "Illegal character U+FF0C
(FULLWIDTH COMMA) after identifier <foo>".
But what the parser should absolutely *not* do, IMHO, is handle "foo,bar"
as a *single* identifier! Doing such a thing is guaranteed to cause me
trouble. E.g., I might receive a puzzling error message saying: "Parameter
missing: this statement requires 2 parameters", while I can *see* that there
*are* two parameters: "foo" and "bar"...
The rationale for suggestion 2 is very similar. E.g., the following
expression looks like a perfectly legal C++ or Java statement:
return;
If the compiler tells me: "Undeclared identifier", I may go crazy for the
whole day trying to figure out what's going on... But if it tells me "Illegal
character U+037E (GREEK QUESTION MARK) after keyword <return>", then I
immediately understand that something is wrong with that "semicolon".
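The two cases above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the draft's algorithm: ASCII punctuation (minus underscore) stands in for the draft's <Pattern_Syntax> set, an identifier is taken to be any run of characters that are neither whitespace nor syntax, and the names BASE_SYNTAX, PROPOSED and tokenize are hypothetical.

```python
import string
import unicodedata

# Assumption: ASCII punctuation other than "_" stands in for the draft's
# Pattern_Syntax set; the set listed in the draft itself is larger.
BASE_SYNTAX = set(string.punctuation) - {"_"}

# The two look-alikes discussed above, treated as proposed additions.
PROPOSED = {"\uFF0C", "\u037E"}  # FULLWIDTH COMMA, GREEK QUESTION MARK


def tokenize(text, syntax):
    """Split text into identifiers and syntax characters.

    An identifier is any run of characters that are neither whitespace nor
    in `syntax`. Syntax characters that this toy grammar does not use
    (anything outside BASE_SYNTAX) are reported as errors.
    """
    tokens = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch.isspace():
            i += 1
        elif ch in syntax:
            if ch not in BASE_SYNTAX:
                prev = tokens[-1] if tokens else "start of input"
                raise SyntaxError(
                    "Illegal character U+%04X (%s) after <%s>"
                    % (ord(ch), unicodedata.name(ch), prev)
                )
            tokens.append(ch)
            i += 1
        else:
            start = i
            while (i < len(text) and not text[i].isspace()
                   and text[i] not in syntax):
                i += 1
            tokens.append(text[start:i])
    return tokens
```

Without the proposed additions, tokenize("foo\uFF0Cbar", BASE_SYNTAX) yields one identifier containing the full-width comma; with BASE_SYNTAX | PROPOSED the same input raises an error that names the offending character.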
The reason I keep suggestions 1 and 2 separate is that, in the case of
<wide>, <narrow> and <small> compatibility characters, it is trivial to
determine the corresponding regular character, while in the case of
non-Latin punctuation there is room for discussing which punctuation
characters are similar enough (in function or appearance) to which Latin
punctuation character.
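To illustrate why the first case is the easy one: the compatibility decomposition in the Unicode Character Database already names the regular counterpart. A minimal sketch using Python's standard unicodedata module (the helper name regular_counterpart is hypothetical):

```python
import unicodedata

# Compatibility tags that mark width and "small" variants in the UCD.
COMPAT_TAGS = ("<wide>", "<narrow>", "<small>")


def regular_counterpart(ch):
    """Return the regular character for a <wide>, <narrow> or <small>
    compatibility variant, or None if ch is not such a variant."""
    fields = unicodedata.decomposition(ch).split()
    # A tagged single-character decomposition looks like "<wide> 002C".
    if len(fields) == 2 and fields[0] in COMPAT_TAGS:
        return chr(int(fields[1], 16))
    return None
```

For example, both U+FF0C (FULLWIDTH COMMA) and U+FE50 (SMALL COMMA) map to the ordinary comma, while U+037E (GREEK QUESTION MARK) has no such tag and returns None, which is why the non-Latin list needs a judgment call.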
For full-width, half-width, and "small" punctuation characters, my
suggestion is to add the following lines to "4.1 Proposed Pattern
Properties":
FE50..FE52 ; Pattern_Syntax # SMALL COMMA..SMALL FULL STOP
FE54..FE57 ; Pattern_Syntax # SMALL SEMICOLON..SMALL EXCLAMATION MARK
FE59..FE66 ; Pattern_Syntax # SMALL LEFT PARENTHESIS..SMALL EQUALS SIGN
FE68..FE6B ; Pattern_Syntax # SMALL REVERSE SOLIDUS..SMALL COMMERCIAL AT
FF01..FF0F ; Pattern_Syntax # FULLWIDTH EXCLAMATION MARK..FULLWIDTH SOLIDUS
FF1A..FF20 ; Pattern_Syntax # FULLWIDTH COLON..FULLWIDTH COMMERCIAL AT
FF3B..FF40 ; Pattern_Syntax # FULLWIDTH LEFT SQUARE BRACKET..FULLWIDTH GRAVE ACCENT
FF5B..FF5E ; Pattern_Syntax # FULLWIDTH LEFT CURLY BRACKET..FULLWIDTH TILDE
FF5F..FF61 ; Pattern_Syntax # FULLWIDTH LEFT WHITE PARENTHESIS..HALFWIDTH IDEOGRAPHIC FULL STOP
FF64       ; Pattern_Syntax # HALFWIDTH IDEOGRAPHIC COMMA
FFE0..FFE2 ; Pattern_Syntax # FULLWIDTH CENT SIGN..FULLWIDTH NOT SIGN
FFE4..FFE5 ; Pattern_Syntax # FULLWIDTH BROKEN BAR..FULLWIDTH YEN SIGN
FFE8..FFEE ; Pattern_Syntax # HALFWIDTH FORMS LIGHT VERTICAL..HALFWIDTH WHITE CIRCLE
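Lines in this "range ; property # comment" layout are easy to load into a lookup. The following is a rough sketch, assuming each entry sits on a single line; the names parse_ranges, has_property and SAMPLE are hypothetical, and SAMPLE holds only three of the lines above.

```python
def parse_ranges(lines):
    """Parse 'XXXX..YYYY ; Property # comment' lines into
    (first, last, property) tuples. Blank and comment-only lines are skipped."""
    ranges = []
    for line in lines:
        data = line.split("#", 1)[0].strip()
        if not data:
            continue
        code_field, prop = (part.strip() for part in data.split(";", 1))
        first, _, last = code_field.partition("..")
        # A single code point such as "FF64" is a range of length one.
        ranges.append((int(first, 16), int(last or first, 16), prop))
    return ranges


def has_property(ch, ranges, prop="Pattern_Syntax"):
    """True if ch falls in any parsed range carrying the given property."""
    cp = ord(ch)
    return any(lo <= cp <= hi and p == prop for lo, hi, p in ranges)


SAMPLE = [
    "FE50..FE52 ; Pattern_Syntax # SMALL COMMA..SMALL FULL STOP",
    "FF01..FF0F ; Pattern_Syntax # FULLWIDTH EXCLAMATION MARK..FULLWIDTH SOLIDUS",
    "FF64 ; Pattern_Syntax # HALFWIDTH IDEOGRAPHIC COMMA",
]
RANGES = parse_ranges(SAMPLE)
```

With these sample lines loaded, has_property("\uFF0C", RANGES) is true for the full-width comma, so a parser consulting the table would refuse it inside an identifier.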
For non-Latin punctuation characters, this is my tentative list of
characters that may cause trouble if used in identifiers, and which,
consequently, should be added to class <Pattern_Syntax>:
037E GREEK QUESTION MARK
0387 GREEK ANO TELEIA
055C ARMENIAN EXCLAMATION MARK
055D ARMENIAN COMMA
055E ARMENIAN QUESTION MARK
0589 ARMENIAN FULL STOP
060C ARABIC COMMA
060D ARABIC DATE SEPARATOR
061B ARABIC SEMICOLON
061F ARABIC QUESTION MARK
066A ARABIC PERCENT SIGN
066B ARABIC DECIMAL SEPARATOR
066C ARABIC THOUSANDS SEPARATOR
06D4 ARABIC FULL STOP
0964 DEVANAGARI DANDA
0965 DEVANAGARI DOUBLE DANDA
10FB GEORGIAN PARAGRAPH SEPARATOR
1362 ETHIOPIC FULL STOP
1363 ETHIOPIC COMMA
1364 ETHIOPIC SEMICOLON
1365 ETHIOPIC COLON
1366 ETHIOPIC PREFACE COLON
1367 ETHIOPIC QUESTION MARK
1368 ETHIOPIC PARAGRAPH SEPARATOR
166E CANADIAN SYLLABICS FULL STOP
1802 MONGOLIAN COMMA
1803 MONGOLIAN FULL STOP
1804 MONGOLIAN COLON
1808 MONGOLIAN MANCHU COMMA
1809 MONGOLIAN MANCHU FULL STOP
1944 LIMBU EXCLAMATION MARK
1945 LIMBU QUESTION MARK
But I am not 100% sure about all the above characters. Should any of them be
removed from the list (i.e., allowed in identifiers)?
The following list includes all the non-Latin punctuation characters that I
feel are not worth including in class <Pattern_Syntax>, because I think that,
for one reason or another, they would cause no problem in identifiers:
055A ARMENIAN APOSTROPHE
055B ARMENIAN EMPHASIS MARK
055F ARMENIAN ABBREVIATION MARK
058A ARMENIAN HYPHEN
05BE HEBREW PUNCTUATION MAQAF
05C0 HEBREW PUNCTUATION PASEQ
05C3 HEBREW PUNCTUATION SOF PASUQ
05F3 HEBREW PUNCTUATION GERESH
05F4 HEBREW PUNCTUATION GERSHAYIM
066D ARABIC FIVE POINTED STAR
0700 SYRIAC END OF PARAGRAPH
0701 SYRIAC SUPRALINEAR FULL STOP
0702 SYRIAC SUBLINEAR FULL STOP
0703 SYRIAC SUPRALINEAR COLON
0704 SYRIAC SUBLINEAR COLON
0705 SYRIAC HORIZONTAL COLON
0706 SYRIAC COLON SKEWED LEFT
0707 SYRIAC COLON SKEWED RIGHT
0708 SYRIAC SUPRALINEAR COLON SKEWED LEFT
0709 SYRIAC SUBLINEAR COLON SKEWED RIGHT
070A SYRIAC CONTRACTION
070B SYRIAC HARKLEAN OBELUS
070C SYRIAC HARKLEAN METOBELUS
070D SYRIAC HARKLEAN ASTERISCUS
0970 DEVANAGARI ABBREVIATION SIGN
0DF4 SINHALA PUNCTUATION KUNDDALIYA
0E4F THAI CHARACTER FONGMAN
0E5A THAI CHARACTER ANGKHANKHU
0E5B THAI CHARACTER KHOMUT
0F04 TIBETAN MARK INITIAL YIG MGO MDUN MA
0F05 TIBETAN MARK CLOSING YIG MGO SGAB MA
0F06 TIBETAN MARK CARET YIG MGO PHUR SHAD MA
0F07 TIBETAN MARK YIG MGO TSHEG SHAD MA
0F08 TIBETAN MARK SBRUL SHAD
0F09 TIBETAN MARK BSKUR YIG MGO
0F0A TIBETAN MARK BKA- SHOG YIG MGO
0F0B TIBETAN MARK INTERSYLLABIC TSHEG
0F0C TIBETAN MARK DELIMITER TSHEG BSTAR
0F0D TIBETAN MARK SHAD
0F0E TIBETAN MARK NYIS SHAD
0F0F TIBETAN MARK TSHEG SHAD
0F10 TIBETAN MARK NYIS TSHEG SHAD
0F11 TIBETAN MARK RIN CHEN SPUNGS SHAD
0F12 TIBETAN MARK RGYA GRAM SHAD
0F3A TIBETAN MARK GUG RTAGS GYON
0F3B TIBETAN MARK GUG RTAGS GYAS
0F3C TIBETAN MARK ANG KHANG GYON
0F3D TIBETAN MARK ANG KHANG GYAS
0F85 TIBETAN MARK PALUTA
104A MYANMAR SIGN LITTLE SECTION
104B MYANMAR SIGN SECTION
104C MYANMAR SYMBOL LOCATIVE
104D MYANMAR SYMBOL COMPLETED
104E MYANMAR SYMBOL AFOREMENTIONED
104F MYANMAR SYMBOL GENITIVE
1361 ETHIOPIC WORDSPACE
166D CANADIAN SYLLABICS CHI SIGN
169B OGHAM FEATHER MARK
169C OGHAM REVERSED FEATHER MARK
16EB RUNIC SINGLE PUNCTUATION
16EC RUNIC MULTIPLE PUNCTUATION
16ED RUNIC CROSS PUNCTUATION
1735 PHILIPPINE SINGLE PUNCTUATION
1736 PHILIPPINE DOUBLE PUNCTUATION
17D4 KHMER SIGN KHAN
17D5 KHMER SIGN BARIYOOSAN
17D6 KHMER SIGN CAMNUC PII KUUH
17D8 KHMER SIGN BEYYAL
17D9 KHMER SIGN PHNAEK MUAN
17DA KHMER SIGN KOOMUUT
1800 MONGOLIAN BIRGA
1801 MONGOLIAN ELLIPSIS
1805 MONGOLIAN FOUR DOTS
1806 MONGOLIAN TODO SOFT HYPHEN
1807 MONGOLIAN SIBE SYLLABLE BOUNDARY MARKER
180A MONGOLIAN NIRUGU
10100 AEGEAN WORD SEPARATOR LINE
10101 AEGEAN WORD SEPARATOR DOT
1039F UGARITIC WORD DIVIDER
Should any of the above characters be added to <Pattern_Syntax> (i.e. *not*
allowed in identifiers)?
_ Marco