RE: Proposed Draft UTR #31 - Syntax Characters

From: Marco Cimarosti (marco.cimarosti@essetre.it)
Date: Fri Aug 22 2003 - 09:04:21 EDT

Next message: Jim Allan: "RE: Proposed Draft UTR #31 - Syntax Characters"

Previous message: Jill.Ramonsky@Aculab.com: "RE: Proposed Draft UTR #31 - Syntax Characters"
Maybe in reply to: Rick McGowan: "Proposed Draft UTR #31 - Syntax Characters"
Next in thread: Peter Kirk: "Re: Proposed Draft UTR #31 - Syntax Characters"
Reply: Peter Kirk: "Re: Proposed Draft UTR #31 - Syntax Characters"
Reply: Mark Davis: "Re: Proposed Draft UTR #31 - Syntax Characters"

Rick McGowan wrote:
> the process as possible so that it can be considered
> The draft is found at http://www.unicode.org/reports/tr31/
> and feedback can be submitted as described there.

(Before submitting official feedback, I'd like to discuss my comments here.
BTW, which "Type of Message" should I use in the feedback form? Is it OK to
use "Technical Report or Tech Note issues"?)

My two cents are both about adding characters to the <Pattern_Syntax> class
in "4.1 Proposed Pattern Properties".

IMHO:

1. Full-width, half-width, and "small" punctuation characters should be
in class <Pattern_Syntax>, like their "normal width" counterparts.

2. Non-Latin punctuation characters should be in class
<Pattern_Syntax>, like their Latin counterparts.

The rationale for suggestion 1 is that <wide>, <narrow> and <small>
compatibility characters are substantially identical (in appearance and
function) to their "normal width" counterparts. A parser allowing an
unquoted full-width punctuation character in an identifier is guaranteed to
cause confusion to the user.

E.g., consider the following expression:

foo，bar

To me, it *definitely* looks like two identifiers separated by a comma, and
I expect my parser to agree with me on this, even if the "comma" is actually
a full-width comma. I am not saying that the parser must necessarily accept
a full-width comma in that position: it is perfectly OK if the above
expression causes a syntax error such as: "Illegal character U+FF0C
(FULLWIDTH COMMA) after identifier <foo>".

But what the parser should absolutely *not* do, IMHO, is handle "foo，bar"
as a *single* identifier! Doing so is guaranteed to cause me trouble.
E.g., I might receive a puzzling error message saying: "Parameter
missing: this statement requires 2 parameters", while I can *see* that there
*are* two parameters: "foo" and "bar"...
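
The behaviour asked for above can be sketched in a few lines of Python. This
is only an illustration under stated assumptions: the toy grammar
(ALLOWED_SYNTAX) and the function names are hypothetical, and the Unicode
general categories P* and S* are used as a rough stand-in for
<Pattern_Syntax>, which the draft defines by an explicit list.

```python
import unicodedata

# Hypothetical toy grammar: the only punctuation the parser accepts.
ALLOWED_SYNTAX = {",", ";", "(", ")"}

def is_syntax_like(ch):
    # Assumption: general categories P* and S* approximate <Pattern_Syntax>.
    # Connector punctuation (Pc, e.g. "_") is left available to identifiers.
    cat = unicodedata.category(ch)
    return cat[0] in "PS" and cat != "Pc"

def tokenize(text):
    tokens, ident = [], ""
    for ch in text:
        if ch.isspace() or is_syntax_like(ch):
            # Any syntax-like character ends the current identifier,
            # whether or not the grammar accepts it.
            if ident:
                tokens.append(("IDENT", ident))
                ident = ""
            if ch.isspace():
                continue
            if ch not in ALLOWED_SYNTAX:
                raise SyntaxError("Illegal character U+%04X (%s)"
                                  % (ord(ch), unicodedata.name(ch, "UNNAMED")))
            tokens.append(("PUNCT", ch))
        else:
            ident += ch
    if ident:
        tokens.append(("IDENT", ident))
    return tokens
```

With this sketch, tokenize("foo,bar") yields two identifiers separated by a
comma, while tokenize("foo\uff0cbar") raises a SyntaxError naming U+FF0C
(FULLWIDTH COMMA) instead of silently returning one long identifier.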

The rationale for suggestion 2 is very similar. E.g., the following
expression looks like a perfectly legal C++ or Java statement:

return;

If the compiler tells me: "Undeclared identifier", I may go crazy for a
whole day trying to figure out what's going on... But if it tells me "Illegal
character U+037E (GREEK QUESTION MARK) after keyword <return>", then I
immediately understand that something is wrong with that "semicolon".
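
As a rough illustration of how a tool could produce that kind of message,
the Python sketch below scans a source string for characters that are not
the expected punctuation mark but normalize to it. The function names are
hypothetical. It relies on the fact that U+037E is canonically equivalent to
U+003B; most of the non-Latin characters discussed in this message have no
such mapping, so this catches only part of the problem.

```python
import unicodedata

def describe(ch):
    # Format a character the way the error messages above do.
    return "U+%04X (%s)" % (ord(ch), unicodedata.name(ch, "UNNAMED"))

def find_lookalikes(source, expected=";"):
    """Return (index, description) pairs for characters that are not
    `expected` themselves but turn into it under NFKC normalization."""
    hits = []
    for i, ch in enumerate(source):
        if ch != expected and unicodedata.normalize("NFKC", ch) == expected:
            hits.append((i, describe(ch)))
    return hits

# The "semicolon" here is written as an escape so it survives copy and paste.
for index, what in find_lookalikes("return\u037e"):
    print("Illegal character %s at offset %d" % (what, index))
```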

The reason I keep suggestions 1 and 2 separate is that, in the case of
<wide>, <narrow> and <small> compatibility characters, it is trivial to
determine the corresponding regular character, while in the case of
non-Latin punctuation there is room for discussing which punctuation
characters are similar enough (in function or appearance) to which Latin
punctuation character.
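
To show why the first case is trivial: the Unicode Character Database tags
these compatibility decompositions with <wide>, <narrow> or <small>, so the
regular counterpart can be read straight from the data. A minimal Python
sketch (the function name is hypothetical):

```python
import unicodedata

WIDTH_TAGS = ("<wide>", "<narrow>", "<small>")

def regular_counterpart(ch):
    """Return the regular character for a <wide>, <narrow> or <small>
    compatibility character, or None if `ch` is not one of those."""
    # unicodedata.decomposition returns e.g. "<wide> 002C" for U+FF0C,
    # and an empty string for characters with no decomposition.
    fields = unicodedata.decomposition(ch).split()
    if len(fields) == 2 and fields[0] in WIDTH_TAGS:
        return chr(int(fields[1], 16))
    return None
```

For example, regular_counterpart("\uff0c") and regular_counterpart("\ufe50")
both return the ASCII comma, while an ordinary "," returns None. No
comparable lookup exists for deciding, say, which Latin character ARMENIAN
FULL STOP corresponds to, which is why suggestion 2 needs discussion.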

For full-width, half-width, and "small" punctuation characters, my
suggestion is to add the following lines to "4.1 Proposed Pattern
Properties":

        FE50..FE52 ; Pattern_Syntax # SMALL COMMA..SMALL FULL STOP
        FE54..FE57 ; Pattern_Syntax # SMALL SEMICOLON..SMALL EXCLAMATION MARK
        FE59..FE66 ; Pattern_Syntax # SMALL LEFT PARENTHESIS..SMALL EQUALS SIGN
        FE68..FE6B ; Pattern_Syntax # SMALL REVERSE SOLIDUS..SMALL COMMERCIAL AT
        FF01..FF0F ; Pattern_Syntax # FULLWIDTH EXCLAMATION MARK..FULLWIDTH SOLIDUS
        FF1A..FF20 ; Pattern_Syntax # FULLWIDTH COLON..FULLWIDTH COMMERCIAL AT
        FF3B..FF40 ; Pattern_Syntax # FULLWIDTH LEFT SQUARE BRACKET..FULLWIDTH GRAVE ACCENT
        FF5B..FF5E ; Pattern_Syntax # FULLWIDTH LEFT CURLY BRACKET..FULLWIDTH TILDE
        FF5F..FF61 ; Pattern_Syntax # FULLWIDTH LEFT WHITE PARENTHESIS..HALFWIDTH IDEOGRAPHIC FULL STOP
        FF64       ; Pattern_Syntax # HALFWIDTH IDEOGRAPHIC COMMA
        FFE0..FFE2 ; Pattern_Syntax # FULLWIDTH CENT SIGN..FULLWIDTH NOT SIGN
        FFE4..FFE5 ; Pattern_Syntax # FULLWIDTH BROKEN BAR..FULLWIDTH YEN SIGN
        FFE8..FFEE ; Pattern_Syntax # HALFWIDTH FORMS LIGHT VERTICAL..HALFWIDTH WHITE CIRCLE
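
For anyone who wants to experiment with the lines above, here is a small
Python sketch that reads entries in this "range ; property # comment" layout
and tests whether a character falls in one of the ranges. The function names
are hypothetical, and the parser assumes each entry sits on a single line.

```python
def parse_property_lines(text):
    """Parse 'XXXX..YYYY ; Property # comment' lines into
    (first, last, property) tuples."""
    ranges = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop the trailing comment
        if not line:
            continue
        codepoints, prop = [field.strip() for field in line.split(";")]
        first, _, last = codepoints.partition("..")
        # A single code point such as "FF64" is a one-element range.
        ranges.append((int(first, 16), int(last or first, 16), prop))
    return ranges

def has_property(ch, ranges, prop="Pattern_Syntax"):
    cp = ord(ch)
    return any(lo <= cp <= hi and p == prop for lo, hi, p in ranges)

sample = """
FE50..FE52 ; Pattern_Syntax # SMALL COMMA..SMALL FULL STOP
FF64       ; Pattern_Syntax # HALFWIDTH IDEOGRAPHIC COMMA
"""
ranges = parse_property_lines(sample)
print(has_property("\ufe51", ranges))   # inside FE50..FE52
print(has_property("\ufe53", ranges))   # outside every sample range
```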

For non-Latin punctuation characters, this is my tentative list of
characters that may cause trouble if used in identifiers, and which,
consequently, should be added to class <Pattern_Syntax>:

        037E GREEK QUESTION MARK
        0387 GREEK ANO TELEIA
        055C ARMENIAN EXCLAMATION MARK
        055D ARMENIAN COMMA
        055E ARMENIAN QUESTION MARK
        0589 ARMENIAN FULL STOP
        060C ARABIC COMMA
        060D ARABIC DATE SEPARATOR
        061B ARABIC SEMICOLON
        061F ARABIC QUESTION MARK
        066A ARABIC PERCENT SIGN
        066B ARABIC DECIMAL SEPARATOR
        066C ARABIC THOUSANDS SEPARATOR
        06D4 ARABIC FULL STOP
        0964 DEVANAGARI DANDA
        0965 DEVANAGARI DOUBLE DANDA
        10FB GEORGIAN PARAGRAPH SEPARATOR
        1362 ETHIOPIC FULL STOP
        1363 ETHIOPIC COMMA
        1364 ETHIOPIC SEMICOLON
        1365 ETHIOPIC COLON
        1366 ETHIOPIC PREFACE COLON
        1367 ETHIOPIC QUESTION MARK
        1368 ETHIOPIC PARAGRAPH SEPARATOR
        166E CANADIAN SYLLABICS FULL STOP
        1802 MONGOLIAN COMMA
        1803 MONGOLIAN FULL STOP
        1804 MONGOLIAN COLON
        1808 MONGOLIAN MANCHU COMMA
        1809 MONGOLIAN MANCHU FULL STOP
        1944 LIMBU EXCLAMATION MARK
        1945 LIMBU QUESTION MARK

But I am not 100% sure about all the above characters. Should any of them be
removed from the list (i.e., allowed in identifiers)?

The following list includes all the non-Latin punctuation characters that I
feel are not worth including in class <Pattern_Syntax>, because I think that,
for one reason or another, they would cause no problem in identifiers:

        055A ARMENIAN APOSTROPHE
        055B ARMENIAN EMPHASIS MARK
        055F ARMENIAN ABBREVIATION MARK
        058A ARMENIAN HYPHEN
        05BE HEBREW PUNCTUATION MAQAF
        05C0 HEBREW PUNCTUATION PASEQ
        05C3 HEBREW PUNCTUATION SOF PASUQ
        05F3 HEBREW PUNCTUATION GERESH
        05F4 HEBREW PUNCTUATION GERSHAYIM
        066D ARABIC FIVE POINTED STAR
        0700 SYRIAC END OF PARAGRAPH
        0701 SYRIAC SUPRALINEAR FULL STOP
        0702 SYRIAC SUBLINEAR FULL STOP
        0703 SYRIAC SUPRALINEAR COLON
        0704 SYRIAC SUBLINEAR COLON
        0705 SYRIAC HORIZONTAL COLON
        0706 SYRIAC COLON SKEWED LEFT
        0707 SYRIAC COLON SKEWED RIGHT
        0708 SYRIAC SUPRALINEAR COLON SKEWED LEFT
        0709 SYRIAC SUBLINEAR COLON SKEWED RIGHT
        070A SYRIAC CONTRACTION
        070B SYRIAC HARKLEAN OBELUS
        070C SYRIAC HARKLEAN METOBELUS
        070D SYRIAC HARKLEAN ASTERISCUS
        0970 DEVANAGARI ABBREVIATION SIGN
        0DF4 SINHALA PUNCTUATION KUNDDALIYA
        0E4F THAI CHARACTER FONGMAN
        0E5A THAI CHARACTER ANGKHANKHU
        0E5B THAI CHARACTER KHOMUT
        0F04 TIBETAN MARK INITIAL YIG MGO MDUN MA
        0F05 TIBETAN MARK CLOSING YIG MGO SGAB MA
        0F06 TIBETAN MARK CARET YIG MGO PHUR SHAD MA
        0F07 TIBETAN MARK YIG MGO TSHEG SHAD MA
        0F08 TIBETAN MARK SBRUL SHAD
        0F09 TIBETAN MARK BSKUR YIG MGO
        0F0A TIBETAN MARK BKA- SHOG YIG MGO
        0F0B TIBETAN MARK INTERSYLLABIC TSHEG
        0F0C TIBETAN MARK DELIMITER TSHEG BSTAR
        0F0D TIBETAN MARK SHAD
        0F0E TIBETAN MARK NYIS SHAD
        0F0F TIBETAN MARK TSHEG SHAD
        0F10 TIBETAN MARK NYIS TSHEG SHAD
        0F11 TIBETAN MARK RIN CHEN SPUNGS SHAD
        0F12 TIBETAN MARK RGYA GRAM SHAD
        0F3A TIBETAN MARK GUG RTAGS GYON
        0F3B TIBETAN MARK GUG RTAGS GYAS
        0F3C TIBETAN MARK ANG KHANG GYON
        0F3D TIBETAN MARK ANG KHANG GYAS
        0F85 TIBETAN MARK PALUTA
        104A MYANMAR SIGN LITTLE SECTION
        104B MYANMAR SIGN SECTION
        104C MYANMAR SYMBOL LOCATIVE
        104D MYANMAR SYMBOL COMPLETED
        104E MYANMAR SYMBOL AFOREMENTIONED
        104F MYANMAR SYMBOL GENITIVE
        1361 ETHIOPIC WORDSPACE
        166D CANADIAN SYLLABICS CHI SIGN
        169B OGHAM FEATHER MARK
        169C OGHAM REVERSED FEATHER MARK
        16EB RUNIC SINGLE PUNCTUATION
        16EC RUNIC MULTIPLE PUNCTUATION
        16ED RUNIC CROSS PUNCTUATION
        1735 PHILIPPINE SINGLE PUNCTUATION
        1736 PHILIPPINE DOUBLE PUNCTUATION
        17D4 KHMER SIGN KHAN
        17D5 KHMER SIGN BARIYOOSAN
        17D6 KHMER SIGN CAMNUC PII KUUH
        17D8 KHMER SIGN BEYYAL
        17D9 KHMER SIGN PHNAEK MUAN
        17DA KHMER SIGN KOOMUUT
        1800 MONGOLIAN BIRGA
        1801 MONGOLIAN ELLIPSIS
        1805 MONGOLIAN FOUR DOTS
        1806 MONGOLIAN TODO SOFT HYPHEN
        1807 MONGOLIAN SIBE SYLLABLE BOUNDARY MARKER
        180A MONGOLIAN NIRUGU
        10100 AEGEAN WORD SEPARATOR LINE
        10101 AEGEAN WORD SEPARATOR DOT
        1039F UGARITIC WORD DIVIDER

Should any of the above characters be added to <Pattern_Syntax> (i.e., *not*
allowed in identifiers)?

_ Marco


This archive was generated by hypermail 2.1.5 : Fri Aug 22 2003 - 10:12:39 EDT