Re: Proposed Draft UTR #31 - Syntax Characters

From: Mark Davis (mark.davis@jtcsv.com)
Date: Fri Aug 22 2003 - 11:33:36 EDT

Next message: Jony Rosenne: "RE: Proposed Draft UTR #31 - Syntax Characters"

Previous message: Mark Davis: "Re: Proposed Draft UTR #31 - Syntax Characters"
In reply to: Marco Cimarosti: "RE: Proposed Draft UTR #31 - Syntax Characters"
Next in thread: Jim Allan: "RE: Proposed Draft UTR #31 - Syntax Characters"

Technical Report issues would be fine.

I think #1 is worth considering. For #2, see other message to Peter Kirk.

Mark
__________________________________
http://www.macchiato.com
► “Eppur si muove” ◄

----- Original Message -----
From: "Marco Cimarosti" <marco.cimarosti@essetre.it>
To: <unicode@unicode.org>
Sent: Friday, August 22, 2003 06:04
Subject: RE: Proposed Draft UTR #31 - Syntax Characters

> Rick McGowan wrote:
> > the process as possible so that it can be considered
> > The draft is found at http://www.unicode.org/reports/tr31/
> > and feedback can be submitted as described there.
>
> (Before submitting official feedback, I'd like to discuss my comments here.
> BTW, which "Type of Message" should I use in the feedback form? Is it OK to
> use "Technical Report or Tech Note issues"?)
>
>
> My two cents are both about adding characters to the <Pattern_Syntax> class in
> "4.1 Proposed Pattern Properties".
>
> IMHO:
>
> 1. Full-width, half-width, and "small" punctuation characters should be in
> class <Pattern_Syntax>, like their "normal width" counterparts.
>
> 2. Non-Latin punctuation characters should be in class
> <Pattern_Syntax>, like their Latin counterparts.
>
> The rationale for suggestion 1 is that <wide>, <narrow> and <small>
> compatibility characters are substantially identical (in appearance and
> function) to their "normal width" counterparts. A parser allowing an
> unquoted full-width punctuation character in an identifier is guaranteed to
> cause confusion to the user.
>
> E.g., consider the following expression:
>
> foo，bar
>
> To me, it *definitely* looks like two identifiers separated by a comma, and
> I expect my parser to agree with me on this, even if the "comma" is actually
> a full-width comma. I am not saying that the parser must necessarily accept
> a full-width comma in that position: it is perfectly OK if the above
> expression causes a syntax error such as: "Illegal character U+FF0C
> (FULLWIDTH COMMA) after identifier <foo>".
>
> But what the parser should absolutely *not* do, IMHO, is handle "foo，bar"
> as a *single* identifier! Doing such a thing is guaranteed to cause me
> trouble. E.g., I might receive a puzzling error message saying: "Parameter
> missing: this statement requires 2 parameters", while I can *see* that there
> *are* two parameters: "foo" and "bar"...
>
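
[Editorial sketch, not part of the original message.] The behaviour described above could look roughly like the following minimal tokenizer. The names `tokenize`, `SEPARATORS`, and `LOOKALIKE_SYNTAX` are hypothetical, and the two-character look-alike set is only an illustrative stand-in for a full <Pattern_Syntax> class.

```python
import unicodedata

# Assumption for illustration: "," is the only legal separator, and these
# two look-alikes are treated as syntax characters that are never legal
# inside an identifier.
SEPARATORS = {","}
LOOKALIKE_SYNTAX = {"\uFF0C", "\u037E"}  # FULLWIDTH COMMA, GREEK QUESTION MARK


def tokenize(text):
    """Split text into identifiers and separators.

    A look-alike syntax character is rejected with an explicit error
    instead of being silently absorbed into the current identifier.
    """
    tokens = []
    ident = ""
    for ch in text:
        if ch in LOOKALIKE_SYNTAX:
            raise SyntaxError(
                f"Illegal character U+{ord(ch):04X} "
                f"({unicodedata.name(ch)}) after identifier <{ident}>"
            )
        if ch in SEPARATORS or ch.isspace():
            if ident:
                tokens.append(ident)
                ident = ""
            if ch in SEPARATORS:
                tokens.append(ch)
        else:
            ident += ch
    if ident:
        tokens.append(ident)
    return tokens
```

With an ASCII comma, `tokenize("foo,bar")` yields three tokens. With a full-width comma the call raises a `SyntaxError` that names the offending character, so "foo，bar" is never returned as a single identifier.
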
> The rationale for suggestion 2 is very similar. E.g., the following
> expression looks like a perfectly legal C++ or Java statement:
>
> return;
>
> If the compiler tells me: "Undeclared identifier", I may go crazy for the
> whole day trying to figure out what's going on... But if it tells me "Illegal
> character U+037E (GREEK QUESTION MARK) after keyword <return>", then I
> immediately understand that something is wrong with that "semicolon".
>
> The reason I keep suggestions 1 and 2 separate is that, in the case of
> <wide>, <narrow> and <small> compatibility characters, it is trivial to
> determine the corresponding regular character, while in the case of
> non-Latin punctuation there is room for discussing which punctuation
> characters are similar enough (in function or appearance) to which Latin
> punctuation character.
>
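
[Editorial sketch, not part of the original message.] One way to see why the compatibility case is "trivial" is that the Unicode Character Database already records the mapping as a tagged decomposition. The helper name `regular_counterpart` is hypothetical. It relies only on Python's standard `unicodedata.decomposition`, whose results depend on the Unicode version bundled with the interpreter.

```python
import unicodedata

COMPAT_TAGS = ("<wide>", "<narrow>", "<small>")


def regular_counterpart(ch):
    """Return the regular character for a <wide>, <narrow> or <small>
    compatibility character, or None if ch is not such a character."""
    # decomposition() returns e.g. "<wide> 002C" for U+FF0C, "" if none.
    parts = unicodedata.decomposition(ch).split()
    if len(parts) == 2 and parts[0] in COMPAT_TAGS:
        return chr(int(parts[1], 16))
    return None
```

For example, both U+FF0C FULLWIDTH COMMA and U+FE50 SMALL COMMA map to the ASCII comma. U+037E GREEK QUESTION MARK has no such tag, so the helper returns `None` for it, which is why the non-Latin list needs separate discussion.
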
> For full-width, half-width, and "small" punctuation characters, my
> suggestion is to add the following lines to "4.1 Proposed Pattern
> Properties":
>
> FE50..FE52 ; Pattern_Syntax # SMALL COMMA..SMALL FULL STOP
> FE54..FE57 ; Pattern_Syntax # SMALL SEMICOLON..SMALL EXCLAMATION MARK
> FE59..FE66 ; Pattern_Syntax # SMALL LEFT PARENTHESIS..SMALL EQUALS SIGN
> FE68..FE6B ; Pattern_Syntax # SMALL REVERSE SOLIDUS..SMALL COMMERCIAL AT
> FF01..FF0F ; Pattern_Syntax # FULLWIDTH EXCLAMATION MARK..FULLWIDTH SOLIDUS
> FF1A..FF20 ; Pattern_Syntax # FULLWIDTH COLON..FULLWIDTH COMMERCIAL AT
> FF3B..FF40 ; Pattern_Syntax # FULLWIDTH LEFT SQUARE BRACKET..FULLWIDTH GRAVE ACCENT
> FF5B..FF5E ; Pattern_Syntax # FULLWIDTH LEFT CURLY BRACKET..FULLWIDTH TILDE
> FF5F..FF61 ; Pattern_Syntax # FULLWIDTH LEFT WHITE PARENTHESIS..HALFWIDTH IDEOGRAPHIC FULL STOP
> FF64 ; Pattern_Syntax # HALFWIDTH IDEOGRAPHIC COMMA
> FFE0..FFE2 ; Pattern_Syntax # FULLWIDTH CENT SIGN..FULLWIDTH NOT SIGN
> FFE4..FFE5 ; Pattern_Syntax # FULLWIDTH BROKEN BAR..FULLWIDTH YEN SIGN
> FFE8..FFEE ; Pattern_Syntax # HALFWIDTH FORMS LIGHT VERTICAL..HALFWIDTH WHITE CIRCLE
>
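
[Editorial sketch, not part of the original message.] Lines in the format above (code point or range, a semicolon, the property name, then a `#` comment) can be loaded into a simple lookup. The function names `parse_property_lines` and `has_property` are hypothetical, and the parser assumes one entry per line.

```python
def parse_property_lines(text):
    """Parse 'XXXX[..YYYY] ; Property # comment' lines into
    (first, last, property) tuples of integer code points."""
    ranges = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not line:
            continue
        code_field, prop = [field.strip() for field in line.split(";")][:2]
        if ".." in code_field:
            first, last = code_field.split("..")
        else:
            first = last = code_field
        ranges.append((int(first, 16), int(last, 16), prop))
    return ranges


def has_property(ch, prop, ranges):
    """True if the character falls in any range tagged with prop."""
    cp = ord(ch)
    return any(lo <= cp <= hi and p == prop for lo, hi, p in ranges)


sample = """\
FE50..FE52 ; Pattern_Syntax # SMALL COMMA..SMALL FULL STOP
FF64 ; Pattern_Syntax # HALFWIDTH IDEOGRAPHIC COMMA
"""
ranges = parse_property_lines(sample)
```

A lexer could then call `has_property(ch, "Pattern_Syntax", ranges)` before accepting a character as part of an identifier. A linear scan is enough for a list of this size. A sorted table with binary search would suit a larger one.
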
> For non-Latin punctuation characters, this is my tentative list of
> characters that may cause trouble if used in identifiers, and which,
> consequently, should be added to class <Pattern_Syntax>:
>
> 037E GREEK QUESTION MARK
> 0387 GREEK ANO TELEIA
> 055C ARMENIAN EXCLAMATION MARK
> 055D ARMENIAN COMMA
> 055E ARMENIAN QUESTION MARK
> 0589 ARMENIAN FULL STOP
> 060C ARABIC COMMA
> 060D ARABIC DATE SEPARATOR
> 061B ARABIC SEMICOLON
> 061F ARABIC QUESTION MARK
> 066A ARABIC PERCENT SIGN
> 066B ARABIC DECIMAL SEPARATOR
> 066C ARABIC THOUSANDS SEPARATOR
> 06D4 ARABIC FULL STOP
> 0964 DEVANAGARI DANDA
> 0965 DEVANAGARI DOUBLE DANDA
> 10FB GEORGIAN PARAGRAPH SEPARATOR
> 1362 ETHIOPIC FULL STOP
> 1363 ETHIOPIC COMMA
> 1364 ETHIOPIC SEMICOLON
> 1365 ETHIOPIC COLON
> 1366 ETHIOPIC PREFACE COLON
> 1367 ETHIOPIC QUESTION MARK
> 1368 ETHIOPIC PARAGRAPH SEPARATOR
> 166E CANADIAN SYLLABICS FULL STOP
> 1802 MONGOLIAN COMMA
> 1803 MONGOLIAN FULL STOP
> 1804 MONGOLIAN COLON
> 1808 MONGOLIAN MANCHU COMMA
> 1809 MONGOLIAN MANCHU FULL STOP
> 1944 LIMBU EXCLAMATION MARK
> 1945 LIMBU QUESTION MARK
>
> But I am not 100% sure about all the above characters. Should any of them be
> removed from the list (i.e., allowed in identifiers)?
>
> The following list includes all the non-Latin punctuation characters which I
> feel are not worth including in class <Pattern_Syntax>, because I think that,
> for one reason or another, they would cause no problem in identifiers:
>
> 055A ARMENIAN APOSTROPHE
> 055B ARMENIAN EMPHASIS MARK
> 055F ARMENIAN ABBREVIATION MARK
> 058A ARMENIAN HYPHEN
> 05BE HEBREW PUNCTUATION MAQAF
> 05C0 HEBREW PUNCTUATION PASEQ
> 05C3 HEBREW PUNCTUATION SOF PASUQ
> 05F3 HEBREW PUNCTUATION GERESH
> 05F4 HEBREW PUNCTUATION GERSHAYIM
> 066D ARABIC FIVE POINTED STAR
> 0700 SYRIAC END OF PARAGRAPH
> 0701 SYRIAC SUPRALINEAR FULL STOP
> 0702 SYRIAC SUBLINEAR FULL STOP
> 0703 SYRIAC SUPRALINEAR COLON
> 0704 SYRIAC SUBLINEAR COLON
> 0705 SYRIAC HORIZONTAL COLON
> 0706 SYRIAC COLON SKEWED LEFT
> 0707 SYRIAC COLON SKEWED RIGHT
> 0708 SYRIAC SUPRALINEAR COLON SKEWED LEFT
> 0709 SYRIAC SUBLINEAR COLON SKEWED RIGHT
> 070A SYRIAC CONTRACTION
> 070B SYRIAC HARKLEAN OBELUS
> 070C SYRIAC HARKLEAN METOBELUS
> 070D SYRIAC HARKLEAN ASTERISCUS
> 0970 DEVANAGARI ABBREVIATION SIGN
> 0DF4 SINHALA PUNCTUATION KUNDDALIYA
> 0E4F THAI CHARACTER FONGMAN
> 0E5A THAI CHARACTER ANGKHANKHU
> 0E5B THAI CHARACTER KHOMUT
> 0F04 TIBETAN MARK INITIAL YIG MGO MDUN MA
> 0F05 TIBETAN MARK CLOSING YIG MGO SGAB MA
> 0F06 TIBETAN MARK CARET YIG MGO PHUR SHAD MA
> 0F07 TIBETAN MARK YIG MGO TSHEG SHAD MA
> 0F08 TIBETAN MARK SBRUL SHAD
> 0F09 TIBETAN MARK BSKUR YIG MGO
> 0F0A TIBETAN MARK BKA- SHOG YIG MGO
> 0F0B TIBETAN MARK INTERSYLLABIC TSHEG
> 0F0C TIBETAN MARK DELIMITER TSHEG BSTAR
> 0F0D TIBETAN MARK SHAD
> 0F0E TIBETAN MARK NYIS SHAD
> 0F0F TIBETAN MARK TSHEG SHAD
> 0F10 TIBETAN MARK NYIS TSHEG SHAD
> 0F11 TIBETAN MARK RIN CHEN SPUNGS SHAD
> 0F12 TIBETAN MARK RGYA GRAM SHAD
> 0F3A TIBETAN MARK GUG RTAGS GYON
> 0F3B TIBETAN MARK GUG RTAGS GYAS
> 0F3C TIBETAN MARK ANG KHANG GYON
> 0F3D TIBETAN MARK ANG KHANG GYAS
> 0F85 TIBETAN MARK PALUTA
> 104A MYANMAR SIGN LITTLE SECTION
> 104B MYANMAR SIGN SECTION
> 104C MYANMAR SYMBOL LOCATIVE
> 104D MYANMAR SYMBOL COMPLETED
> 104E MYANMAR SYMBOL AFOREMENTIONED
> 104F MYANMAR SYMBOL GENITIVE
> 1361 ETHIOPIC WORDSPACE
> 166D CANADIAN SYLLABICS CHI SIGN
> 169B OGHAM FEATHER MARK
> 169C OGHAM REVERSED FEATHER MARK
> 16EB RUNIC SINGLE PUNCTUATION
> 16EC RUNIC MULTIPLE PUNCTUATION
> 16ED RUNIC CROSS PUNCTUATION
> 1735 PHILIPPINE SINGLE PUNCTUATION
> 1736 PHILIPPINE DOUBLE PUNCTUATION
> 17D4 KHMER SIGN KHAN
> 17D5 KHMER SIGN BARIYOOSAN
> 17D6 KHMER SIGN CAMNUC PII KUUH
> 17D8 KHMER SIGN BEYYAL
> 17D9 KHMER SIGN PHNAEK MUAN
> 17DA KHMER SIGN KOOMUUT
> 1800 MONGOLIAN BIRGA
> 1801 MONGOLIAN ELLIPSIS
> 1805 MONGOLIAN FOUR DOTS
> 1806 MONGOLIAN TODO SOFT HYPHEN
> 1807 MONGOLIAN SIBE SYLLABLE BOUNDARY MARKER
> 180A MONGOLIAN NIRUGU
> 10100 AEGEAN WORD SEPARATOR LINE
> 10101 AEGEAN WORD SEPARATOR DOT
> 1039F UGARITIC WORD DIVIDER
>
> Should any of the above characters be added to <Pattern_Syntax> (i.e. *not*
> allowed in identifiers)?
>
> _ Marco
>
>


This archive was generated by hypermail 2.1.5 : Fri Aug 22 2003 - 12:29:42 EDT