Re: Proposed Draft UTR #31 - Syntax Characters

From: Mark Davis (mark.davis@jtcsv.com)
Date: Fri Aug 22 2003 - 11:33:36 EDT


    Technical Report issues would be fine.

    I think #1 is worth considering. For #2, see other message to Peter Kirk.

    Mark
    __________________________________
    http://www.macchiato.com
    ► “Eppur si muove” ◄

    ----- Original Message -----
    From: "Marco Cimarosti" <marco.cimarosti@essetre.it>
    To: <unicode@unicode.org>
    Sent: Friday, August 22, 2003 06:04
    Subject: RE: Proposed Draft UTR #31 - Syntax Characters

    > Rick McGowan wrote:
    > > the process as possible so that it can be considered
    > > The draft is found at http://www.unicode.org/reports/tr31/
    > > and feedback can be submitted as described there.
    >
    > (Before submitting official feedback, I'd like to discuss my comments here.
    > BTW, which "Type of Message" should I use in the feedback form? Is it OK to
    > use "Technical Report or Tech Note issues"?)
    >
    >
    > My two cents are both about adding characters in the <Pattern_Syntax> of
    > "4.1 Proposed Pattern Properties".
    >
    > IMHO:
    >
    > 1. Full-width, half-width, and "small" punctuation characters should be
    > in class <Pattern_Syntax>, like their "normal width" counterparts.
    >
    > 2. Non-Latin punctuation characters should be in class
    > <Pattern_Syntax>, like their Latin counterparts.
    >
    > The rationale for suggestion 1 is that <wide>, <narrow> and <small>
    > compatibility characters are substantially identical (in appearance and
    > function) to their "normal width" counterparts. A parser allowing an
    > unquoted full-width punctuation character in an identifier is guaranteed to
    > cause confusion to the user.
    >
    > E.g., consider the following expression:
    >
    > foo,bar
    >
    > To me, it *definitely* looks like two identifiers separated by a comma, and
    > I expect my parser to agree with me on this, even if the "comma" is actually
    > a full-width comma. I am not saying that the parser must necessarily accept
    > a full-width comma in that position: it is perfectly OK if the above
    > expression causes a syntax error such as: "Illegal character U+FF0C
    > (FULLWIDTH COMMA) after identifier <foo>".
    >
    > But what the parser should absolutely *not* do, IMHO, is handle "foo,bar"
    > as a *single* identifier! Doing such a thing is guaranteed to cause me
    > trouble. E.g., I might receive a puzzling error message saying: "Parameter
    > missing: this statement requires 2 parameters", while I can *see* that there
    > *are* two parameters: "foo" and "bar"...
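
The behaviour argued for above can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: PATTERN_SYNTAX here is a two-character stand-in for the proposed property (not the real data), and tokenize and ALLOWED_SYNTAX are hypothetical names for a toy grammar that accepts only the ASCII comma.

```python
import unicodedata

# Hypothetical stand-in for the proposed Pattern_Syntax set: the ASCII comma
# plus U+FF0C FULLWIDTH COMMA. This is NOT the real property data.
PATTERN_SYNTAX = {",", "\uFF0C"}

# Syntax characters that this toy grammar actually accepts.
ALLOWED_SYNTAX = {","}


def tokenize(text):
    """Split text into identifiers and commas.

    Any character in PATTERN_SYNTAX ends the current identifier. If the
    grammar does not accept that character, a SyntaxError names it instead
    of silently absorbing it into the identifier.
    """
    tokens = []
    current = ""
    for ch in text:
        if ch in PATTERN_SYNTAX:
            ident = current
            if current:
                tokens.append(current)
                current = ""
            if ch not in ALLOWED_SYNTAX:
                where = f" after identifier <{ident}>" if ident else ""
                raise SyntaxError(
                    f"Illegal character U+{ord(ch):04X} "
                    f"({unicodedata.name(ch)}){where}"
                )
            tokens.append(ch)
        else:
            current += ch
    if current:
        tokens.append(current)
    return tokens
```

With this sketch, an ASCII comma splits the input into two identifiers, while a full-width comma stops tokenization with an error that names the offending character. In neither case does the full-width comma end up inside an identifier.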
    >
    > The rationale for suggestion 2 is very similar. E.g., the following
    > expression looks like a perfectly legal C++ or Java statement:
    >
    > return;
    >
    > If the compiler tells me: "Undeclared identifier", I may go crazy for the
    > whole day trying to figure out what's going on... But if it tells me "Illegal
    > character U+037E (GREEK QUESTION MARK) after keyword <return>", then I
    > immediately understand that something is wrong with that "semicolon".
    >
    > The reason I keep suggestions 1 and 2 separate is that, in the case of
    > <wide>, <narrow> and <small> compatibility characters, it is trivial to
    > determine the corresponding regular character, while in the case of
    > non-Latin punctuation there is room for discussing which punctuation
    > characters are similar enough (in function or appearance) to which Latin
    > punctuation character.
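
One way to see why the <wide>, <narrow> and <small> case is mechanical is that the counterpart is recorded in each character's compatibility decomposition. The sketch below reads it through Python's standard unicodedata module; width_counterpart and WIDTH_TAGS are hypothetical names, and the results depend on the Unicode data shipped with the Python version in use.

```python
import unicodedata

# Compatibility decomposition tags covered by suggestion 1.
WIDTH_TAGS = {"<wide>", "<narrow>", "<small>"}


def width_counterpart(ch):
    """Return the regular character for a <wide>/<narrow>/<small>
    compatibility character, or None if ch has no such decomposition."""
    # unicodedata.decomposition returns e.g. "<wide> 002C", or "" if none.
    parts = unicodedata.decomposition(ch).split()
    if len(parts) == 2 and parts[0] in WIDTH_TAGS:
        return chr(int(parts[1], 16))
    return None
```

For example, width_counterpart("\uFF0C") yields the ASCII comma, while a character such as U+060C ARABIC COMMA has no decomposition at all and yields None. The missing mapping for the second kind is what leaves suggestion 2 open to discussion.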
    >
    > For full-width, half-width, and "small" punctuation characters, my
    > suggestion is to add the following lines to "4.1 Proposed Pattern
    > Properties":
    >
    > FE50..FE52 ; Pattern_Syntax # SMALL COMMA..SMALL FULL STOP
    > FE54..FE57 ; Pattern_Syntax # SMALL SEMICOLON..SMALL EXCLAMATION MARK
    > FE59..FE66 ; Pattern_Syntax # SMALL LEFT PARENTHESIS..SMALL EQUALS SIGN
    > FE68..FE6B ; Pattern_Syntax # SMALL REVERSE SOLIDUS..SMALL COMMERCIAL AT
    > FF01..FF0F ; Pattern_Syntax # FULLWIDTH EXCLAMATION MARK..FULLWIDTH SOLIDUS
    > FF1A..FF20 ; Pattern_Syntax # FULLWIDTH COLON..FULLWIDTH COMMERCIAL AT
    > FF3B..FF40 ; Pattern_Syntax # FULLWIDTH LEFT SQUARE BRACKET..FULLWIDTH GRAVE ACCENT
    > FF5B..FF5E ; Pattern_Syntax # FULLWIDTH LEFT CURLY BRACKET..FULLWIDTH TILDE
    > FF5F..FF61 ; Pattern_Syntax # FULLWIDTH LEFT WHITE PARENTHESIS..HALFWIDTH IDEOGRAPHIC FULL STOP
    > FF64 ; Pattern_Syntax # HALFWIDTH IDEOGRAPHIC COMMA
    > FFE0..FFE2 ; Pattern_Syntax # FULLWIDTH CENT SIGN..FULLWIDTH NOT SIGN
    > FFE4..FFE5 ; Pattern_Syntax # FULLWIDTH BROKEN BAR..FULLWIDTH YEN SIGN
    > FFE8..FFEE ; Pattern_Syntax # HALFWIDTH FORMS LIGHT VERTICAL..HALFWIDTH WHITE CIRCLE
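
Lines in this "range ; property # comment" layout are straightforward to load into a lookup set. The following is a rough sketch only: parse_property_lines is a hypothetical helper, SAMPLE copies two of the lines proposed above, and the parser assumes each entry sits on a single line with exactly one ";" separator.

```python
# Two of the proposed lines, used as sample input.
SAMPLE = """\
FE50..FE52 ; Pattern_Syntax # SMALL COMMA..SMALL FULL STOP
FF64 ; Pattern_Syntax # HALFWIDTH IDEOGRAPHIC COMMA
"""


def parse_property_lines(text):
    """Parse 'XXXX..YYYY ; Property # comment' lines into a dict that maps
    each property name to the set of code points listed for it."""
    result = {}
    for line in text.splitlines():
        # Drop the trailing comment and skip blank lines.
        data = line.split("#", 1)[0].strip()
        if not data:
            continue
        range_part, prop = (field.strip() for field in data.split(";"))
        # A single code point has no '..'; treat it as a one-element range.
        first, _, last = range_part.partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start
        result.setdefault(prop, set()).update(range(start, end + 1))
    return result
```

A parser could then test membership with something like `ord(ch) in parse_property_lines(SAMPLE)["Pattern_Syntax"]` before deciding whether a character may continue an identifier.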
    >
    > For non-Latin punctuation characters, this is my tentative list of
    > characters that may cause trouble if used in identifiers, and which,
    > consequently, should be added to class <Pattern_Syntax>:
    >
    > 037E GREEK QUESTION MARK
    > 0387 GREEK ANO TELEIA
    > 055C ARMENIAN EXCLAMATION MARK
    > 055D ARMENIAN COMMA
    > 055E ARMENIAN QUESTION MARK
    > 0589 ARMENIAN FULL STOP
    > 060C ARABIC COMMA
    > 060D ARABIC DATE SEPARATOR
    > 061B ARABIC SEMICOLON
    > 061F ARABIC QUESTION MARK
    > 066A ARABIC PERCENT SIGN
    > 066B ARABIC DECIMAL SEPARATOR
    > 066C ARABIC THOUSANDS SEPARATOR
    > 06D4 ARABIC FULL STOP
    > 0964 DEVANAGARI DANDA
    > 0965 DEVANAGARI DOUBLE DANDA
    > 10FB GEORGIAN PARAGRAPH SEPARATOR
    > 1362 ETHIOPIC FULL STOP
    > 1363 ETHIOPIC COMMA
    > 1364 ETHIOPIC SEMICOLON
    > 1365 ETHIOPIC COLON
    > 1366 ETHIOPIC PREFACE COLON
    > 1367 ETHIOPIC QUESTION MARK
    > 1368 ETHIOPIC PARAGRAPH SEPARATOR
    > 166E CANADIAN SYLLABICS FULL STOP
    > 1802 MONGOLIAN COMMA
    > 1803 MONGOLIAN FULL STOP
    > 1804 MONGOLIAN COLON
    > 1808 MONGOLIAN MANCHU COMMA
    > 1809 MONGOLIAN MANCHU FULL STOP
    > 1944 LIMBU EXCLAMATION MARK
    > 1945 LIMBU QUESTION MARK
    >
    > But I am not 100% sure about all the above characters. Should any of them be
    > removed from the list (i.e., allowed in identifiers)?
    >
    > The following list includes all the non-Latin punctuation characters which I
    > feel are not worth including in class <Pattern_Syntax>, because I think that,
    > for one reason or another, they would cause no problem in identifiers:
    >
    > 055A ARMENIAN APOSTROPHE
    > 055B ARMENIAN EMPHASIS MARK
    > 055F ARMENIAN ABBREVIATION MARK
    > 058A ARMENIAN HYPHEN
    > 05BE HEBREW PUNCTUATION MAQAF
    > 05C0 HEBREW PUNCTUATION PASEQ
    > 05C3 HEBREW PUNCTUATION SOF PASUQ
    > 05F3 HEBREW PUNCTUATION GERESH
    > 05F4 HEBREW PUNCTUATION GERSHAYIM
    > 066D ARABIC FIVE POINTED STAR
    > 0700 SYRIAC END OF PARAGRAPH
    > 0701 SYRIAC SUPRALINEAR FULL STOP
    > 0702 SYRIAC SUBLINEAR FULL STOP
    > 0703 SYRIAC SUPRALINEAR COLON
    > 0704 SYRIAC SUBLINEAR COLON
    > 0705 SYRIAC HORIZONTAL COLON
    > 0706 SYRIAC COLON SKEWED LEFT
    > 0707 SYRIAC COLON SKEWED RIGHT
    > 0708 SYRIAC SUPRALINEAR COLON SKEWED LEFT
    > 0709 SYRIAC SUBLINEAR COLON SKEWED RIGHT
    > 070A SYRIAC CONTRACTION
    > 070B SYRIAC HARKLEAN OBELUS
    > 070C SYRIAC HARKLEAN METOBELUS
    > 070D SYRIAC HARKLEAN ASTERISCUS
    > 0970 DEVANAGARI ABBREVIATION SIGN
    > 0DF4 SINHALA PUNCTUATION KUNDDALIYA
    > 0E4F THAI CHARACTER FONGMAN
    > 0E5A THAI CHARACTER ANGKHANKHU
    > 0E5B THAI CHARACTER KHOMUT
    > 0F04 TIBETAN MARK INITIAL YIG MGO MDUN MA
    > 0F05 TIBETAN MARK CLOSING YIG MGO SGAB MA
    > 0F06 TIBETAN MARK CARET YIG MGO PHUR SHAD MA
    > 0F07 TIBETAN MARK YIG MGO TSHEG SHAD MA
    > 0F08 TIBETAN MARK SBRUL SHAD
    > 0F09 TIBETAN MARK BSKUR YIG MGO
    > 0F0A TIBETAN MARK BKA- SHOG YIG MGO
    > 0F0B TIBETAN MARK INTERSYLLABIC TSHEG
    > 0F0C TIBETAN MARK DELIMITER TSHEG BSTAR
    > 0F0D TIBETAN MARK SHAD
    > 0F0E TIBETAN MARK NYIS SHAD
    > 0F0F TIBETAN MARK TSHEG SHAD
    > 0F10 TIBETAN MARK NYIS TSHEG SHAD
    > 0F11 TIBETAN MARK RIN CHEN SPUNGS SHAD
    > 0F12 TIBETAN MARK RGYA GRAM SHAD
    > 0F3A TIBETAN MARK GUG RTAGS GYON
    > 0F3B TIBETAN MARK GUG RTAGS GYAS
    > 0F3C TIBETAN MARK ANG KHANG GYON
    > 0F3D TIBETAN MARK ANG KHANG GYAS
    > 0F85 TIBETAN MARK PALUTA
    > 104A MYANMAR SIGN LITTLE SECTION
    > 104B MYANMAR SIGN SECTION
    > 104C MYANMAR SYMBOL LOCATIVE
    > 104D MYANMAR SYMBOL COMPLETED
    > 104E MYANMAR SYMBOL AFOREMENTIONED
    > 104F MYANMAR SYMBOL GENITIVE
    > 1361 ETHIOPIC WORDSPACE
    > 166D CANADIAN SYLLABICS CHI SIGN
    > 169B OGHAM FEATHER MARK
    > 169C OGHAM REVERSED FEATHER MARK
    > 16EB RUNIC SINGLE PUNCTUATION
    > 16EC RUNIC MULTIPLE PUNCTUATION
    > 16ED RUNIC CROSS PUNCTUATION
    > 1735 PHILIPPINE SINGLE PUNCTUATION
    > 1736 PHILIPPINE DOUBLE PUNCTUATION
    > 17D4 KHMER SIGN KHAN
    > 17D5 KHMER SIGN BARIYOOSAN
    > 17D6 KHMER SIGN CAMNUC PII KUUH
    > 17D8 KHMER SIGN BEYYAL
    > 17D9 KHMER SIGN PHNAEK MUAN
    > 17DA KHMER SIGN KOOMUUT
    > 1800 MONGOLIAN BIRGA
    > 1801 MONGOLIAN ELLIPSIS
    > 1805 MONGOLIAN FOUR DOTS
    > 1806 MONGOLIAN TODO SOFT HYPHEN
    > 1807 MONGOLIAN SIBE SYLLABLE BOUNDARY MARKER
    > 180A MONGOLIAN NIRUGU
    > 10100 AEGEAN WORD SEPARATOR LINE
    > 10101 AEGEAN WORD SEPARATOR DOT
    > 1039F UGARITIC WORD DIVIDER
    >
    > Should any of the above characters be added to <Pattern_Syntax> (i.e. *not*
    > allowed in identifiers)?
    >
    > _ Marco
    >
    >


