RE: Proposed Draft UTR #31 - Syntax Characters

From: Marco Cimarosti (marco.cimarosti@essetre.it)
Date: Fri Aug 22 2003 - 09:04:21 EDT


    Rick McGowan wrote:
    > the process as possible so that it can be considered
    > The draft is found at http://www.unicode.org/reports/tr31/
    > and feedback can be submitted as described there.

    (Before submitting official feedback, I'd like to discuss my comments here.
    BTW, which "Type of Message" should I use in the feedback form? Is it OK to
    use "Technical Report or Tech Note issues"?)

    My two cents are both about adding characters in the <Pattern_Syntax> of
    "4.1 Proposed Pattern Properties".

    IMHO:

            1. Full-width, half-width, and "small" punctuation characters should
    be in class <Pattern_Syntax>, like their "normal width" counterparts.

            2. Non-Latin punctuation characters should be in class
    <Pattern_Syntax>, like their Latin counterparts.

    The rationale for suggestion 1 is that <wide>, <narrow> and <small>
    compatibility characters are substantially identical (in appearance and
    function) to their "normal width" counterparts. A parser allowing an
    unquoted full-width punctuation character in an identifier is guaranteed to
    cause confusion to the user.
     
    E.g., consider the following expression:

            foo,bar

    To me, it *definitely* looks like two identifiers separated by a comma, and
    I expect my parser to agree with me on this, even if the "comma" is actually
    a full-width comma. I am not saying that the parser must necessarily accept
    a full-width comma in that position: it is perfectly OK if the above
    expression causes a syntax error such as: "Illegal character U+FF0C
    (FULLWIDTH COMMA) after identifier <foo>".

    But what the parser should absolutely *not* do, IMHO, is handle "foo,bar"
    as a *single* identifier! Doing such a thing is guaranteed to cause me
    trouble. E.g., I might receive a puzzling error message saying: "Parameter
    missing: this statement requires 2 parameters", while I can *see* that there
    *are* two parameters: "foo" and "bar"...
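
    A minimal Python sketch of the behaviour argued for above. It assumes a
    toy grammar (identifiers separated by ASCII commas) and approximates the
    extended syntax class with an NFKC check; the names ASCII_SYNTAX,
    is_syntax and split_identifiers are hypothetical, not part of any draft.

```python
import unicodedata

# Assumption: ASCII punctuation stands in for the syntax class ("_" is left
# out so that it can still appear in identifiers).
ASCII_SYNTAX = set("!\"#$%&'()*+,-./:;<=>?@[\\]^`{|}~")


def is_syntax(ch):
    """True if ch is, or normalizes (NFKC) to, one ASCII syntax character."""
    return unicodedata.normalize("NFKC", ch) in ASCII_SYNTAX


def split_identifiers(text):
    """Split text at ASCII commas; reject any other syntax character."""
    identifiers, current = [], ""
    for ch in text:
        if ch == ",":
            identifiers.append(current)
            current = ""
        elif is_syntax(ch):
            # Never absorb a punctuation look-alike into the identifier.
            raise SyntaxError(
                "Illegal character U+%04X (%s) after identifier <%s>"
                % (ord(ch), unicodedata.name(ch, "UNNAMED"), current)
            )
        else:
            current += ch
    identifiers.append(current)
    return identifiers
```

    With an ASCII comma, split_identifiers("foo,bar") yields two identifiers;
    with U+FF0C in the same position it raises the "Illegal character" error
    instead of returning one long identifier.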

    The rationale for suggestion 2 is very similar. E.g., the following
    expression looks like a perfectly legal C++ or Java statement:

            return;

    If the compiler tells me: "Undeclared identifier", I may go crazy for the
    whole day trying to figure out what's going on... But if it tells me "Illegal
    character U+037E (GREEK QUESTION MARK) after keyword <return>", then I
    immediately understand that something is wrong with that "semicolon".
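
    As an illustration of the second kind of diagnostic, here is a small
    Python sketch that reports every non-ASCII punctuation character in a
    source line by code point and name. The function name
    non_ascii_punctuation is hypothetical, and using the general category
    (P*) as the test is an assumption, not the rule proposed in the draft.

```python
import unicodedata


def non_ascii_punctuation(source):
    """Return (offset, 'U+XXXX', name) for each non-ASCII punctuation character."""
    found = []
    for offset, ch in enumerate(source):
        # General categories starting with "P" are the punctuation classes.
        if ord(ch) > 0x7F and unicodedata.category(ch).startswith("P"):
            found.append(
                (offset, "U+%04X" % ord(ch), unicodedata.name(ch, "UNNAMED"))
            )
    return found
```

    For the statement above typed with U+037E, this returns one entry naming
    GREEK QUESTION MARK at offset 6; with a real semicolon it returns an
    empty list.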

    The reason I keep suggestions 1 and 2 separate is that, in the case of
    <wide>, <narrow> and <small> compatibility characters, it is trivial to
    determine the corresponding regular character, while in the case of
    non-Latin punctuation there is room for discussing which punctuation
    characters are similar enough (in function or appearance) to which Latin
    punctuation character.
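
    To show why the first case is trivial, here is a short Python sketch that
    reads the compatibility decomposition from the Unicode Character Database
    (via the standard unicodedata module). The name regular_counterpart is
    hypothetical.

```python
import unicodedata

WIDTH_TAGS = ("<wide>", "<narrow>", "<small>")


def regular_counterpart(ch):
    """Return the regular character for a <wide>, <narrow> or <small>
    compatibility character, or None if ch is not one of those."""
    fields = unicodedata.decomposition(ch).split()
    # Expected shape: a tag followed by exactly one hexadecimal code point.
    if len(fields) == 2 and fields[0] in WIDTH_TAGS:
        return chr(int(fields[1], 16))
    return None
```

    For example, regular_counterpart("\uff0c") gives ",", while
    regular_counterpart("\u037e") gives None: GREEK QUESTION MARK has no
    width tag, so it falls into the second case, where the mapping is a
    matter of judgment.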

    For full-width, half-width, and "small" punctuation characters, my
    suggestion is to add the following lines to "4.1 Proposed Pattern
    Properties":

            FE50..FE52 ; Pattern_Syntax # SMALL COMMA..SMALL FULL STOP
            FE54..FE57 ; Pattern_Syntax # SMALL SEMICOLON..SMALL EXCLAMATION MARK
            FE59..FE66 ; Pattern_Syntax # SMALL LEFT PARENTHESIS..SMALL EQUALS SIGN
            FE68..FE6B ; Pattern_Syntax # SMALL REVERSE SOLIDUS..SMALL COMMERCIAL AT
            FF01..FF0F ; Pattern_Syntax # FULLWIDTH EXCLAMATION MARK..FULLWIDTH SOLIDUS
            FF1A..FF20 ; Pattern_Syntax # FULLWIDTH COLON..FULLWIDTH COMMERCIAL AT
            FF3B..FF40 ; Pattern_Syntax # FULLWIDTH LEFT SQUARE BRACKET..FULLWIDTH GRAVE ACCENT
            FF5B..FF5E ; Pattern_Syntax # FULLWIDTH LEFT CURLY BRACKET..FULLWIDTH TILDE
            FF5F..FF61 ; Pattern_Syntax # FULLWIDTH LEFT WHITE PARENTHESIS..HALFWIDTH IDEOGRAPHIC FULL STOP
            FF64 ; Pattern_Syntax # HALFWIDTH IDEOGRAPHIC COMMA
            FFE0..FFE2 ; Pattern_Syntax # FULLWIDTH CENT SIGN..FULLWIDTH NOT SIGN
            FFE4..FFE5 ; Pattern_Syntax # FULLWIDTH BROKEN BAR..FULLWIDTH YEN SIGN
            FFE8..FFEE ; Pattern_Syntax # HALFWIDTH FORMS LIGHT VERTICAL..HALFWIDTH WHITE CIRCLE
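
    Lines in this format are easy to consume mechanically. The following
    Python sketch parses them into code point ranges and tests membership;
    parse_property_lines and has_property are hypothetical helper names, and
    the regular expression only assumes the "range ; property # comment"
    layout shown above.

```python
import re

# One code point or a "first..last" range, then ";" and the property name.
LINE = re.compile(
    r"^\s*([0-9A-Fa-f]{4,6})(?:\.\.([0-9A-Fa-f]{4,6}))?\s*;\s*(\w+)"
)


def parse_property_lines(text, prop="Pattern_Syntax"):
    """Return a list of (first, last) code point ranges that carry prop."""
    ranges = []
    for line in text.splitlines():
        match = LINE.match(line)
        if match and match.group(3) == prop:
            first = int(match.group(1), 16)
            # A single code point is treated as a one-element range.
            last = int(match.group(2) or match.group(1), 16)
            ranges.append((first, last))
    return ranges


def has_property(ch, ranges):
    """True if the code point of ch falls inside any of the ranges."""
    cp = ord(ch)
    return any(first <= cp <= last for first, last in ranges)
```

    A parser could build the ranges once at start-up and call has_property
    for each character it is about to add to an identifier.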

    For non-Latin punctuation characters, this is my tentative list of
    characters that may cause trouble if used in identifiers, and which,
    consequently, should be added to class <Pattern_Syntax>:

            037E GREEK QUESTION MARK
            0387 GREEK ANO TELEIA
            055C ARMENIAN EXCLAMATION MARK
            055D ARMENIAN COMMA
            055E ARMENIAN QUESTION MARK
            0589 ARMENIAN FULL STOP
            060C ARABIC COMMA
            060D ARABIC DATE SEPARATOR
            061B ARABIC SEMICOLON
            061F ARABIC QUESTION MARK
            066A ARABIC PERCENT SIGN
            066B ARABIC DECIMAL SEPARATOR
            066C ARABIC THOUSANDS SEPARATOR
            06D4 ARABIC FULL STOP
            0964 DEVANAGARI DANDA
            0965 DEVANAGARI DOUBLE DANDA
            10FB GEORGIAN PARAGRAPH SEPARATOR
            1362 ETHIOPIC FULL STOP
            1363 ETHIOPIC COMMA
            1364 ETHIOPIC SEMICOLON
            1365 ETHIOPIC COLON
            1366 ETHIOPIC PREFACE COLON
            1367 ETHIOPIC QUESTION MARK
            1368 ETHIOPIC PARAGRAPH SEPARATOR
            166E CANADIAN SYLLABICS FULL STOP
            1802 MONGOLIAN COMMA
            1803 MONGOLIAN FULL STOP
            1804 MONGOLIAN COLON
            1808 MONGOLIAN MANCHU COMMA
            1809 MONGOLIAN MANCHU FULL STOP
            1944 LIMBU EXCLAMATION MARK
            1945 LIMBU QUESTION MARK

    But I am not 100% sure about all the above characters. Should any of them be
    removed from the list (i.e., allowed in identifiers)?

    The following list includes all the non-Latin punctuation characters that I
    feel are not worth including in class <Pattern_Syntax>, because I think that,
    for one reason or another, they would cause no problem in identifiers:

            055A ARMENIAN APOSTROPHE
            055B ARMENIAN EMPHASIS MARK
            055F ARMENIAN ABBREVIATION MARK
            058A ARMENIAN HYPHEN
            05BE HEBREW PUNCTUATION MAQAF
            05C0 HEBREW PUNCTUATION PASEQ
            05C3 HEBREW PUNCTUATION SOF PASUQ
            05F3 HEBREW PUNCTUATION GERESH
            05F4 HEBREW PUNCTUATION GERSHAYIM
            066D ARABIC FIVE POINTED STAR
            0700 SYRIAC END OF PARAGRAPH
            0701 SYRIAC SUPRALINEAR FULL STOP
            0702 SYRIAC SUBLINEAR FULL STOP
            0703 SYRIAC SUPRALINEAR COLON
            0704 SYRIAC SUBLINEAR COLON
            0705 SYRIAC HORIZONTAL COLON
            0706 SYRIAC COLON SKEWED LEFT
            0707 SYRIAC COLON SKEWED RIGHT
            0708 SYRIAC SUPRALINEAR COLON SKEWED LEFT
            0709 SYRIAC SUBLINEAR COLON SKEWED RIGHT
            070A SYRIAC CONTRACTION
            070B SYRIAC HARKLEAN OBELUS
            070C SYRIAC HARKLEAN METOBELUS
            070D SYRIAC HARKLEAN ASTERISCUS
            0970 DEVANAGARI ABBREVIATION SIGN
            0DF4 SINHALA PUNCTUATION KUNDDALIYA
            0E4F THAI CHARACTER FONGMAN
            0E5A THAI CHARACTER ANGKHANKHU
            0E5B THAI CHARACTER KHOMUT
            0F04 TIBETAN MARK INITIAL YIG MGO MDUN MA
            0F05 TIBETAN MARK CLOSING YIG MGO SGAB MA
            0F06 TIBETAN MARK CARET YIG MGO PHUR SHAD MA
            0F07 TIBETAN MARK YIG MGO TSHEG SHAD MA
            0F08 TIBETAN MARK SBRUL SHAD
            0F09 TIBETAN MARK BSKUR YIG MGO
            0F0A TIBETAN MARK BKA- SHOG YIG MGO
            0F0B TIBETAN MARK INTERSYLLABIC TSHEG
            0F0C TIBETAN MARK DELIMITER TSHEG BSTAR
            0F0D TIBETAN MARK SHAD
            0F0E TIBETAN MARK NYIS SHAD
            0F0F TIBETAN MARK TSHEG SHAD
            0F10 TIBETAN MARK NYIS TSHEG SHAD
            0F11 TIBETAN MARK RIN CHEN SPUNGS SHAD
            0F12 TIBETAN MARK RGYA GRAM SHAD
            0F3A TIBETAN MARK GUG RTAGS GYON
            0F3B TIBETAN MARK GUG RTAGS GYAS
            0F3C TIBETAN MARK ANG KHANG GYON
            0F3D TIBETAN MARK ANG KHANG GYAS
            0F85 TIBETAN MARK PALUTA
            104A MYANMAR SIGN LITTLE SECTION
            104B MYANMAR SIGN SECTION
            104C MYANMAR SYMBOL LOCATIVE
            104D MYANMAR SYMBOL COMPLETED
            104E MYANMAR SYMBOL AFOREMENTIONED
            104F MYANMAR SYMBOL GENITIVE
            1361 ETHIOPIC WORDSPACE
            166D CANADIAN SYLLABICS CHI SIGN
            169B OGHAM FEATHER MARK
            169C OGHAM REVERSED FEATHER MARK
            16EB RUNIC SINGLE PUNCTUATION
            16EC RUNIC MULTIPLE PUNCTUATION
            16ED RUNIC CROSS PUNCTUATION
            1735 PHILIPPINE SINGLE PUNCTUATION
            1736 PHILIPPINE DOUBLE PUNCTUATION
            17D4 KHMER SIGN KHAN
            17D5 KHMER SIGN BARIYOOSAN
            17D6 KHMER SIGN CAMNUC PII KUUH
            17D8 KHMER SIGN BEYYAL
            17D9 KHMER SIGN PHNAEK MUAN
            17DA KHMER SIGN KOOMUUT
            1800 MONGOLIAN BIRGA
            1801 MONGOLIAN ELLIPSIS
            1805 MONGOLIAN FOUR DOTS
            1806 MONGOLIAN TODO SOFT HYPHEN
            1807 MONGOLIAN SIBE SYLLABLE BOUNDARY MARKER
            180A MONGOLIAN NIRUGU
            10100 AEGEAN WORD SEPARATOR LINE
            10101 AEGEAN WORD SEPARATOR DOT
            1039F UGARITIC WORD DIVIDER

    Should any of the above characters be added to <Pattern_Syntax> (i.e., *not*
    allowed in identifiers)?

    _ Marco


