Re: Proposed Draft UTR #31 - Syntax Characters

From: Peter Kirk
Date: Fri Aug 22 2003 - 15:47:59 EDT

    On 22/08/2003 11:29, Marco Cimarosti wrote:

    >Mark Davis wrote:
    >>Technical Report issues would be fine.
    >>I think #1 is worth considering. For #2, see other message to
    >>Peter Kirk.
    >I agree with your statement: "The purpose of the Pattern Syntax characters
    >is *not* to list everything that is a symbol or punctuation mark". But that
    >is what Kirk suggested, not what I proposed.
    >I proposed to exclude a *limited* set of script-specific punctuation that
    >*might* be confused with punctuation characters normally used in the syntax
    >of computer languages, either because they look identical, or because they
    >are perceived culturally as "another form of the same character".
    >E.g., I kept out from the list everything belonging to ancient scripts
    >(who's going to write programs in Linear B!?) and anything that I suspected
    >would be valid inside a word or expression: hyphens, emphasis markers,
    >ellipsis marks, etc.
    >You said that the list of ranges must be invariable, but I doubt that we
    >will see many new *modern* and *commonly* used punctuation marks in future
    >versions, so I think that this requirement for invariability can reasonably be
    >I already made the example of the Greek question mark which may be mistaken
    >for a semicolon. That is *not* an unlikely situation: if a Greek programmer
    >has his keyboard in Greek mode (because he just finished typing an
    >identifier containing Greek letters) he may well forget to turn it to Latin
    >mode before typing the trailing semicolon.
    >Similarly, due to the fact that some punctuation characters (parentheses,
    >etc.) are mirrored in a RTL context, an Arab programmer may think that "؟"
    >is just an alternate RTL glyph for "?", so he may be puzzled by apparently
    >absurd error messages.
    >E.g., he types:
    > foo?bar
    >And the system calls routine "foo" passing variable "bar" to it. ("?" is the
    >"call" operator of this hypothetical programming language).
    >So, he switches to Arabic mode and types:
    > فو؟بار
    >But the system says: "Undeclared identifier". But he is *sure* that he did
    >declare a routine named "فو", and a variable named "بار", so what's going
    >on? If the system said instead: "Character '؟' is not a legal operator",
    >everything would be much clearer.
    >_ Marco
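
    The diagnostic Marco asks for can be sketched roughly as below. This is
    only an illustration in Python for his hypothetical language in which "?"
    is the "call" operator; the names LOOKALIKES, OPERATORS and tokenize are
    assumptions made up for this sketch, and the two-entry table covers only
    the Greek and Arabic question marks mentioned above, not any list defined
    by the draft report.

```python
import unicodedata

# Hypothetical table: look-alike punctuation -> the ASCII operator it resembles.
LOOKALIKES = {
    "\u037e": ";",  # GREEK QUESTION MARK, renders like a semicolon
    "\u061f": "?",  # ARABIC QUESTION MARK
}
OPERATORS = {"?", ";"}  # "?" is the "call" operator of the hypothetical language


def tokenize(source):
    """Split source into ("ident", text) and ("op", char) tokens."""
    tokens, ident = [], ""
    for ch in source:
        if ch.isalnum() or ch == "_":
            ident += ch  # letters of any script may appear in identifiers
            continue
        if ident:
            tokens.append(("ident", ident))
            ident = ""
        if ch in OPERATORS:
            tokens.append(("op", ch))
        elif ch in LOOKALIKES:
            # Name the offending character instead of failing later with
            # a misleading "Undeclared identifier".
            raise SyntaxError(
                f"Character '{ch}' (U+{ord(ch):04X} {unicodedata.name(ch)}) "
                f"is not a legal operator; did you mean '{LOOKALIKES[ch]}'?"
            )
        elif not ch.isspace():
            raise SyntaxError(f"Unexpected character U+{ord(ch):04X}")
    if ident:
        tokens.append(("ident", ident))
    return tokens
```

    With this sketch, tokenize("foo?bar") yields an identifier, the operator
    and another identifier, while the Arabic version of the same line raises
    an error that names the Arabic question mark rather than complaining
    about an undeclared identifier.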
    Well, the situation with Hebrew sof pasuq is almost identical to that
    for Greek and Arabic question marks, except that it is functionally a
    full stop not a question mark, so I can't see any reason other than
    prejudice for omitting it from the list.

    Similarly, Hebrew geresh and gershayim look like quotation marks and are
    used interchangeably with them in legacy encodings. The same applies to
    maqaf and hyphen: maqaf is very much the cultural equivalent of hyphen,
    and I have seen recent discussion about whether the hyphen key on a
    Hebrew keyboard ought actually to generate a maqaf. As an ordinary Latin
    hyphen is already in the list, by your argument there is no reason to
    exclude other things that look like it and function like it.

    I'm not talking about biblical Hebrew here; I'm talking about a living
    modern language, though one which was once almost dead and was revived.
    More recently, Tifinagh was an obsolescent oddity but now looks like
    becoming the standard script for half the population of Morocco and
    maybe then of several neighbouring countries - tens of millions of
    potential users of the script. That implies that we may need to add
    Tifinagh punctuation to the list, so we can't make the list unchangeable
    until Tifinagh is defined. I point this out to sound a caution even
    about excluding ancient scripts from a list which is supposed to be
    unchangeable for the life of the Unicode standard.

    So it seems to me that we should either restrict the list to Latin, and
    make that very explicit with a mechanism for defining alternative sets
    of symbols for other scripts, or else extend the list to include all
    punctuation and to allow as yet undefined characters to be added to it.
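
    To make the two alternatives concrete, here is a rough Python sketch that
    applies each policy to the Hebrew marks discussed above. Both predicates
    are assumptions invented for illustration: neither is the property
    actually defined in the draft report, and general category "P" is used
    here only as a stand-in for "all punctuation".

```python
import unicodedata


def is_syntax_ascii_only(ch):
    """Policy 1 (assumed): only ASCII punctuation and symbols are syntax."""
    return ord(ch) < 0x80 and unicodedata.category(ch)[0] in "PS"


def is_syntax_all_punctuation(ch):
    """Policy 2 (assumed): any character in a punctuation category is syntax."""
    return unicodedata.category(ch).startswith("P")


# ASCII hyphen plus the four Hebrew marks mentioned in this message.
SAMPLES = ["-", "\u05be", "\u05c3", "\u05f3", "\u05f4"]

for ch in SAMPLES:
    print(
        f"U+{ord(ch):04X} {unicodedata.name(ch)}: "
        f"ascii-only={is_syntax_ascii_only(ch)}, "
        f"all-punctuation={is_syntax_all_punctuation(ch)}"
    )
```

    Under the first policy only the ASCII hyphen is treated as syntax; under
    the second, maqaf, sof pasuq, geresh and gershayim are treated the same
    way as the hyphen.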

    Peter Kirk
