Re: Proposed Draft UTR #31 - Syntax Characters

From: Peter Kirk (
Date: Tue Aug 26 2003 - 05:36:31 EDT

    On 26/08/2003 00:07, wrote:

    >I'm afraid that's not very practical, because, you see, if I have a
    >hypothetical compiler for some hypothetical programming-language, and I
    >download some source-code from the internet and try to compile it, I expect
    >one of two things: either (1) it will compile cleanly, or (2) I will have to
    >UPGRADE my compiler (or version of Unicode), after which it will compile.
    >I don't expect, however, to have to DOWNgrade my version of Unicode. And I
    >can't be expected to store EVERY numbered version of Unicode on my machine.
    >I prefer the idea that the list of allowed identifier characters increases
    >with each version of Unicode (or equivalently, that a list of excluded
    >characters decreases with each version of Unicode).

    Agreed. I thought I had made this clear, though perhaps some of the
    clarification was off-list. My preference is for a list of syntax
    (operator) characters which can be added to but not subtracted from.
    This should avoid any need to downgrade.
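
    A minimal sketch of the "add but never subtract" rule, assuming each
    version's syntax characters are held as a plain set. The function name
    and the example sets are hypothetical, invented only for illustration:

```python
def is_stable_extension(old_syntax, new_syntax):
    """Return True if the new list only adds syntax characters.

    Every character in the old list must still be present in the new
    one, so source written against the old list never needs a downgrade.
    """
    return set(old_syntax) <= set(new_syntax)


# Hypothetical lists for two imaginary versions.
v1 = {"*", "?"}
v2 = {"*", "?", "\u05C3"}  # adds HEBREW PUNCTUATION SOF PASUQ

assert is_stable_extension(v1, v2)      # additions only: allowed
assert not is_stable_extension(v2, v1)  # a removal: not allowed
```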

    I would also suggest that all punctuation characters and all undefined
    characters be reserved, i.e. they should not be used unquoted in strings,
    as they may be defined as syntax characters in later versions.
    Implementations would not be obliged to check for misuse of these
    reserved characters; it is up to the user to avoid them. (This kind of
    loose syntax may not be ideal, but it is common practice, e.g. with HTML,
    which most browsers do not fully validate. An implementation would be
    free to check against the list of reserved characters in the current UCD
    if preferred.) But a guarantee could be made that characters currently
    defined in Unicode as non-punctuation will never be defined as syntax
    characters.
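
    A rough sketch of the optional check against the UCD, using Python's
    standard unicodedata module. It assumes "punctuation" means the
    General_Category values beginning with P and "undefined" means
    unassigned code points (Cn). Both readings are assumptions, and the
    answer depends on the Unicode version bundled with the interpreter:

```python
import unicodedata


def is_reserved(ch):
    """True if ch is punctuation (P*) or an unassigned code point (Cn)."""
    category = unicodedata.category(ch)
    return category.startswith("P") or category == "Cn"


def unquoted_reserved(text):
    """List the reserved characters found in text, in order of appearance."""
    return [ch for ch in text if is_reserved(ch)]
```

    Note that under this reading symbol characters such as "+" (category
    Sm) are not reserved, since they are not classed as punctuation.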

    My suggestion is actually rather similar to what is already written in
    UTR #31 section 4:

    > With a fixed set of whitespace and syntax code points, a pattern
    > language can then have a policy requiring all possible syntax
    > characters (even ones currently unused) to be quoted if they are
    > literals. By using this policy, it preserves the freedom to extend the
    > syntax in the future by using those characters. Past patterns on
    > future systems will always work; future patterns on past systems will
    > signal an error instead of silently producing the wrong results.
    The difference is that I am extending the list of possible syntax
    characters to all punctuation characters. And perhaps a subset of these
    theoretically possible syntax characters can be defined as the allowed
    syntax characters in any one version of Unicode. But perhaps this isn't
    necessary, as each pattern language can define and check for its own
    subset as long as it only uses defined punctuation characters.
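
    To illustrate how a pattern language might define its own subset while
    treating every other punctuation character as reserved, here is a
    minimal sketch of a tokenizer. The active syntax set, the backslash
    quote character and the token names are all hypothetical; nothing here
    is taken from UTR #31 itself:

```python
import unicodedata

ACTIVE_SYNTAX = {"*", "?"}  # hypothetical subset used by this language
QUOTE = "\\"                # assumed quoting character


def is_reserved(ch):
    """Punctuation (P*) or unassigned (Cn); an assumed reading of 'reserved'."""
    category = unicodedata.category(ch)
    return category.startswith("P") or category == "Cn"


def tokenize(pattern):
    """Split pattern into (kind, char) tokens.

    Raises ValueError on an unquoted reserved character, so a pattern
    written for a later syntax fails loudly instead of being misread.
    """
    tokens = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == QUOTE:
            if i + 1 >= len(pattern):
                raise ValueError("dangling quote at end of pattern")
            tokens.append(("literal", pattern[i + 1]))
            i += 2
            continue
        if ch in ACTIVE_SYNTAX:
            tokens.append(("syntax", ch))
        elif is_reserved(ch):
            raise ValueError(f"unquoted reserved character U+{ord(ch):04X}")
        else:
            tokens.append(("literal", ch))
        i += 1
    return tokens
```

    With this sketch, tokenize("a\\!") accepts the quoted "!" as a
    literal, while tokenize("a!") raises an error because "!" is reserved
    but not part of the active syntax.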

    The reason why a change is needed is mainly to avoid the ethnocentric
    definition of only Latin punctuation characters as valid syntax
    characters. I have also seen the serious problems which have
    resulted from premature freezing of inappropriate properties e.g. the
    combining classes of Hebrew points.

    I am making these points in an official submission to the review process.

    Peter Kirk (personal) (work)