From: Peter Kirk (email@example.com)
Date: Tue Aug 26 2003 - 05:36:31 EDT
On 26/08/2003 00:07, Jill.Ramonsky@Aculab.com wrote:
>I'm afraid that's not very practical, because, you see, if I have a
>hypothetical compiler for some hypothetical programming-language, and I
>download some source-code from the internet and try to compile it, I expect
>one of two things, either (1) it will compile cleanly, or (2) I will have to
>UPGRADE my compiler (or version of Unicode), after which it will compile.
>I don't expect, however, to have to DOWNgrade my version of Unicode. And I
>can't be expected to store EVERY numbered version of Unicode on my machine.
>I prefer the idea that the list of allowed identifier characters increases
>with each version of Unicode (or equivalently, that a list of excluded
>characters decreases with each version of Unicode).
Agreed. I thought I had made this clear, though perhaps some of the
clarification was off-list. My preference is for a list of syntax
(operator) characters which can be added to but not subtracted from.
This should avoid any need to downgrade.
I would also suggest that all punctuation characters and all undefined
characters be reserved, i.e. they should not be used unquoted in strings,
as they may be defined as syntax characters in later versions.
Implementations would not be obliged to check for misuse of these
reserved characters; it is up to the user to avoid them. (This kind of
loose syntax may not be ideal, but it is common practice, e.g. with HTML,
which most browsers do not fully validate. An implementation would be
free to check against the list of reserved characters in the current UCD
if preferred.) But a guarantee could be made that characters currently
defined in Unicode as non-punctuation will never be defined as syntax
characters.
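
To make the reservation rule concrete, here is a minimal sketch in Python of the optional check an implementation might run. It rests on assumptions that are not part of the proposal itself: "punctuation" is taken to mean the Unicode general categories beginning with P, "undefined" is taken to mean category Cn, and "quoted" is taken to mean preceded by a backslash. The function names are hypothetical. Note that under this reading symbols such as "+" (category Sm) are not reserved.

```python
import unicodedata


def is_reserved(ch: str) -> bool:
    """True if ch is punctuation (category P*) or unassigned (Cn).

    Assumption: this is one reading of "punctuation and undefined
    characters"; the result depends on the UCD version Python ships.
    """
    cat = unicodedata.category(ch)
    return cat.startswith("P") or cat == "Cn"


def unquoted_reserved(text: str, escape: str = "\\") -> list:
    """Return (index, char) pairs for reserved characters that are not quoted.

    Assumption: a character is "quoted" when the escape character
    immediately precedes it.
    """
    found = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == escape and i + 1 < len(text):
            i += 2  # skip the escape and the character it quotes
            continue
        if is_reserved(ch):
            found.append((i, ch))
        i += 1
    return found


if __name__ == "__main__":
    # The unquoted "." is reported; the backslash-quoted one is not.
    print(unquoted_reserved("a.b"))
    print(unquoted_reserved("a\\.b"))
```

Because the check is optional, a loose implementation could simply skip the call to `unquoted_reserved`, while a strict one could reject any string for which it returns a non-empty list.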
My suggestion is actually rather similar to what is already written in
UTR #31 section 4:
> With a fixed set of whitespace and syntax code points, a pattern
> language can then have a policy requiring all possible syntax
> characters (even ones currently unused) to be quoted if they are
> literals. By using this policy, it preserves the freedom to extend the
> syntax in the future by using those characters. Past patterns on
> future systems will always work; future patterns on past systems will
> signal an error instead of silently producing the wrong results.
The difference is that I am extending the list of possible syntax
characters to all punctuation characters. And perhaps a subset of these
theoretically possible syntax characters can be defined as the allowed
syntax characters in any one version of Unicode. But perhaps this isn't
necessary, as each pattern language can define and check for its own
subset as long as it only uses defined punctuation characters.
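
As a rough illustration of a pattern language checking its own subset, the following Python sketch reports any proposed syntax character that is not a defined punctuation character in the Unicode version available at run time. The function name and the sample syntax sets are hypothetical, and "punctuation" is again assumed to mean general category P*.

```python
import unicodedata


def not_defined_punctuation(syntax_chars: str) -> list:
    """Return the proposed syntax characters that are not assigned punctuation.

    Assumption: "defined punctuation" means general category P* in the
    UCD version bundled with this Python build.
    """
    return [ch for ch in syntax_chars
            if not unicodedata.category(ch).startswith("P")]


if __name__ == "__main__":
    print("UCD version:", unicodedata.unidata_version)
    # A hypothetical Latin-only syntax set, and one that adds Hebrew punctuation
    # (U+05BE MAQAF, U+05C3 SOF PASUQ).
    print(not_defined_punctuation("*?[]"))
    print(not_defined_punctuation("*?[]\u05BE\u05C3"))
    # "+" is a math symbol (Sm), so it is reported under this reading.
    print(not_defined_punctuation("*+"))
```

An empty result means the language's syntax set stays within the punctuation characters defined in that version of Unicode.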
The reason why a change is needed is mainly to avoid the ethnocentric
definition of only Latin punctuation characters as valid syntax
characters. I have also seen the serious problems which have
resulted from premature freezing of inappropriate properties, e.g. the
combining classes of Hebrew points.
I am making these points in an official submission to the review process.
-- Peter Kirk firstname.lastname@example.org (personal) email@example.com (work) http://www.qaya.org/