From: Martin v. Löwis (martin@v.loewis.de)
Date: Wed Aug 15 2007 - 06:25:13 CDT
> The characters that cause problems are not a fixed list; they are
> programmatically detected and removed when the data is derived. This is
> done by looking at the NFKC version of each character that qualifies as
> part of an identifier (according to that particular version of Unicode)
> and removing the character from XID when the result would not be
> consistent. That is,
>
> 1. when the character is XID_Start, the NFKC sequence has to be
> <XID_Start XID_Continue*> or the character is removed
> 2. when the character is XID_Continue, the NFKC sequence has to be
> <XID_Continue+> or the character is removed.
>
> Particular characters may be re-added (grandfathered) to ID to make the
> definition be backwards compatible.
>
> Does that help?
Almost. I find the version that Asmus gave more precise, as it mentions
that U+00B7 is added to XID_Continue as the first step. Also, it's
puzzling to read that characters may be added to ID - this is a derived
property, so you can't add to it (although you can add to Other_ID_Start
and Other_ID_Continue, which aren't derived).
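The two rules quoted above can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: `is_start` and `is_continue` are hypothetical category-based stand-ins for the real properties, they ignore Other_ID_Start, Other_ID_Continue and the U+00B7 step, and the NFKC result is checked against these stand-ins rather than against the final XID sets.

```python
import unicodedata

# Assumption: approximate the identifier properties by general category only.
START_CATS = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
CONTINUE_CATS = START_CATS | {"Mn", "Mc", "Nd", "Pc"}


def is_start(ch):
    """Hypothetical stand-in for the ID_Start property."""
    return unicodedata.category(ch) in START_CATS


def is_continue(ch):
    """Hypothetical stand-in for the ID_Continue property."""
    return unicodedata.category(ch) in CONTINUE_CATS


def survives_as_xid_start(ch):
    """Rule 1: the NFKC form must look like <Start Continue*>."""
    if not is_start(ch):
        return False
    nfkc = unicodedata.normalize("NFKC", ch)
    return is_start(nfkc[0]) and all(is_continue(c) for c in nfkc[1:])


def survives_as_xid_continue(ch):
    """Rule 2: the NFKC form must look like <Continue+>."""
    if not is_continue(ch):
        return False
    nfkc = unicodedata.normalize("NFKC", ch)
    return all(is_continue(c) for c in nfkc)


if __name__ == "__main__":
    for ch in ("A", "\u0e33", "\u037a"):
        print(f"U+{ord(ch):04X}",
              survives_as_xid_start(ch),
              survives_as_xid_continue(ch))
```

For example, U+0E33 (THAI CHARACTER SARA AM) is a letter, but its NFKC form is U+0E4D U+0E32, which begins with a nonspacing mark; under this sketch it fails rule 1 and passes rule 2.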
In any case, I would appreciate it if the precise rules for the computation
of derived properties were prominently published. For ID_Start, UAX#31
is given as a reference, yet UAX#31 does not precisely explain how it
is computed (in particular, it mentions the term "stability extensions"
twice without ever defining it). Fortunately, DerivedCoreProperties.txt
does give a precise specification of ID_Start, namely as
    Generated from Lu+Ll+Lt+Lm+Lo+Nl+Other_ID_Start
The same is not true for XID_Start (which is, surprisingly, specified
on the same line as ID_Start, with the same "General Description of
Coverage", even though it is clear from reading that the two are meant
to be different). If the rules quoted above (or Asmus' version of them)
could be added there, that would be much appreciated.
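As a rough illustration of how directly the ID_Start formula from DerivedCoreProperties.txt translates into code, here is a minimal Python sketch. The `OTHER_ID_START` set is an assumption for illustration only; the authoritative contents are in PropList.txt for the Unicode version in use, and the function name is hypothetical.

```python
import unicodedata

# General categories named in the formula: Lu+Ll+Lt+Lm+Lo+Nl
LETTER_CATS = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}

# Assumption: illustrative Other_ID_Start members; consult PropList.txt
# for the exact list in a given Unicode version.
OTHER_ID_START = {0x2118, 0x212E, 0x309B, 0x309C}


def is_id_start(ch):
    """ID_Start per the formula: letter categories plus Other_ID_Start."""
    return (unicodedata.category(ch) in LETTER_CATS
            or ord(ch) in OTHER_ID_START)


if __name__ == "__main__":
    for ch in ("a", "\u2160", "\u2118", "1", "_"):
        print(f"U+{ord(ch):04X}", is_id_start(ch))
```

No comparably short sketch can be written for XID_Start from the published description alone, which is the gap described above.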
In case you wonder why I'm nit-picking so much about this: I just specified
the syntax of non-ASCII identifiers for Python [1], and found the
description of XID_Start and XID_Continue so confusing that I at first
ignored them, and went for ID_Start/ID_Continue instead.
Now that I know the definition, I found it easier to copy the explicit
list of XID_Start/Continue characters into the Python parser instead
of trying to derive these properties at run-time.
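One way such an explicit list could be consulted is a sorted range table with a binary search. The sketch below is an assumption about the general technique, not a description of how the Python parser actually stores the data; the table is a small excerpt clipped to the Latin-1 range, and the names `XID_START_RANGES` and `in_table` are hypothetical.

```python
from bisect import bisect_right

# Illustrative excerpt only, clipped at U+00FF. A real table would be
# generated from DerivedCoreProperties.txt for one specific Unicode version.
XID_START_RANGES = [
    (0x0041, 0x005A), (0x0061, 0x007A), (0x00AA, 0x00AA), (0x00B5, 0x00B5),
    (0x00BA, 0x00BA), (0x00C0, 0x00D6), (0x00D8, 0x00F6), (0x00F8, 0x00FF),
]

# First code point of each range, used as the search key.
_STARTS = [lo for lo, _ in XID_START_RANGES]


def in_table(cp):
    """Return True if code point cp falls inside one of the ranges."""
    i = bisect_right(_STARTS, cp) - 1  # last range starting at or before cp
    return i >= 0 and cp <= XID_START_RANGES[i][1]


if __name__ == "__main__":
    for ch in ("A", "\u00e9", "\u00d7", "1"):
        print(f"U+{ord(ch):04X}", in_table(ord(ch)))
```

The lookup needs no normalization or category data at run-time, at the cost of regenerating the table for each new Unicode version.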
Regards,
Martin
[1] http://www.python.org/dev/peps/pep-3131/