Re: Specification for XID_Start and XID_Continue

From: Martin v. Löwis
Date: Wed Aug 15 2007 - 06:25:13 CDT

    > The characters that cause problems are not a fixed list; they are
    > programmatically detected and removed when the data is derived. This is
    > done by looking at the NFKC version of each character that qualifies as
    > part of an identifier (according to that particular version of Unicode)
    > and removing the character from XID when the result would not be
    > consistent. That is,
    > 1. when the character is XID_Start, the NFKC sequence has to be
    > <XID_Start XID_Continue*> or the character is removed
    > 2. when the character is XID_Continue, the NFKC sequence has to be
    > <XID_Continue+> or the character is removed.
    > Particular characters may be re-added (grandfathered) to ID to make the
    > definition be backwards compatible.
    > Does that help?
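
    The two rules quoted above can be sketched in Python roughly as
    follows. This is only an illustration under stated assumptions: the
    names approx_id_start, approx_id_continue, keeps_xid_start and
    keeps_xid_continue are hypothetical, the ID-level tests are
    general-category approximations that ignore Other_ID_Start and
    Other_ID_Continue (and therefore the U+00B7 step), and the NFKC
    result is checked against those approximations rather than against
    the final XID sets.

```python
import unicodedata

# Assumption: general-category approximations of ID_Start / ID_Continue.
# The real properties also include Other_ID_Start / Other_ID_Continue,
# which the unicodedata module does not expose.
START_CATS = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
CONTINUE_CATS = START_CATS | {"Mn", "Mc", "Nd", "Pc"}


def approx_id_start(ch):
    return unicodedata.category(ch) in START_CATS


def approx_id_continue(ch):
    return unicodedata.category(ch) in CONTINUE_CATS


def keeps_xid_start(ch):
    """Rule 1: NFKC(ch) must look like <Start Continue*>."""
    nfkc = unicodedata.normalize("NFKC", ch)
    return (
        approx_id_start(ch)
        and nfkc != ""
        and approx_id_start(nfkc[0])
        and all(approx_id_continue(c) for c in nfkc[1:])
    )


def keeps_xid_continue(ch):
    """Rule 2: NFKC(ch) must look like <Continue+>."""
    nfkc = unicodedata.normalize("NFKC", ch)
    return (
        approx_id_continue(ch)
        and nfkc != ""
        and all(approx_id_continue(c) for c in nfkc)
    )
```

    Under these approximations, U+037A (whose NFKC form begins with a
    space) fails both rules, while U+0E33 (whose NFKC form begins with
    the combining mark U+0E4D) fails rule 1 but passes rule 2.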

    Almost. I find the version that Asmus gave more precise, as it mentions
    that U+00B7 is added to XID_Continue as the first step. Also, it's
    puzzling to read that characters may be added to ID - this is a derived
    property, so you can't add to it (although you can add to Other_ID_Start
    and Other_ID_Continue, which aren't derived).

    In any case, I would appreciate it if the precise rules for the computation
    of derived properties were prominently published. For ID_Start, UAX#31
    is given as a reference, yet UAX#31 does not precisely explain how it
    is computed (in particular, it mentions the term "stability extensions"
    twice without ever defining it). Fortunately, DerivedCoreProperties.txt
    does give a precise specification of ID_Start, namely as

       Generated from Lu+Ll+Lt+Lm+Lo+Nl+Other_ID_Start
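
    Read literally, that formula is a set union, which might be
    sketched as below. The function name is_id_start and the
    other_id_start parameter are hypothetical; Python's unicodedata
    module does not expose Other_ID_Start, so the caller has to supply
    that set (for example from PropList.txt), and it defaults to empty
    here.

```python
import unicodedata

# General categories named in the "Generated from" line.
ID_START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}


def is_id_start(ch, other_id_start=frozenset()):
    """Return True if ch is in Lu+Ll+Lt+Lm+Lo+Nl or in other_id_start.

    other_id_start is an assumption of this sketch: a caller-supplied
    set of characters standing in for the Other_ID_Start property.
    """
    return (
        unicodedata.category(ch) in ID_START_CATEGORIES
        or ch in other_id_start
    )
```

    For instance, U+2118 SCRIPT CAPITAL P has a symbol category, so it
    only qualifies when it is passed in through other_id_start.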

    The same is not true for XID_Start (which is, surprisingly, specified
    in the same line as ID_Start, with the same "General Description of
    Coverage", yet it is clear from reading that they are meant to be
    different). If the above formula (or Asmus' version of it) could be
    added, that would be much appreciated.

    In case you wonder why I'm nitpicking about this: I just specified
    the syntax of non-ASCII identifiers for Python [1], and found the
    description of XID_Start and XID_Continue so confusing that I at first
    ignored them and went for ID_Start/ID_Continue instead.

    Now that I know the definition, I have found it easier to copy the explicit
    list of XID_Start/Continue characters into the Python parser instead
    of trying to derive these properties at run-time.
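
    For readers on a current Python 3, the effect of such a table can
    be observed through str.isidentifier(), which the language
    reference defines in terms of XID_Start and XID_Continue (with the
    underscore also allowed as a start). This is a small probe written
    for illustration, not the parser code referred to above, and the
    results depend on the Unicode version bundled with the interpreter.

```python
# Assumes Python 3. The sample strings are chosen for illustration only.
samples = ["caf\u00e9", "\u0e33", "a\u0e33", "\u037a", "1abc"]

# Map each sample to whether Python accepts it as an identifier.
results = {s: s.isidentifier() for s in samples}

for s, ok in results.items():
    print(repr(s), ok)
```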


