Re: Specification for XID_Start and XID_Continue

From: Mark Davis
Date: Tue Aug 14 2007 - 20:20:12 CDT

    The characters that cause problems are not a fixed list; they are
    programmatically detected and removed when the data is derived. This is done
    by looking at the NFKC version of each character that qualifies as part of
    an identifier (according to that particular version of Unicode) and removing
    the character from XID when the result would not be consistent. That is,

       1. when the character is XID_Start, the NFKC sequence has to be
       <XID_Start XID_Continue*> or the character is removed
       2. when the character is XID_Continue, the NFKC sequence has to be
       <XID_Continue+> or the character is removed.

    Particular characters may be re-added (grandfathered) to ID to keep the
    definition backwards compatible.
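
    The two rules above can be sketched roughly as follows. This is only an
    illustration, not the actual derivation used for the UCD data files. It
    assumes that ID_Start and ID_Continue can be approximated from general
    categories alone (ignoring Other_ID_Start, Other_ID_Continue, the
    Pattern_Syntax and Pattern_White_Space exclusions, and any grandfathered
    characters), and it checks the NFKC sequence against those approximate ID
    sets rather than against the final XID sets. All function names are
    hypothetical.

```python
import unicodedata

# Assumption: approximate ID_Start / ID_Continue by general category only.
# The real properties also involve Other_ID_Start, Other_ID_Continue and
# the Pattern_Syntax / Pattern_White_Space exclusions.
START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
CONTINUE_CATEGORIES = START_CATEGORIES | {"Mn", "Mc", "Nd", "Pc"}


def is_id_start(ch):
    """Approximate ID_Start test for a single character."""
    return unicodedata.category(ch) in START_CATEGORIES


def is_id_continue(ch):
    """Approximate ID_Continue test for a single character."""
    return unicodedata.category(ch) in CONTINUE_CATEGORIES


def is_xid_start(ch):
    """Rule 1: keep a start character only if its NFKC form looks like
    <Start Continue*>; otherwise it is removed."""
    if not is_id_start(ch):
        return False
    nfkc = unicodedata.normalize("NFKC", ch)
    if not nfkc:
        return False
    return is_id_start(nfkc[0]) and all(is_id_continue(c) for c in nfkc[1:])


def is_xid_continue(ch):
    """Rule 2: keep a continue character only if its NFKC form looks like
    <Continue+>; otherwise it is removed."""
    if not is_id_continue(ch):
        return False
    nfkc = unicodedata.normalize("NFKC", ch)
    return bool(nfkc) and all(is_id_continue(c) for c in nfkc)
```

    Under these assumptions, U+037A (whose NFKC form begins with a space)
    fails both checks, while U+0E33 THAI CHARACTER SARA AM (whose NFKC form
    begins with a combining mark) fails the start check but passes the
    continue check.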

    Does that help?


    On 8/14/07, "Martin v. Löwis" wrote:
    > I'm trying to locate the precise specification for the
    > XID_Start and XID_Continue properties. According to
    > they are derived properties, so there should be an
    > algorithm somewhere describing how they are computed
    > (given other properties). The UCD says that the
    > specification is in UAX#31, which says I should
    > read
    > However, looking at 5.1, I cannot find a precise
    > specification of these properties. For example,
    > 5.1.2 says "Certain characters...", but does not
    > seem to provide a complete list of such characters.
    > It ends with "In particular, the following four
    > characters...". Again, that reads like an example -
    > is it meant as a complete specification?
    > Likewise, 5.1.3 talks about "certain Arabic presentation
    > forms", without giving a complete list which precisely
    > are excluded from XID_Start and XID_Continue.
    > Any insights appreciated,
    > Martin

