Re: Proposed Draft UTR #31 - Syntax Characters

From: Peter Kirk
Date: Wed Aug 20 2003 - 19:21:07 EDT

    On 20/08/2003 11:23, Rick McGowan wrote:

    >This notice is relevant to anyone dealing with programming languages, query
    >specifications, regular expressions, scripting languages, and similar domains.
    >The Proposed Draft UTR #31: Identifier and Pattern Syntax will be discussed at
    >the UTC meeting next week. Part of that document (Section 4) is a proposal for
    >two new immutable properties, Pattern_White_Space and Pattern_Syntax. As
    >immutable properties, these would not ever change once they are introduced into
    >the standard, so it is important to get feedback on their contents beforehand.
    >The UTC will not be making a final determination on these properties at this
    >meeting, but it is important that any feedback on them is supplied as early in
    >the process as possible so that it can be considered thoroughly. The draft is
    >found at and feedback can be submitted as
    >described there.
    > Rick McGowan
    > Unicode, Inc.

    I'm a little concerned about the implications of counting zero-width
    characters like LRM and RLM as white space. They can easily find their
    way unnoticed into the middle of patterns, e.g. when copying from a text
    to which these characters have been added to ensure correct
    directionality. I wonder if it might be better to add a new category of
    ignored characters, such that one of these found on its own doesn't
    count as a separator, but is ignored, i.e. treated as part of the white
    space, if found adjacent to white space. Of course the details of this
    need a little more thought, e.g. whether one of these actually counts as
    part of the pattern, but I hope you see what I am getting at.
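
To make the suggestion concrete, here is a rough Python sketch of how a tokenizer might treat such an "ignored" category. Everything in it is an assumption for illustration only: the names split_pattern, IGNORED, WHITE_SPACE and strip_ignored are hypothetical, WHITE_SPACE is a small stand-in rather than the proposed Pattern_White_Space set, and the strip_ignored flag simply leaves open the question of whether an ignored character inside a token counts as part of the pattern.

```python
import re

# Assumption: only LRM (U+200E) and RLM (U+200F) are in the "ignored" category.
IGNORED = "\u200e\u200f"
# Assumption: a small stand-in for white space, not the full proposed property.
WHITE_SPACE = " \t\n\r"

# A separator must contain at least one real white-space character; ignored
# characters adjacent to it are absorbed into the separator.
_SEPARATOR = re.compile(
    f"[{IGNORED}]*[{WHITE_SPACE}][{WHITE_SPACE}{IGNORED}]*"
)
# Translation table that deletes every ignored character.
_DROP_IGNORED = str.maketrans("", "", IGNORED)


def split_pattern(text, strip_ignored=True):
    """Split text into tokens; a lone ignored character never separates."""
    tokens = [t for t in _SEPARATOR.split(text) if t]
    if strip_ignored:
        # One possible answer to the open question: drop ignored characters
        # that ended up inside a token.
        tokens = [t.translate(_DROP_IGNORED) for t in tokens]
    return [t for t in tokens if t]


# A stray LRM in the middle of a token does not split it.
print(split_pattern("foo\u200ebar baz"))  # ['foobar', 'baz']
# An LRM next to white space is just part of the separator.
print(split_pattern("foo \u200e bar"))    # ['foo', 'bar']
```

If LRM and RLM were instead counted as white space outright, the first example would split into three tokens, which is the silent breakage I am worried about.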

    Peter Kirk
