Re: Proposed Draft UTR #31 - Syntax Characters

From: Mark Davis
Date: Thu Aug 21 2003 - 11:38:17 EDT

    I suspect your distinction is a bit too subtle to be useful. Having, for
    example, an RLM take effect only when adjacent to a space in a regular
    expression would be quite error-prone, especially since the character would be

    The reason for allowing LRM and RLM is to be able to make patterns readable. If
    you have some syntax like
    (where the uppercase represents Hebrew), then bidi display of the neutrals
    renders the pattern almost completely illegible. Inserting LRMs or RLMs at
    appropriate points straightens out the display. In a special "pattern UI", one
    could override the (or some) neutrals to have a strong direction, but most
    patterns are viewed and edited in plaintext editors.
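
    As a rough illustration of this idea, the sketch below strips LRM and RLM
    from a pattern before compiling it, so the marks serve only as display aids.
    The helper names strip_marks and compile_readable are hypothetical, and
    Python's re module is only a stand-in here: it does not ignore these
    characters on its own.

```python
import re

LRM = "\u200E"  # LEFT-TO-RIGHT MARK
RLM = "\u200F"  # RIGHT-TO-LEFT MARK

def strip_marks(pattern: str) -> str:
    """Drop raw LRM/RLM characters (assumed policy: they are display-only)."""
    return pattern.replace(LRM, "").replace(RLM, "")

def compile_readable(pattern: str) -> "re.Pattern":
    """Compile a pattern after removing the directional marks."""
    return re.compile(strip_marks(pattern))

# Two Hebrew alternatives separated by the neutral "|", with LRMs added
# around the "|" purely to straighten out the bidi display.
pattern = "\u05D0\u05D1" + LRM + "|" + LRM + "\u05D2\u05D3"
rx = compile_readable(pattern)
```

    Under this policy, a mark that really is meant to be matched has to be
    written as an escape (for example \u200F), which survives the stripping.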

    My recommendation for pattern syntax would be to quote all
    Default_Ignorable_Code_Points if they are actually to be part of literals.
    Otherwise the maintenance of such regular expressions (or queries, or rules,
    etc.) becomes quite difficult, since the DICPs are invisible by default.
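
    A minimal sketch of what such quoting might look like follows. It assumes a
    \uXXXX escape syntax and uses only a small hand-picked subset of
    Default_Ignorable_Code_Point; a real implementation would take the full set
    from the Unicode Character Database. The names IGNORABLE_SUBSET,
    quote_ignorables and find_raw_ignorables are hypothetical.

```python
# Illustrative subset only (assumption): soft hyphen, ZWSP, ZWNJ, ZWJ,
# LRM, RLM, word joiner, and zero width no-break space.
IGNORABLE_SUBSET = {0x00AD, 0x200B, 0x200C, 0x200D,
                    0x200E, 0x200F, 0x2060, 0xFEFF}

def quote_ignorables(literal: str) -> str:
    """Replace each listed invisible character with a visible \\uXXXX escape."""
    out = []
    for ch in literal:
        cp = ord(ch)
        out.append("\\u%04X" % cp if cp in IGNORABLE_SUBSET else ch)
    return "".join(out)

def find_raw_ignorables(pattern: str) -> list:
    """Report (index, code point) for each raw invisible character in a pattern."""
    return [(i, "U+%04X" % ord(ch))
            for i, ch in enumerate(pattern)
            if ord(ch) in IGNORABLE_SUBSET]
```

    A maintenance tool could run something like find_raw_ignorables over stored
    patterns to flag invisible characters that were pasted in unnoticed.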

    ► “Eppur si muove” ◄

    ----- Original Message -----
    From: "Peter Kirk" <>
    To: "Rick McGowan" <>
    Cc: <>
    Sent: Wednesday, August 20, 2003 16:21
    Subject: Re: Proposed Draft UTR #31 - Syntax Characters

    > On 20/08/2003 11:23, Rick McGowan wrote:
    > >This notice is relevant to anyone dealing with programming languages, query
    > >specifications, regular expressions, scripting languages, and similar
    > >
    > >The Proposed Draft UTR #31: Identifier and Pattern Syntax will be discussed at
    > >the UTC meeting next week. Part of that document (Section 4) is a proposal for
    > >two new immutable properties, Pattern_White_Space and Pattern_Syntax. As
    > >immutable properties, these would not ever change once they are introduced into
    > >the standard, so it is important to get feedback on their contents
    > >
    > >The UTC will not be making a final determination on these properties at this
    > >meeting, but it is important that any feedback on them is supplied as early in
    > >the process as possible so that it can be considered thoroughly. The draft is
    > >found at and feedback can be submitted as
    > >described there.
    > >
    > >Regards,
    > > Rick McGowan
    > > Unicode, Inc.
    > >
    > I'm a little concerned at the implications of counting zero width
    > characters like LRM and RLM as white space. They can easily find their
    > way unnoticed into the middle of patterns e.g. when copying from a text
    > which has added these characters to ensure correct directionality. I
    > wonder if it might be better to add a new category of ignored
    > characters, such that one of these found on its own doesn't count as a
    > separator but it is ignored i.e. treated as part of the white space if
    > found adjacent to white space. Of course the details of this need a
    > little more thought, e.g. does one of these actually count as part of
    > the pattern, but I hope you see what I am getting at.
    > --
    > Peter Kirk
    > (personal)
    > (work)

    This archive was generated by hypermail 2.1.5 : Thu Aug 21 2003 - 12:50:01 EDT