From: Mark Davis (mark.davis@jtcsv.com)
Date: Thu Aug 21 2003 - 11:38:17 EDT
I suspect your distinction is a bit too subtle to be useful. Having, for
example, an RLM only take effect when adjacent to a space in a regular
expression would be pretty prone to error, especially since the character
would be invisible.
The reason for allowing LRM and RLM is to be able to make patterns readable. If
you have some syntax like
/AB?(\p{letter})*A\n.../
(where the uppercase represents Hebrew), then bidi display of the neutrals
renders the pattern almost completely illegible. Inserting LRMs or RLMs at
appropriate points straightens out the display. In a special "pattern UI", one
could override the (or some) neutrals to have a strong direction, but most
patterns are viewed and edited in plaintext editors.
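As a rough illustration of "inserting LRMs at appropriate points", here is a
minimal Python sketch. The function name and the placement rule (one LRM after
each run of right-to-left letters) are assumptions for illustration only, not
anything specified in the draft, and real patterns may need marks elsewhere.

```python
import unicodedata

LRM = "\u200e"  # LEFT-TO-RIGHT MARK

def add_marks_for_display(pattern: str) -> str:
    # Hypothetical helper: append an LRM after every run of strong
    # right-to-left characters (bidi classes R and AL) that is followed
    # by some other character, so the neutrals after the run are laid
    # out left to right in a left-to-right paragraph.
    out = []
    prev_rtl = False
    for ch in pattern:
        rtl = unicodedata.bidirectional(ch) in ("R", "AL")
        if prev_rtl and not rtl:
            out.append(LRM)
        out.append(ch)
        prev_rtl = rtl
    return "".join(out)

# Hebrew alef and bet stand in for the uppercase "AB" in the example above.
marked = add_marks_for_display("\u05d0\u05d1?(x)*")
```

This only makes sense if the pattern parser then ignores the unquoted marks;
otherwise the inserted characters change what the pattern matches.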
My recommendation for pattern syntax would be to quote all
Default_Ignorable_Code_Points if they are actually to be part of literals.
Otherwise the maintenance of such regular expressions (or queries, or rules,
etc.) becomes quite difficult, since the DICP are invisible by default.
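A minimal sketch of that recommendation in Python, assuming a hypothetical
preprocessing step in front of an ordinary regex engine. The helper names are
made up, and the set below covers only LRM and RLM rather than all
Default_Ignorable_Code_Points.

```python
import re

# Assumption: only LRM and RLM are handled here, as a stand-in for the
# full Default_Ignorable_Code_Point set.
IGNORABLE = {"\u200e", "\u200f"}

def strip_ignorable(pattern: str) -> str:
    # Drop raw (invisible) marks from the pattern source. A quoted mark,
    # written as a visible backslash-u escape, is untouched.
    return "".join(ch for ch in pattern if ch not in IGNORABLE)

def quote_ignorable(literal: str) -> str:
    # Rewrite raw marks in a literal as visible backslash-u escapes.
    # Other regex metacharacters are left alone; this is not re.escape.
    return "".join(
        f"\\u{ord(ch):04X}" if ch in IGNORABLE else ch for ch in literal
    )

# A mark added only for display is ignored by the matcher.
display_only = re.compile(strip_ignorable("a\u200e+b"))

# A mark that is meant to be matched is quoted, so it stays visible in
# the source and survives the stripping step.
literal_mark = re.compile(strip_ignorable(quote_ignorable("x\u200fy")))
```

With this split, a reader of the pattern source can always see which marks are
significant, because the significant ones are the only ones spelled out.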
Mark
__________________________________
http://www.macchiato.com
► “Eppur si muove” ◄
----- Original Message -----
From: "Peter Kirk" <peterkirk@qaya.org>
To: "Rick McGowan" <rick@unicode.org>
Cc: <unicode@unicode.org>
Sent: Wednesday, August 20, 2003 16:21
Subject: Re: Proposed Draft UTR #31 - Syntax Characters
> On 20/08/2003 11:23, Rick McGowan wrote:
>
> >This notice is relevant to anyone dealing with programming languages, query
> >specifications, regular expressions, scripting languages, and similar
> >domains.
> >
> >The Proposed Draft UTR #31: Identifier and Pattern Syntax will be discussed
> >at the UTC meeting next week. Part of that document (Section 4) is a
> >proposal for two new immutable properties, Pattern_White_Space and
> >Pattern_Syntax. As immutable properties, these would not ever change once
> >they are introduced into the standard, so it is important to get feedback
> >on their contents beforehand.
> >
> >The UTC will not be making a final determination on these properties at this
> >meeting, but it is important that any feedback on them is supplied as early
> >in the process as possible so that it can be considered thoroughly. The
> >draft is found at http://www.unicode.org/reports/tr31/ and feedback can be
> >submitted as described there.
> >
> >Regards,
> > Rick McGowan
> > Unicode, Inc.
> >
> I'm a little concerned at the implications of counting zero width
> characters like LRM and RLM as white space. They can easily find their
> way unnoticed into the middle of patterns e.g. when copying from a text
> which has added these characters to ensure correct directionality. I
> wonder if it might be better to add a new category of ignored
> characters, such that one of these found on its own doesn't count as a
> separator but it is ignored i.e. treated as part of the white space if
> found adjacent to white space. Of course the details of this need a
> little more thought, e.g. does one of these actually count as part of
> the pattern, but I hope you see what I am getting at.
>
> --
> Peter Kirk
> peter@qaya.org (personal)
> peterkirk@qaya.org (work)
> http://www.qaya.org/
>
This archive was generated by hypermail 2.1.5 : Thu Aug 21 2003 - 12:50:01 EDT