Re: New Public Review Issue: Proposed Update UTS #18

From: Mike (mike-list@pobox.com)
Date: Thu Sep 20 2007 - 12:11:35 CDT

Next message: Mark Davis: "Re: Normalization in panlingual application"

Previous message: Asmus Freytag: "Re: Normalization in panlingual application"
In reply to: Rick McGowan: "New Public Review Issue: Proposed Update UTS #18"
Next in thread: Andy Heninger: "Re: New Public Review Issue: Proposed Update UTS #18"
Reply: Andy Heninger: "Re: New Public Review Issue: Proposed Update UTS #18"

> Issue #111 Proposed Update UAX #18: Unicode Regular Expressions
>
> http://www.unicode.org/reports/tr18/tr18-12.html
>
> This proposed update clarifies conformance requirements for "." and CRLF.
> Public feedback is invited.

I disagree with the MUSTs in the proposed text. In my implementation,
whether "." matches newline sequences is independent of "multiline
mode." Multiline mode affects only the behavior of ^ and $, not ".".
In single-line mode, ^ matches only at the beginning of the text and
$ matches only at the end of the text (or just before a final newline
sequence). In multiline mode, ^ matches at the beginning of the string
or after any newline sequence, and $ matches before any newline
sequence or at the end of the string.
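
As a rough illustration of the two settings being independent, here is
a minimal sketch using Python's re module. This is not the
implementation described in this message; re.MULTILINE and re.DOTALL
merely happen to be separate flags in the same way, and Python's re
treats only "\n" as a newline rather than the full set of newline
sequences.

```python
import re

text = "one\ntwo\n"
pattern = r"^.+$"

# Neither flag: ^ matches only at the start, "." stops at "\n", and $
# matches only at the end (or before a final newline), so nothing matches.
neither = re.findall(pattern, text)

# MULTILINE changes only ^ and $; "." still does not cross "\n".
multiline_only = re.findall(pattern, text, flags=re.MULTILINE)

# DOTALL changes only "."; ^ and $ keep their single-line meaning.
dotall_only = re.findall(pattern, text, flags=re.DOTALL)

print(neither)         # []
print(multiline_only)  # ['one', 'two']
print(dotall_only)     # ['one\ntwo\n']
```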

You can turn on the DotMatchesNewline and MultilineMatching options
separately. As a side note, I implemented "." to match a default
grapheme cluster, so A + ACUTE is treated as a single entity, and
Hangul syllables are kept together (you can also specify them using
\L+\V+\T* if you want). There is also a DotMatchesDefective option
(true by default) which determines whether . will match a defective
combining character sequence (or you can look specifically for
defective sequences using \F).
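
To make the idea of "." consuming a whole cluster concrete, here is a
deliberately simplified sketch in Python. It is an assumption-laden
approximation, not the implementation described above: it treats a
cluster as one base character plus any following characters with a
non-zero canonical combining class, and it ignores Hangul jamo and the
other default grapheme cluster rules. The function names
simple_clusters and is_defective are hypothetical.

```python
import unicodedata

def simple_clusters(text):
    """Split text into simplified clusters: a base character followed by
    any characters whose canonical combining class is non-zero."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch   # attach the mark to the preceding cluster
        else:
            clusters.append(ch)  # start a new cluster
    return clusters

def is_defective(cluster):
    """A cluster is defective here if it starts with a combining mark,
    i.e. it has no base character."""
    return unicodedata.combining(cluster[0]) != 0

# A + COMBINING ACUTE ACCENT is kept together as a single unit.
print(simple_clusters("A\u0301b") == ["A\u0301", "b"])  # True
# A leading combining mark forms a defective sequence.
print(is_defective(simple_clusters("\u0301a")[0]))      # True
```

A matcher built on this could let "." consume one element of
simple_clusters(text) at a time, and skip defective elements when an
option like DotMatchesDefective is turned off.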

> If you have comments for official UTC consideration, please post them by
> submitting your comments through our feedback & reporting page:
>
> http://www.unicode.org/reporting.html

A few months ago I reported a problem with UTS #18 using this page,
but I never received any confirmation other than that the web server
apparently accepted my message. The problem I reported was not
addressed in this new update, so I don't have a lot of confidence in
this method of reporting problems. Here is what I submitted:

In Section 2.2 which discusses Default Grapheme Clusters, it says:

     A typical implementation of the inverse of a set containing
     literal clusters simply removes those strings, thus
     [^a-z ñ \q{ch} \q{ll} \q{rr}] is equivalent to [^a-z ñ].

I think this is bad implementation advice, and it leads to strange
behavior. In the example given, the behavior will be correct since
all of the clusters begin with a letter also contained in the class.
However, if you consider a character class containing only clusters,
e.g. [^\q{ch} \q{ll} \q{rr}], simply removing the clusters leaves an
empty negated character class, which matches -anything-. This is
incorrect behavior: it should not match at the beginning of the word
"chile", for instance.

The way I implemented this was to create a "normal" (non-negated)
character class containing all the listed characters and grapheme
clusters, and then invert the result of the match operation. For the
classes above, the normal class matches "ch" at the first position of
"chile", so the negated class returns a "no match" result there.
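
The difference between the two strategies can be sketched in a few
lines of Python. This is only an illustration under stated
assumptions, not the actual implementation: the function names and the
representation of a class as a set of single characters plus a set of
cluster strings are hypothetical, and how much text a successful
negated match should consume is left open.

```python
def match_positive(chars, clusters, text, pos):
    """Length matched by the non-negated class at pos, or 0 for no match.
    Longer clusters are tried before single characters."""
    for cluster in sorted(clusters, key=len, reverse=True):
        if text.startswith(cluster, pos):
            return len(cluster)
    if pos < len(text) and text[pos] in chars:
        return 1
    return 0

def negated_naive(chars, clusters, text, pos):
    """Drop the clusters, then negate the remaining single characters."""
    return pos < len(text) and text[pos] not in chars

def negated_inverted(chars, clusters, text, pos):
    """Match the full positive class first, then invert the outcome."""
    return pos < len(text) and match_positive(chars, clusters, text, pos) == 0

chars, clusters = set(), {"ch", "ll", "rr"}   # [^\q{ch} \q{ll} \q{rr}]

print(negated_naive(chars, clusters, "chile", 0))     # True  (matches anything)
print(negated_inverted(chars, clusters, "chile", 0))  # False (no match at "ch")
print(negated_inverted(chars, clusters, "hile", 0))   # True
```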

Mike
