Re: New Public Review Issue: Proposed Update UTS #18

From: Asmus Freytag (
Date: Fri Oct 05 2007 - 12:01:04 CDT

    On 10/4/2007 8:44 PM, Mike wrote:
    >>>> In addition, the meaning of ranges in sets like [a-z] should also be
    >>>> consistent with the collation used...
    >>> I disagree with this. I think that having [a-z] magically
    >>> mean all characters in a particular language is asking for
    >>> trouble. In French, would you say that [a-z] should match
    >>> C WITH CEDILLA or A + ACUTE?
    >> Having that kind of support allows regexes to be written that match,
    >> say the top half of a list
    >> by using [a-k] etc. That's something that you can do in English
    >> today, but not in any other
    >> language. You need to decide whether extending regexes to other
    >> languages should allow
    >> such uses (in which case you think of collation elements and sorting
    >> order) or not.
    >> Depending on how many accented letters a language uses, writing the
    >> equivalent expression manually can be both tedious and error-prone.
    > The reason I think that [a-z] should only match the 26 code points
    > is that regular expressions are often used to match things like
    > domain name parts: [a-zA-Z0-9]([a-zA-Z0-9-]*[a-zA-Z0-9])? where
    > the allowed characters do not change depending on locale.
    Well, you would want to be very clear what locale you are using,
    including the 'neutral' or in POSIX terms, the "C" locale, which is the
    one to use for such identifiers...
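
As a minimal sketch of the locale-independent reading, the domain-label pattern quoted above can be tried in Python, whose `re` module treats a range such as [a-z] as a range of code points. The helper name `is_label` is made up for illustration and is not part of any proposal in this thread.

```python
import re

# The pattern quoted in the message, unchanged. Python compares range
# endpoints by code point, so [a-z] covers exactly U+0061..U+007A here.
LABEL = re.compile(r"[a-zA-Z0-9]([a-zA-Z0-9-]*[a-zA-Z0-9])?")

def is_label(s: str) -> bool:
    """Return True if the whole string matches the quoted pattern."""
    return LABEL.fullmatch(s) is not None

print(is_label("example"))      # plain ASCII label
print(is_label("xn--caf-dma"))  # hyphens allowed in the middle
print(is_label("café"))         # accented letter is outside [a-z]
print(is_label("-abc"))         # leading hyphen is rejected
```

Under this reading the result does not depend on the locale of the process running the match, which is the property wanted for identifiers.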
    > I agree that having an easy way to say "match any Swedish character",
    > or some range of the characters, would be useful; maybe this could be
    > done using something similar to the \p{} syntax for properties? I
    > don't want to propose anything since I haven't studied it enough yet.
    >>> It's my opinion that ranges inside [] should be simple binary
    >>> order. If you want to do anything fancier, there should be
    >>> new syntax for it.
    >> That, or an option?
    > I would be ok with it being an option.
    And the option would have to specify *which* locale.
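
To illustrate why such an option would need to name a locale, here is a rough sketch that compares a code point range with a range taken from an explicit alphabet. The hard-coded Swedish alphabet string and both function names are assumptions made for this example; a real implementation would draw on collation data and would also have to decide what to do with letters such as é that are not in the basic alphabet.

```python
# Assumption: a hand-written Swedish alphabet stands in for real collation data.
SWEDISH = "abcdefghijklmnopqrstuvwxyzåäö"

def in_range_binary(ch: str, lo: str, hi: str) -> bool:
    """Range membership by code point, i.e. simple binary order."""
    return ord(lo) <= ord(ch) <= ord(hi)

def in_range_alphabet(ch: str, lo: str, hi: str, alphabet: str = SWEDISH) -> bool:
    """Range membership by position in a given alphabet (hypothetical helper)."""
    if len(ch) != 1 or ch not in alphabet:
        return False
    return alphabet.index(lo) <= alphabet.index(ch) <= alphabet.index(hi)

# U+00E4 (ä) precedes U+00E5 (å) in code point order, but the Swedish
# alphabet ends with å, ä, ö, so the two readings of [å-ö] disagree.
print(in_range_binary("ä", "å", "ö"))
print(in_range_alphabet("ä", "å", "ö"))
```

The same range expression gives different answers under the two orderings, so a pattern that relies on the tailored order is only meaningful once the locale is stated.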
    >> Now, other than for canonical decompositions (and conjoining Jamo),
    >> I've not seen an example that informs me of why it is useful for a
    >> regex package to be able to match 'ch' as if it were a single code
    >> point. Can somebody please present a simple example that shows an
    >> important use case that can't be realized if regexes are limited to a
    >> single character (plus *canonical* equivalents).
    > I don't know the reason -- I just implemented all the features
    > required for level 1 and level 2 conformance, and part of level 2
    > is being able to do this.
    I'm still waiting to hear from anyone else on a rationale for this.
    >> After all, the atomic elements for writing would be the 'c' and 'h',
    >> it is only for the purpose of some other text operations that 'ch'
    >> are (sometimes) considered a unit.
    > I used to be fluent in written Spanish, but despite that, I never
    > considered ch, ll, or rr to be single characters. I think I did
    > a Spanish crossword once where ch went into a single square.
    Right. If users know to type more than one character to get the cluster
    then it would be normal to reflect that in the regexp by its atoms. Only
    if you need to search the dictionary for '5-letter words' where "ch" is
    a letter, would you need such a feature - and I can't really see a use
    case for it.
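
For concreteness, the "5-letter words" case could be sketched as follows. The assumption here is that "ch" and "ll" count as single letters, as in the traditional Spanish alphabet; the names `letters` and `has_n_letters` are hypothetical.

```python
import re

# Assumption: "ch" and "ll" are each one letter. The alternation tries the
# digraphs first and falls back to a single character.
UNIT = re.compile(r"ch|ll|.", re.DOTALL)

def letters(word: str) -> list[str]:
    """Split a word into letters, keeping each digraph as one unit."""
    return UNIT.findall(word.lower())

def has_n_letters(word: str, n: int) -> bool:
    return len(letters(word)) == n

print(letters("noches"))           # six code points, fewer letters
print(has_n_letters("noches", 5))
print(has_n_letters("calle", 5))
```

Note that the pattern itself is still written from the atoms 'c', 'h' and 'l'; only the counting treats the cluster as a unit.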

    Furthermore, in many languages where there are clusters like that, they
    are not absolute. In Danish, "AA" is often present because of being the
    older spelling of what is now written with A with ring, but in compounds
    like "dataanalyse" (data analysis) its presence is accidental.

    \q(aa) could misfire badly in all Danish texts unless they were prepped
    by inserting a SHY in between the two a's at a compound word boundary.
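
The misfire can be shown with a small sketch that uses a plain two-character match as a stand-in for the cluster feature, since Python's `re` has no \q construct. The function name `find_aa` is made up for this example.

```python
import re

SHY = "\u00ad"  # SOFT HYPHEN
AA = re.compile(r"aa")

def find_aa(text: str) -> list[int]:
    """Return the start offsets of every literal 'aa' in the text."""
    return [m.start() for m in AA.finditer(text)]

print(find_aa("aarhus"))                  # genuine older spelling of the ring vowel
print(find_aa("dataanalyse"))             # accidental: "data" + "analyse"
print(find_aa("data" + SHY + "analyse"))  # prepped text, no adjacent a's
```

Without the prepping step the matcher cannot tell the accidental sequence from the intended cluster.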

    > Mike
