Re: New Public Review Issue: Proposed Update UTS #18

From: Mike (mike-list@pobox.com)
Date: Thu Oct 04 2007 - 22:44:43 CDT


    >>> In addition, the meaning of ranges in sets like [a-z] should also be
    >>> consistent with the collation used...
    >>
    >> I disagree with this. I think that having [a-z] magically
    >> mean all characters in a particular language is asking for
    >> trouble. In French, would you say that [a-z] should match
    >> C WITH CEDILLA or A + ACUTE?
    > Having that kind of support allows regexes to be written that match, say,
    > the top half of a list by using [a-k] etc. That's something that you can
    > do in English today, but not in any other language. You need to decide
    > whether extending regexes to other languages should allow such uses (in
    > which case you think of collation elements and sorting order) or not.
    >
    > Depending on how many accented letters a language uses, writing the
    > equivalent expression manually can be both tedious and error-prone.

    The reason I think that [a-z] should only match the 26 code points
    is that regular expressions are often used to match things like
    domain name parts: [a-zA-Z0-9]([a-zA-Z0-9-]*[a-zA-Z0-9])? where
    the allowed characters do not change depending on locale.
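
A minimal sketch of the domain-label case, assuming Python's built-in re module (where a range such as [a-z] in a str pattern is a plain code point range). The helper name is_label is made up for illustration and is not part of any standard.

```python
import re

# The pattern quoted in the message: one domain name label.
# [a-z] here means exactly the code points U+0061..U+007A, regardless of locale.
LABEL = re.compile(r"[a-zA-Z0-9]([a-zA-Z0-9-]*[a-zA-Z0-9])?")


def is_label(text):
    """Return True if the whole string matches the label pattern (hypothetical helper)."""
    return LABEL.fullmatch(text) is not None


if __name__ == "__main__":
    for candidate in ["example", "ex-ample", "-bad", "bad-", "fa\u00e7ade"]:
        print(repr(candidate), is_label(candidate))
```

If [a-z] followed the collation of the current locale, the last candidate (with C WITH CEDILLA) could start matching on a French system, which is exactly what a validator like this must not do.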

    I agree that having an easy way to say "match any Swedish character",
    or some range of the characters, would be useful; maybe this could be
    done using something similar to the \p{} syntax for properties? I
    don't want to propose anything since I haven't studied it enough yet.

    >> It's my opinion that ranges inside [] should be simple binary
    >> order. If you want to do anything fancier, there should be
    >> new syntax for it.
    > That, or an option?

    I would be ok with it being an option.
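
To make the "binary order by default, something else as an option" idea concrete, here is a rough sketch in Python. The function expand_range and its alphabet parameter are hypothetical; a real implementation would take its ordering from collation data rather than from a hand-written string. The Swedish alphabet below (a-z followed by å, ä, ö) is written out by hand as a stand-in.

```python
# Hand-written stand-in for a tailored ordering: the 29-letter Swedish alphabet.
SWEDISH = "abcdefghijklmnopqrstuvwxyz\u00e5\u00e4\u00f6"


def expand_range(lo, hi, alphabet=None):
    """Expand the range lo-hi into a list of characters.

    With alphabet=None the range is in simple binary (code point) order.
    With an alphabet string, the range follows that string's order instead.
    """
    if alphabet is None:
        return [chr(cp) for cp in range(ord(lo), ord(hi) + 1)]
    start, end = alphabet.index(lo), alphabet.index(hi)
    return list(alphabet[start:end + 1])


if __name__ == "__main__":
    # Binary order: everything between U+0061 and U+00F6, letters or not.
    print(len(expand_range("a", "\u00f6")))
    # Optional tailored order: just the letters of the alphabet.
    print("".join(expand_range("a", "\u00f6", SWEDISH)))
```

The default stays predictable for things like protocol syntax, while the tailored behavior has to be asked for explicitly.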

    > Now, other than for canonical decompositions (and conjoining Jamo), I've
    > not seen an example that informs me of why it is useful for a regex
    > package to be able to match 'ch' as if it were a single code point. Can
    > somebody please present a simple example that shows an important use
    > case that can't be realized if regexes are limited to a single character
    > (plus *canonical* equivalents)?

    I don't know the reason -- I just implemented all the features
    required for level 1 and level 2 conformance, and part of level 2
    is being able to do this.
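
For readers unsure what matching 'ch' "as if it were a single code point" would look like, the following is only an illustration in Python, not the implementation described above and not the mechanism UTS #18 specifies. The DIGRAPHS tuple and the split_units helper are assumptions made up for the example, using the traditional Spanish digraphs.

```python
import re

# Hypothetical list of multi-character units (traditional Spanish digraphs).
DIGRAPHS = ("ch", "ll", "rr")

# Try the digraphs first, then fall back to any single character.
UNIT = re.compile("|".join(DIGRAPHS) + "|.", re.DOTALL)


def split_units(text):
    """Split text into matching units, keeping each digraph as one unit."""
    return UNIT.findall(text)


if __name__ == "__main__":
    for word in ["chico", "calle", "perro"]:
        print(word, split_units(word))
```

Under this view a "match one unit" wildcard would consume 'ch' in one step, so "chico" counts as four units rather than five characters.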

    > After all, the atomic elements for writing would be the 'c' and 'h', it
    > is only for the purpose of some other text operations that 'ch' is
    > (sometimes) considered a unit.

    I used to be fluent in written Spanish, but despite that, I never
    considered ch, ll, or rr to be single characters. I think I did
    a Spanish crossword once where ch went into a single square.

    Mike


