Re: New Public Review Issue: Proposed Update UTS #18

From: Mike (mike-list@pobox.com)
Date: Tue Sep 25 2007 - 15:23:23 CDT

    >>> I don't think it will ever really be feasible to define regular
    >>> expressions in terms of specific languages, to the point of treating
    >>> combinations of two or more base characters as a single matchable
    >>> "character" on the basis that speakers of language X consider the
    >>> combination to be a single "letter."
    >>
    >> It is feasible, and I already have working code.
    >
    > Sorry, I made two huge mistakes in my earlier post:
    >
    > 1. I should never have thrown down the gauntlet to the regex mavens in
    > the first place.

    I had to look up maven in the dictionary, and it means (according to
    Princeton University), "someone who is dazzlingly skilled in any
    field." So I guess I should be flattered, but when I first read it,
    it sounded like an insult. In truth, I have been just a user of
    regular expressions (using the excellent pcre project) until May of
    this year, when I decided to try implementing them myself.

    > Dinking around with regular expressions is a popular
    > pastime; I'm sure lots of people really do think they have devised an
    > elegant language-dependent solution.

    I don't consider what I do to be "dinking around" as I'm sure you
    wouldn't say you dink around with language tags in describing your
    own work.

    > 2. I should have been much more clear: what I don't think is feasible
    > is to specify regexes in a language-dependent way, such that a certain
    > combination means different things depending on some sort of language
    > "mode." An example would be treating the sequence "[ch]" as a choice
    > between 'c' and 'h' in English, but as a single "letter" in Spanish or
    > Slovak or what have you.

    Nobody has suggested that "[ch]" would mean anything different in
    any language. To specify that you want to treat "ch" as a single
    character, you can use either [[.ch.]] or [\q{ch}]. The former
    is POSIX syntax, and I don't know who invented the \q notation.
    As I mentioned in a previous message, this functionality is
    *required* for level 2 conformance.
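
    As a rough illustration of the difference between "[ch]" and a class
    that contains "ch" as one element, here is a minimal sketch in Python.
    Python's standard re module does not implement [[.ch.]] or \q{ch}, so
    the helper class_with_strings (a name made up for this example) emulates
    [a-z\q{ch}] with an alternation that tries the multi-character strings
    first. It is an approximation under those assumptions, not how any
    particular engine implements the feature.

```python
import re

def class_with_strings(single_chars, strings):
    # Emulates a class such as [a-z\q{ch}]: the listed strings are tried
    # first (longest first), then the ordinary single-character class.
    # Hypothetical helper, not part of any library.
    alts = sorted(strings, key=len, reverse=True)
    parts = [re.escape(s) for s in alts]
    parts.append("[" + single_chars + "]")
    return "(?:" + "|".join(parts) + ")"

# "ch" is one element of the class, alongside the letters a-z.
pattern = re.compile(class_with_strings("a-z", ["ch"]))
tokens = pattern.findall("chico")

# A plain [ch] remains a choice between 'c' and 'h'.
plain = re.findall("[ch]", "chico")
```

    Here tokens splits "chico" into "ch", "i", "c", "o", while plain only
    ever yields the single letters "c" and "h".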

    > Note carefully that I used the word "feasible" and not the word
    > "possible." By adding more and more hair to the syntax, it becomes
    > "possible" to do just about anything imaginable with regexes, at
    > significant cost to clarity and elegance.

    I am very aware of the difference between "feasible" and "possible."
    When I design, I prefer to go by what is "useful" and "usable," and
    anything usable necessarily has to be clear and elegant.

    >> The specification uses as an example, [a-z\q{x\u0323}], which allows
    >> American Indians to treat x with an under dot as a single character
    >> even though there is no precomposed character for it.
    >
    > I did say "two or more base characters." Combining characters are a
    > different kettle of fish, and indeed your solution does make the most
    > sense for combining characters.

    Then I chose the wrong example from the spec. It also contains the
    character class example, [a-z\q{aa}] which allows Danish users to
    match "aa" as a single character.

    If you can specify that a character class matches specific grapheme
    clusters, I think that a natural extension of this is to be able to
    specify grapheme clusters that should be matched by "." (which is
    just a character class itself).
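
    The idea of letting "." match listed grapheme clusters can be sketched
    the same way. The function dot_with_clusters below is a hypothetical
    name for this example; it builds a replacement for "." that first tries
    the given clusters and then falls back to any single code point. It uses
    Python's re module only as a stand-in and says nothing about how a
    conforming engine would do this internally.

```python
import re

def dot_with_clusters(clusters):
    # A "." that also matches each listed cluster as one unit.
    # Longest clusters are tried first; plain "." is the fallback.
    alts = sorted(clusters, key=len, reverse=True)
    return "(?:" + "|".join(re.escape(c) for c in alts) + "|.)"

# Danish "aa" and x with a combining dot below (U+0323) as single units.
dot = dot_with_clusters(["aa", "x\u0323"])
text = "aax\u0323l"  # five code points, three user-perceived characters

as_clusters = re.fullmatch(dot + "{3}", text)   # aa, x+U+0323, l
as_code_points = re.fullmatch(".{3}", text)     # five code points, no match
```

    With the cluster-aware dot the text counts as three characters; with the
    ordinary "." it does not, because the string is five code points long.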

    > Now on the other hand, Andy Heninger wrote:
    >
    >> POSIX has defined exactly that, see
    >> http://www.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap09.html#tag_09_03_05
    >>
    >> "Collation Elements" are locale (language) specific multi-character
    >> sequences that can appear as set elements in bracket expressions.
    >> I'm not sure that it's a particularly good idea, but it has been done.
    >
    > It looks like this is defined in terms of *my* locale, which will
    > probably conform to English rules and will probably not include the line:
    >
    > collating-element <ch-digraph> from "<c><h>"
    >
    > whereas someone with a Traditional Spanish Sort locale might have this
    > line. This means the same text would match differently depending on who
    > is grepping it.

    I agree with you that behavior should be the same for all users.
    Who would argue otherwise? But would you say there shouldn't be
    a way to specify a locale to work with? I've thought about how
    I would do it and came up with \l{locale}. A regular expression
    without \l would behave in the normal language-independent mode,
    but if your expression was /\l{es}./, the Spanish locale would
    enable the . to match "ch", "ll", or "rr" as a single character.
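
    A minimal sketch of how a \l{locale} prefix could be handled is to
    rewrite the pattern before handing it to an existing engine. Everything
    here is an assumption made for illustration: the \l syntax is only the
    proposal above, the table LOCALE_CLUSTERS and the function expand_locale
    are invented names, and the table holds just the three digraphs named in
    this message. The scanner is deliberately simple and does not handle
    every bracket-expression edge case.

```python
import re

# Hypothetical table: only the digraphs mentioned in the message.
LOCALE_CLUSTERS = {"es": ["ch", "ll", "rr"]}

def expand_locale(pattern):
    # Without a leading \l{...} the pattern is returned unchanged,
    # i.e. it keeps the language-independent behavior.
    m = re.match(r"\\l\{([A-Za-z-]+)\}", pattern)
    if not m:
        return pattern
    clusters = LOCALE_CLUSTERS[m.group(1)]  # KeyError for unknown locales
    dot = "(?:" + "|".join(re.escape(c) for c in clusters) + "|.)"
    out, in_class, i = [], False, m.end()
    while i < len(pattern):
        c = pattern[i]
        if c == "\\" and i + 1 < len(pattern):
            out.append(pattern[i:i + 2])  # copy escapes such as \. as-is
            i += 2
            continue
        if c == "[":
            in_class = True
        elif c == "]":
            in_class = False
        # Only an unescaped "." outside a bracket expression is rewritten.
        out.append(dot if c == "." and not in_class else c)
        i += 1
    return "".join(out)

spanish = expand_locale(r"\l{es}....")
neutral = expand_locale("....")
```

    With these definitions, re.fullmatch(spanish, "perro") succeeds by
    reading p, e, rr, o as four characters, while re.fullmatch(neutral,
    "perro") fails because the word is five code points long. Since the
    locale is written into the expression itself, the result does not depend
    on the environment of whoever runs the match.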

    > What I had in mind as being infeasible was a way to specify the language
    > mode *in the regex itself*, so I could use "[ch]" against English text
    > with one meaning and use "[ch]" against traditional Spanish text with
    > the other meaning.

    I would agree that "[ch]" should never mean match "ch" as one
    character. You need to use [.ch.] or \q{ch} for that.

    Mike


