Re: New Public Review Issue: Proposed Update UTS #18

From: Doug Ewell
Date: Tue Sep 25 2007 - 01:23:21 CDT

    "Mike" <mike dash list at pobox dot com> wrote:

    >> I don't think it will ever really be feasible to define regular
    >> expressions in terms of specific languages, to the point of treating
    >> combinations of two or more base characters as a single matchable
    >> "character" on the basis that speakers of language X consider the
    >> combination to be a single "letter."
    > It is feasible, and I already have working code.

    Sorry, I made two huge mistakes in my earlier post:

    1. I should never have thrown down the gauntlet to the regex mavens in
    the first place. Dinking around with regular expressions is a popular
    pastime; I'm sure lots of people really do think they have devised an
    elegant language-dependent solution.

    2. I should have been much more clear: what I don't think is feasible
    is to specify regexes in a language-dependent way, such that a certain
    combination means different things depending on some sort of language
    "mode." An example would be treating the sequence "[ch]" as a choice
    between 'c' and 'h' in English, but as a single "letter" in Spanish or
    Slovak or what have you.

    Note carefully that I used the word "feasible" and not the word
    "possible." By adding more and more hair to the syntax, it becomes
    "possible" to do just about anything imaginable with regexes, at
    significant cost to clarity and elegance.

    > There is no avoiding it. Consider: [\uAC00-\uD7A3] which should match
    > any LV or LVT Hangul syllable. That character class needs to be able
    > to match any of the precomposed characters listed in the range, but
    > also must match any sequence of jamos that is canonically equivalent,
    > such as <U+1103 U+1167 U+11AB>.

    That solution would be specific to Korean, but would not be interpreted
    differently in a Korean-language context vs. a non-Korean-language
    context, which is how I should have phrased it.
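
    A minimal Python sketch of the canonical-equivalence point, assuming
    the matcher normalizes its input to NFC before applying the character
    class. This is only one possible approach and is not taken from the
    code described above; the helper name matches_syllable is
    hypothetical.

```python
import re
import unicodedata

# Character class covering the precomposed Hangul syllable block.
HANGUL_SYLLABLE = re.compile(r"[\uAC00-\uD7A3]")


def matches_syllable(text: str) -> bool:
    """Return True if text, once normalized to NFC, is one Hangul syllable."""
    normalized = unicodedata.normalize("NFC", text)
    return HANGUL_SYLLABLE.fullmatch(normalized) is not None


# Conjoining jamo sequence from the quoted example.
jamo = "\u1103\u1167\u11AB"

# Without normalization the class sees three code points, none in the range.
raw_match = HANGUL_SYLLABLE.fullmatch(jamo) is not None

# After NFC the sequence composes to a single precomposed syllable.
nfc_match = matches_syllable(jamo)
```

    The result does not depend on any language setting: the same class
    and the same input give the same answer for every user.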

    > The specification uses as an example, [a-z\q{x\u0323}], which allows
    > American Indians to treat x with an under dot as a single character
    > even though there is no precomposed character for it.

    I did say "two or more base characters." Combining characters are a
    different kettle of fish, and indeed your solution does make the most
    sense for combining characters.
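
    As a rough illustration, the effect of [a-z\q{x\u0323}] can be
    approximated in an engine that lacks \q{...}. The sketch below uses
    Python's re module, which has no such syntax, so the multi-code-point
    alternative is written as the first branch of an alternation. It is
    an emulation under that assumption, not the UTS #18 syntax itself.

```python
import re

# Emulates [a-z\q{x\u0323}]: try the two-code-point sequence first,
# then fall back to a single letter from a-z.
LETTER = re.compile(r"(?:x\u0323|[a-z])")

# 'a', then x followed by COMBINING DOT BELOW, then 'b'.
text = "ax\u0323b"

# Each item is one "letter" as the pattern sees it.
letters = LETTER.findall(text)
```

    Here x plus the combining dot comes back as a single item, and a bare
    x would still match through the [a-z] branch.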

    > I also allow you to put named character sequences in a character
    > class: [\N{KATAKANA LETTER AINU P}] and they always consist of
    > multiple code points, by definition.

    But again, the behavior is not different for different languages, right?

    Now on the other hand, Andy Heninger wrote:

    > POSIX has defined exactly that, see
    > "Collation Elements" are locale (language) specific multi-character
    > sequences that can appear as set elements in bracket expressions.
    > I'm not sure that it's a particularly good idea, but it has been done.

    It looks like this is defined in terms of *my* locale, which will
    probably conform to English rules and will probably not include the

    collating-element <ch-digraph> from "<c><h>"

    whereas someone with a Traditional Spanish Sort locale might have this
    line. This means the same text would match differently depending on who
    is grepping it.
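
    The locale dependence can be simulated with a small Python sketch.
    It does not call any real POSIX or locale API; the function names and
    the idea of passing the locale's multi-character collating elements
    as a plain list are assumptions made for illustration.

```python
def split_collating_elements(text, multi_char_elements=()):
    """Greedily split text into collating elements, longest candidates first."""
    candidates = sorted(multi_char_elements, key=len, reverse=True)
    elements = []
    i = 0
    while i < len(text):
        for cand in candidates:
            if text.startswith(cand, i):
                elements.append(cand)
                i += len(cand)
                break
        else:
            # No multi-character element starts here; take one character.
            elements.append(text[i])
            i += 1
    return elements


def count_bracket_matches(text, members, multi_char_elements=()):
    """Count the collating elements of text that belong to the set members."""
    elements = split_collating_elements(text, multi_char_elements)
    return sum(1 for element in elements if element in members)


# Simulated English-like locale: no multi-character collating elements.
english = split_collating_elements("chico")

# Simulated traditional Spanish locale: "ch" is one collating element.
spanish = split_collating_elements("chico", ["ch"])

# How many elements a set containing only 'c' would match in each case.
english_c = count_bracket_matches("chico", {"c"})
spanish_c = count_bracket_matches("chico", {"c"}, ["ch"])
```

    The same text yields two matches for 'c' under the first setting and
    one under the second, because the leading "ch" is consumed as a
    single unit.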

    What I had in mind as being infeasible was a way to specify the language
    mode *in the regex itself*, so I could use "[ch]" against English text
    with one meaning and use "[ch]" against traditional Spanish text with
    the other meaning.

    Doug Ewell * Fullerton, California, USA * RFC 4645 * UTN #14