Re: New Public Review Issue: Proposed Update UTS #18

From: Mike (
Date: Mon Sep 24 2007 - 10:25:37 CDT

    > I don't think it will ever really be feasible to define regular
    > expressions in terms of specific languages, to the point of treating
    > combinations of two or more base characters as a single matchable
    > "character" on the basis that speakers of language X consider the
    > combination to be a single "letter."

    It is feasible, and I already have working code.

    There is no avoiding it. Consider [\uAC00-\uD7A3], which should
    match any LV or LVT Hangul syllable. That character class needs
    to match any of the precomposed characters in the range, but it
    must also match any canonically equivalent sequence of jamo,
    such as <U+1103 U+1167 U+11AB>.
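
    To make the Hangul case concrete, here is a minimal Python sketch
    (not my actual implementation). It assumes that normalizing the
    candidate text to NFC is an acceptable way to test canonical
    equivalence; the function name matches_hangul_class is made up
    for the illustration.

```python
import unicodedata

# Bounds of the precomposed Hangul syllable range from the class above.
HANGUL_FIRST, HANGUL_LAST = "\uAC00", "\uD7A3"

def matches_hangul_class(text: str) -> bool:
    # Hypothetical helper: true if `text` is canonically equivalent to
    # exactly one character in the precomposed Hangul syllable range.
    composed = unicodedata.normalize("NFC", text)
    return len(composed) == 1 and HANGUL_FIRST <= composed <= HANGUL_LAST

jamo_sequence = "\u1103\u1167\u11AB"   # the sequence quoted above
print(matches_hangul_class(jamo_sequence))                  # True
print(unicodedata.normalize("NFC", jamo_sequence) == "\uB390")  # True
print(matches_hangul_class("\u1103"))                       # False: lone jamo
```

    A real matcher would not normalize the whole input up front; it
    would have to recognize the jamo sequence in place, which is the
    part that forces multi-code-point matching into the class itself.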

    The specification gives the example [a-z\q{x\u0323}], which
    allows American Indians to treat x with an underdot as a single
    character even though there is no precomposed character for it.
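
    One way to picture how such a class can match is sketched below
    in Python. This is an illustration under assumptions, not the
    specification's algorithm or my code: the class is modeled as a
    list of code point ranges plus a list of literal strings, the
    string elements are tried longest first, and normalization is
    ignored. The names match_class_at and split_into_elements are
    hypothetical.

```python
def match_class_at(text, pos, ranges, strings):
    # Return the index just past the match at `pos`, or -1 if none.
    # Multi-code-point elements are tried first, longest first, so
    # "x" + U+0323 is consumed as one unit rather than as a bare "x".
    for s in sorted(strings, key=len, reverse=True):
        if text.startswith(s, pos):
            return pos + len(s)
    if pos < len(text) and any(lo <= text[pos] <= hi for lo, hi in ranges):
        return pos + 1
    return -1

def split_into_elements(text, ranges, strings):
    # Split `text` into consecutive class matches; stop at the first miss.
    elements, pos = [], 0
    while pos < len(text):
        end = match_class_at(text, pos, ranges, strings)
        if end < 0:
            break
        elements.append(text[pos:end])
        pos = end
    return elements

# Model of [a-z\q{x\u0323}]
ranges = [("a", "z")]
strings = ["x\u0323"]

print(match_class_at("x\u0323y", 0, ranges, strings))   # 2
print(match_class_at("xy", 0, ranges, strings))         # 1
print(len(split_into_elements("ax\u0323z", ranges, strings)))  # 3
```

    The point of the sketch is only that a single class position can
    consume more than one code point.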

    I also allow you to put named character sequences in a character
    class, for example [\N{KATAKANA LETTER AINU P}]. By definition,
    a named sequence always consists of multiple code points.
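
    For readers who want to see what that name expands to, the short
    Python check below may help. It assumes Python 3.3 or later,
    where unicodedata.lookup resolves named sequences as well as
    single character names; it says nothing about how a regex engine
    should parse \N{...}.

```python
import unicodedata

# Resolve the named sequence to the string of code points it stands for.
sequence = unicodedata.lookup("KATAKANA LETTER AINU P")

# Show each code point in the usual U+XXXX notation.
code_points = [f"U+{ord(ch):04X}" for ch in sequence]
print(len(sequence), code_points)
```

    The lookup returns a string of more than one code point, so a
    class containing this element cannot be matched one code point
    at a time.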

