Re: New Public Review Issue: Proposed Update UTS #18

From: Mark Davis (mark.davis@icu-project.org)
Date: Mon Sep 24 2007 - 15:52:56 CDT


    Having named character sequences in \N is an interesting idea. Would you
    mind proposing that to the UTC using the online form? (That's the way to
    bring issues to the UTC's attention.)

    BTW, Andy and I concluded that the really effective way to do canonical
    equivalence in regex would be in a mode where grapheme cluster is the unit,
    not code point.
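
    As a rough illustration of why canonical equivalence matters here, using
    the Hangul example from the quoted message below: this sketch does not
    implement the grapheme-cluster mode described above. It assumes a simpler
    workaround, normalizing the input to NFC before matching, and uses only
    Python's standard library. The helper name matches_syllable is made up
    for this example.

```python
import re
import unicodedata

# Character class from the quoted message: precomposed Hangul syllables.
HANGUL_SYLLABLES = re.compile("[\uAC00-\uD7A3]")

def matches_syllable(text: str) -> bool:
    """Hypothetical helper: compose the input to NFC, then match the class."""
    composed = unicodedata.normalize("NFC", text)
    return HANGUL_SYLLABLES.fullmatch(composed) is not None

# Jamo sequence <U+1103 U+1167 U+11AB> from the quoted message.
jamo = "\u1103\u1167\u11AB"

# A plain code-point match sees three jamos, none of them in the range.
print(HANGUL_SYLLABLES.fullmatch(jamo) is not None)

# After NFC composition the sequence is one precomposed syllable.
print(matches_syllable(jamo))
```

    Normalizing first only covers sequences that have a precomposed form; a
    sequence such as x followed by U+0323 has none, which is one reason to
    make the grapheme cluster the unit of matching instead.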

    On the comment on "feasible" -- I think the reference there was to
    language/locale-sensitive regex. That involves a few things which are quite
    tricky, and are thus listed under Level 3 in UTS#18.

       - language-sensitive equivalence: "aa" matches a-ring (å) in Danish
       - language-sensitive ordering ranges: [a-z] doesn't include o-slash
       (ø) in Danish, where it sorts after z
       - language-sensitive grapheme clusters: a dot (.) matches "ch" as a
       single unit in Slovak
       - ...
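
    To make the grapheme-cluster item concrete, here is a minimal sketch of
    splitting text into language-tailored units, so that a "dot" would
    consume one unit rather than one code point. The function split_units and
    the SLOVAK_CLUSTERS table are assumptions made up for this example; a
    real implementation would take its tailorings from locale data, and this
    sketch compares case-sensitively.

```python
def split_units(text, clusters):
    """Split text into matching units, preferring the longest tailored cluster.

    `clusters` is a hypothetical set of multi-character strings that the
    language treats as single letters (an assumption for this sketch).
    """
    ordered = sorted(clusters, key=len, reverse=True)
    units = []
    i = 0
    while i < len(text):
        for cluster in ordered:
            if text.startswith(cluster, i):
                units.append(cluster)
                i += len(cluster)
                break
        else:
            # No tailored cluster starts here: fall back to one code point.
            units.append(text[i])
            i += 1
    return units

# Illustrative tailoring only, not taken from any locale database.
SLOVAK_CLUSTERS = {"ch"}

print(split_units("chata", SLOVAK_CLUSTERS))
print(split_units("chata", set()))
```

    With the tailoring, "ch" comes out as one unit; without it, the same text
    splits into five separate code points.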

    Few implementations try to handle locale sensitivity, apart from POSIX (and
    its approach has significant problems). I wouldn't say that these features
    are infeasible, but they are tricky.

    Mark

    On 9/24/07, Mike <mike-list@pobox.com> wrote:
    >
    > > I don't think it will ever really be feasible to define regular
    > > expressions in terms of specific languages, to the point of treating
    > > combinations of two or more base characters as a single matchable
    > > "character" on the basis that speakers of language X consider the
    > > combination to be a single "letter."
    >
    > It is feasible, and I already have working code.
    >
    > There is no avoiding it. Consider: [\uAC00-\uD7A3] which should
    > match any LV or LVT Hangul syllable. That character class needs
    > to be able to match any of the precomposed characters listed in
    > the range, but also must match any sequence of jamos that is
    > canonically equivalent, such as <U+1103 U+1167 U+11AB>.
    >
    > The specification uses as an example, [a-z\q{x\u0323}], which
    > allows American Indians to treat x with an under dot as a single
    > character even though there is no precomposed character for it.
    >
    > I also allow you to put named character sequences in a character
    > class: [\N{KATAKANA LETTER AINU P}] and they always consist of
    > multiple code points, by definition.
    >
    > Mike
    >
    >


