Unicode Regex Design (was Re: New Public Review Issue: Proposed Update UTS #18)

From: Mike (mike-list@pobox.com)
Date: Fri Sep 21 2007 - 17:07:19 CDT


    When I decided to implement Unicode regular expressions,
    I spent some time looking at Perl regular expressions,
    and a lot of time thinking about what opportunities there
    are for a Unicode version. Here is what I came up with
    for the "ultimate" Unicode regular expression syntax.

    Some of the Perl syntax is based on ASCII, such as \f \v
    \r \n, and doesn't have much use in Unicode. I decided to
    drop these in favor of a \n that represents any of the
    newline sequences; \f and \r are not used and \v is given
    a new use (see below). If you really need to look for
    a line feed character, you can specify it as \u000A or
    \N{LF}. Using \t for tab is still relevant, so I kept it.
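
    A minimal sketch of a "\n matches any newline sequence"
    rule, written in Python rather than in the syntax described
    here. The set of sequences is an assumption taken from the
    line boundary list in UTS #18 (CRLF, LF, VT, FF, CR, NEL,
    LS, PS); the post does not say exactly which sequences \n
    covers. The names NEWLINE and split_lines are hypothetical.

```python
import re

# CRLF must come first so that it is consumed as one newline, not two.
# The remaining members are LF, VT, FF, CR, NEL (U+0085), LS (U+2028), PS (U+2029).
NEWLINE = re.compile(r"\r\n|[\n\x0b\x0c\r\x85\u2028\u2029]")


def split_lines(text):
    """Split text on any of the newline sequences listed above."""
    return NEWLINE.split(text)
```

    With this, "a\r\nb\u2028c" splits into three lines, and a
    bare line feed can still be targeted explicitly as \u000A.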

    Another decision I made was to have '.' match a grapheme
    cluster such as A + ACUTE as a single entity. This is
    still experimental, since I don't have any way to get
    feedback on whether users would like it. I may turn it
    into an option, so you could have dot either match just
    a code point or a grapheme cluster. In any case, \c can
    be used to match a single code point. Some grapheme
    clusters are defective in that there is no base character,
    so I added an option to allow . to match them (true by
    default). You can also look for a defective combining
    character sequence using \F.
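
    To make the difference between a code point and a grapheme
    cluster concrete, here is a rough Python sketch. It is a
    simplification and not the default grapheme cluster
    algorithm of UAX #29: it assumes a cluster is one code
    point followed by any number of General_Category M*
    characters. The function names are hypothetical.

```python
import unicodedata


def is_mark(ch):
    """True for combining marks (General_Category Mn, Mc or Me)."""
    return unicodedata.category(ch).startswith("M")


def simple_clusters(text):
    """Group NFD text into approximate clusters: a code point plus trailing marks."""
    clusters = []
    for ch in unicodedata.normalize("NFD", text):
        if clusters and is_mark(ch):
            clusters[-1] += ch      # attach the mark to the preceding cluster
        else:
            clusters.append(ch)     # start a new cluster
    return clusters


def is_defective(cluster):
    """A cluster is defective here if it starts with a mark, i.e. has no base."""
    return is_mark(cluster[0])
```

    Under these assumptions, A + ACUTE comes back as a single
    cluster, and a combining mark at the very start of the text
    forms a defective cluster of its own.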

    I found that the word boundary (\b and \B), digit (\d and
    \D), and word character (\w and \W) were still useful, so
    these are retained (although they are more complex --
    word boundaries are based on the Word_Break property for

    Unicode has some features that deserve compact syntax
    such as:

       \a assigned code point
       \A unassigned code point
       \g default grapheme cluster boundary
       \G complement of \g
       \h hex digit
       \m combining character (equivalent to \p{M})
       \M complement of \m

    So, for example, you can search for any variant of the
    letter e with /e\m*/.
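
    The effect of /e\m*/ can be approximated in Python, whose
    standard re module has no \m or \p{M}. This sketch assumes
    the text is first converted to NFD (as described further
    down) and treats any General_Category M* character as
    matching \m. The function name is hypothetical.

```python
import unicodedata


def find_e_variants(text):
    """Return every 'e' plus its following combining marks, recomposed to NFC."""
    nfd = unicodedata.normalize("NFD", text)
    results = []
    i = 0
    while i < len(nfd):
        if nfd[i] == "e":
            j = i + 1
            # Consume the run of combining marks, i.e. the \m* part.
            while j < len(nfd) and unicodedata.category(nfd[j]).startswith("M"):
                j += 1
            results.append(unicodedata.normalize("NFC", nfd[i:j]))
            i = j
        else:
            i += 1
    return results
```

    Letters that look like an e but have no canonical
    decomposition onto e are not found by this approach.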

    I felt it was time to make reparations to the East Asian
    population for making it so difficult for them to use
    their native languages on computers, so I added syntax
    just for them:

       \i CJK ideograph
       \I Unified ideograph
       \K Katakana
       \H Hiragana
       \L leading jamo
       \V vowel jamo
       \T trailing jamo

    A Hangul syllable can be found using \L+\V+\T*. In
    my implementation, I convert both the pattern and text
    to search into NFD, so unfortunately \I is not nearly
    as useful as I had thought it would be -- most of the
    non-Unified characters canonically decompose into a
    unified ideograph. I'm hoping to figure out a solution
    to this problem (but it's an implementation issue, so
    I think having \I is still valid).
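
    Both points can be illustrated with a short Python sketch.
    It assumes \L, \V and \T correspond to the conjoining jamo
    ranges of the U+1100 block (U+1100..U+115F, U+1160..U+11A7,
    U+11A8..U+11FF); the real property-based classes may differ.
    The names SYLLABLE, hangul_syllables and
    decomposes_to_unified are hypothetical.

```python
import re
import unicodedata

# Rough equivalent of \L+\V+\T* using code point ranges (an assumption).
SYLLABLE = re.compile("[\u1100-\u115F]+[\u1160-\u11A7]+[\u11A8-\u11FF]*")


def hangul_syllables(text):
    """Find Hangul syllables in the NFD form of text, returned recomposed."""
    nfd = unicodedata.normalize("NFD", text)
    return [unicodedata.normalize("NFC", m.group()) for m in SYLLABLE.finditer(nfd)]


def decomposes_to_unified(ch):
    """True if NFD replaces ch with a different character."""
    return unicodedata.normalize("NFD", ch) != ch
```

    For example, the compatibility ideograph U+F900 becomes the
    unified ideograph U+8C48 under NFD, so after normalization
    there is nothing left for the complement of \I to find.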

    My code supports Unicode versions 3.2, 4.0, 4.1, and 5.0,
    so I added a way to specify which version to use for
    character properties:

       \v{version} e.g. /\v{4.1}\A+/
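
    Version-dependent properties can be sketched in Python,
    with one caveat: the standard unicodedata module only ships
    its current database plus a frozen Unicode 3.2 copy
    (unicodedata.ucd_3_2_0), so only the 3.2 case of \v{...}
    can be imitated. The sketch assumes "unassigned" means
    General_Category Cn; the function names are hypothetical.

```python
import unicodedata


def category_in(ch, version="current"):
    """General_Category of ch in Unicode 3.2 or in the interpreter's current data."""
    db = unicodedata.ucd_3_2_0 if version == "3.2" else unicodedata
    return db.category(ch)


def is_assigned(ch, version="current"):
    """Rough analogue of \\a: anything whose category is not Cn."""
    return category_in(ch, version) != "Cn"
```

    U+20B9 INDIAN RUPEE SIGN, added after 3.2, is unassigned
    when queried with version "3.2" and assigned otherwise.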

    \p and \P are similar to what you have defined, but as
    we've been discussing, I allow multiple values:
    and in some cases comparisons:
        \p{Numeric_Value>=10}, \p{ccc<230}
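
    The two comparison examples can be read as simple
    predicates on a code point. A Python sketch, using
    unicodedata.numeric for Numeric_Value and
    unicodedata.combining for the canonical combining class;
    the function names are hypothetical, and treating a
    character with no numeric value as a non-match is an
    assumption.

```python
import unicodedata


def numeric_value_at_least(ch, threshold):
    """Analogue of \\p{Numeric_Value>=threshold}."""
    value = unicodedata.numeric(ch, None)   # None when ch has no numeric value
    return value is not None and value >= threshold


def ccc_below(ch, limit):
    """Analogue of \\p{ccc<limit}; note that base characters have ccc 0."""
    return unicodedata.combining(ch) < limit
```

    U+2169 ROMAN NUMERAL TEN satisfies Numeric_Value>=10 while
    the digit 5 does not; COMBINING DOT BELOW (ccc 220)
    satisfies ccc<230 while COMBINING ACUTE ACCENT (ccc 230)
    does not.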

    \u and \U are the same except I got rid of the two extra
    leading zeros in \U since a code point is always
    representable in 24 bits, e.g. [\u0000-\U10FFFF]
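
    A small Python sketch of how the two escapes could be
    expanded, assuming \u takes exactly four hex digits and \U
    exactly six as described above. The names ESCAPE and
    expand_escapes are hypothetical, and range checking beyond
    U+10FFFF is left out.

```python
import re

# \U followed by six hex digits, or \u followed by four.
ESCAPE = re.compile(r"\\U([0-9A-Fa-f]{6})|\\u([0-9A-Fa-f]{4})")


def expand_escapes(pattern):
    """Replace \\uXXXX and \\UXXXXXX escapes in pattern with the code points they name."""
    def replace(match):
        digits = match.group(1) or match.group(2)
        return chr(int(digits, 16))
    return ESCAPE.sub(replace, pattern)
```

    Applied to the raw text [\u0000-\U10FFFF], this yields a
    character class running from U+0000 to U+10FFFF.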

    \N{name} works with character names and also with named
    character sequences.

    Another experimental part of my implementation is that
    a pattern can only match if it starts and ends on a
    grapheme cluster boundary. This prevents, for example,
    the Hangul syllable \uAC00 from matching the first part
    of \uAC01 which is composed of the same leading and vowel
    jamos, but which also has a trailing jamo.
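
    The boundary rule can be sketched in Python for a literal
    search. This is a simplification, not UAX #29: it assumes
    NFD text, treats any General_Category M* character as
    extending the previous cluster, and only knows the jamo
    pairs L-L, L-V, V-V, V-T and T-T as non-breaking. All names
    are hypothetical.

```python
import unicodedata

NO_BREAK = {("L", "L"), ("L", "V"), ("V", "V"), ("V", "T"), ("T", "T")}


def jamo_type(ch):
    """Classify conjoining jamo in the U+1100 block by code point range."""
    cp = ord(ch)
    if 0x1100 <= cp <= 0x115F:
        return "L"
    if 0x1160 <= cp <= 0x11A7:
        return "V"
    if 0x11A8 <= cp <= 0x11FF:
        return "T"
    return None


def is_cluster_boundary(text, i):
    """Approximate test for a grapheme cluster boundary before text[i]."""
    if i == 0 or i == len(text):
        return True
    if unicodedata.category(text[i]).startswith("M"):
        return False
    return (jamo_type(text[i - 1]), jamo_type(text[i])) not in NO_BREAK


def find_on_boundaries(needle, haystack):
    """Index in NFD haystack of the first match that starts and ends on a boundary, or -1."""
    needle = unicodedata.normalize("NFD", needle)
    haystack = unicodedata.normalize("NFD", haystack)
    start = haystack.find(needle)
    while start != -1:
        end = start + len(needle)
        if is_cluster_boundary(haystack, start) and is_cluster_boundary(haystack, end):
            return start
        start = haystack.find(needle, start + 1)
    return -1
```

    Searching for \uAC00 inside \uAC01 finds the leading and
    vowel jamo, but the match would end in front of the
    trailing jamo, so it is rejected and the result is -1.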

    If anybody thinks that any of this is bad design, I'd
    be happy to hear suggestions for improvement!

