Unicode Regex Design (was Re: New Public Review Issue: Proposed Update UTS #18)

From: Mike (mike-list@pobox.com)
Date: Fri Sep 21 2007 - 17:07:19 CDT

Next message: Theo Veenker: "Re: Unicode Regex Design (was Re: New Public Review Issue: Proposed Update UTS #18)"

Previous message: Andy Heninger: "Re: New Public Review Issue: Proposed Update UTS #18"
In reply to: Mike: "Re: New Public Review Issue: Proposed Update UTS #18"
Next in thread: Theo Veenker: "Re: Unicode Regex Design (was Re: New Public Review Issue: Proposed Update UTS #18)"
Reply: Theo Veenker: "Re: Unicode Regex Design (was Re: New Public Review Issue: Proposed Update UTS #18)"
Reply: Philippe Verdy: "RE: Unicode Regex Design (was Re: New Public Review Issue: Proposed Update UTS #18)"

When I decided to implement Unicode regular expressions,
I spent some time looking at Perl regular expressions,
and a lot of time thinking about what opportunities there
are for a Unicode version. Here is what I came up with
for the "ultimate" Unicode regular expression syntax.

Some of the Perl escapes are rooted in ASCII, such as \f \v
\r \n, and don't have much use in Unicode. I decided to
drop these in favor of a \n that represents any of the
newline sequences; \f and \r are not used and \v is given
a new use (see below). If you really need to look for
a line feed character, you can specify it as \u000A or
\N{LF}. Using \t for tab is still relevant, so I kept
that.
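
As a rough illustration, a unified \n can be approximated in
Python's re module with an alternation over the usual Unicode
newline sequences. The set used below (CR LF, LF, VT, FF, CR,
NEL, LS, PS) and the names ANY_NEWLINE and split_lines are
assumptions made for this sketch, not the implementation
described in this message.

```python
import re

# Assumed newline set. CR LF is tried first so that it counts as a
# single newline; then LF, VT, FF, CR, NEL (U+0085), LS (U+2028)
# and PS (U+2029).
ANY_NEWLINE = re.compile("\r\n|[\n\x0b\x0c\r\x85\u2028\u2029]")


def split_lines(text):
    """Split text on any of the newline sequences listed above."""
    return ANY_NEWLINE.split(text)


if __name__ == "__main__":
    print(split_lines("a\r\nb\u2028c\nd"))
```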

Another decision I made was to have '.' match a grapheme
cluster such as A + ACUTE as a single entity. This is
still experimental, since I don't have any way to get
feedback on whether users would like it. I may turn it
into an option, so you could have dot either match just
a code point or a grapheme cluster. In any case, \c can
be used to match a single code point. Some grapheme
clusters are defective in that there is no base character,
so I added an option to allow . to match them (true by
default). You can also look for a defective combining
character sequence using \F.
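
To make the idea concrete, here is a deliberately simplified
Python sketch of splitting text into the units that '.' would
match. It treats a cluster as one code point followed by any
number of combining marks (General_Category M*), which is much
cruder than the default grapheme clusters of UAX #29 (no Hangul
or CR LF handling). The function names are hypothetical.

```python
import unicodedata


def is_mark(ch):
    """True if ch has General_Category Mn, Mc or Me."""
    return unicodedata.category(ch).startswith("M")


def clusters(text):
    """Return a list of (cluster, defective) pairs.

    A cluster is one code point plus all combining marks that follow
    it. It is flagged defective when it begins with a mark, i.e. when
    there is no base character.
    """
    out = []
    i = 0
    while i < len(text):
        j = i + 1
        while j < len(text) and is_mark(text[j]):
            j += 1
        out.append((text[i:j], is_mark(text[i])))
        i = j
    return out


if __name__ == "__main__":
    # A + COMBINING ACUTE ACCENT is reported as one unit.
    print(clusters("A\u0301b"))
```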

I found that the word boundary (\b and \B), digit (\d and
\D), and word character (\w and \W) escapes were still
useful, so these are retained (although they are more
complex -- word boundaries are based on the Word_Break
property, for example).

Unicode has some features that deserve compact syntax
such as:

   \a assigned code point
   \A unassigned code point
   \g default grapheme cluster boundary
   \G complement of \g
   \h hex digit
   \m combining character (equivalent to \p{M})
   \M complement of \m

So, for example, you can search for any variant of the
letter e with /e\m*/.
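
Python's re module has no \m or \p{M}, so the following sketch
emulates /e\m*/ by hand: it normalizes the text to NFD and then
collects each 'e' together with the combining marks after it.
The function name find_e_variants is made up for illustration.

```python
import unicodedata


def find_e_variants(text):
    """Return every 'e' plus trailing combining marks, in NFD form."""
    nfd = unicodedata.normalize("NFD", text)
    matches = []
    i = 0
    while i < len(nfd):
        if nfd[i] == "e":
            j = i + 1
            # Extend the match over any combining marks (gc = M*).
            while j < len(nfd) and unicodedata.category(nfd[j]).startswith("M"):
                j += 1
            matches.append(nfd[i:j])
            i = j
        else:
            i += 1
    return matches


if __name__ == "__main__":
    # U+00E9 and U+00EA decompose to e + mark under NFD.
    print(find_e_variants("\u00e9 \u00ea e"))
```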

I felt it was time to make reparations to the East Asian
population for making it so difficult for them to use
their native languages on computers, so I added syntax
just for them:

   \i CJK ideograph
   \I Unified ideograph
   \K Katakana
   \H Hiragana
   \L leading jamo
   \V vowel jamo
   \T trailing jamo

A Hangul syllable can be found using \L+\V+\T*. In
my implementation, I convert both the pattern and text
to search into NFD, so unfortunately \I is not nearly
as useful as I had thought it would be -- most of the
non-Unified characters canonically decompose into a
unified ideograph. I'm hoping to figure out a solution
to this problem (but it's an implementation issue, so
I think having \I is still valid).
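
The two points above can be tried out in Python. The sketch
below assumes that \L, \V and \T correspond to the ranges
U+1100-U+115F, U+1160-U+11A7 and U+11A8-U+11FF of the Hangul
Jamo block (an approximation chosen for this example), and
uses unicodedata to show how NFD turns a compatibility
ideograph into a unified one.

```python
import re
import unicodedata

# Assumed jamo ranges: leading, vowel, trailing.
HANGUL_SYLLABLE = re.compile(
    "[\u1100-\u115f]+[\u1160-\u11a7]+[\u11a8-\u11ff]*"
)


def nfd(text):
    """Return the canonical decomposition (NFD) of text."""
    return unicodedata.normalize("NFD", text)


if __name__ == "__main__":
    # U+AC01 decomposes into leading, vowel and trailing jamo.
    print([hex(ord(c)) for c in nfd("\uac01")])
    # U+F900 (a compatibility ideograph) decomposes to U+8C48.
    print(hex(ord(nfd("\uf900"))))
```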

My code supports Unicode versions 3.2, 4.0, 4.1, and 5.0,
so I added a way to specify which version to use for
character properties:

   \v{version}   e.g. /\v{4.1}\A+/
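
For comparison, Python's standard library offers a limited form
of the same idea: unicodedata uses the database version it was
built with, and unicodedata.ucd_3_2_0 exposes the Unicode 3.2.0
data. The sketch below tests for an unassigned code point (what
\A means here) against either database. Only those two versions
are available this way, and the helper name is hypothetical.

```python
import unicodedata


def is_unassigned(ch, ucd=unicodedata):
    """True if ch has General_Category Cn in the given database."""
    return ucd.category(ch) == "Cn"


if __name__ == "__main__":
    ch = "\U0001F600"  # assigned long after Unicode 3.2
    print(is_unassigned(ch, unicodedata.ucd_3_2_0))
    print(is_unassigned(ch))
```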

\p and \P are similar to what you have defined, but as
we've been discussing, I allow multiple values:

   \p{gc=L|M|N}

and in some cases comparisons:

   \p{Numeric_Value>=10}, \p{ccc<230}
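
What these property tests mean for a single code point can be
sketched with Python's unicodedata module. The three helper
names below are invented for this example; they only mirror
\p{gc=L|M|N}, \p{Numeric_Value>=10} and \p{ccc<230} and say
nothing about how a regex engine would evaluate them.

```python
import unicodedata


def gc_in(ch, majors):
    """True if the major General_Category of ch is one of majors."""
    return unicodedata.category(ch)[0] in majors


def numeric_value_at_least(ch, n):
    """True if ch has a Numeric_Value and that value is >= n."""
    value = unicodedata.numeric(ch, None)
    return value is not None and value >= n


def ccc_less_than(ch, n):
    """True if the Canonical_Combining_Class of ch is below n."""
    return unicodedata.combining(ch) < n


if __name__ == "__main__":
    print(gc_in("a", "LMN"))                      # \p{gc=L|M|N}
    print(numeric_value_at_least("\u2169", 10))   # ROMAN NUMERAL TEN
    print(ccc_less_than("\u0323", 230))           # COMBINING DOT BELOW
```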

\u and \U are likewise the same as what you have defined,
except that I got rid of the two extra leading zeros in \U,
since a code point always fits in 24 bits (six hex digits),
e.g. [\u0000-\U10FFFF]
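
A small Python sketch of how such escapes could be expanded
before a pattern is compiled: \u takes exactly four hex digits
and \U exactly six. The function name is hypothetical, and the
sketch ignores details such as escaped backslashes; chr() raises
ValueError for values above 10FFFF.

```python
import re

# \uXXXX (four hex digits) or \UXXXXXX (six hex digits).
_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{6})")


def expand_escapes(pattern):
    """Replace \\u and \\U escapes with the code points they name."""
    def repl(match):
        digits = match.group(1) or match.group(2)
        return chr(int(digits, 16))
    return _ESCAPE.sub(repl, pattern)


if __name__ == "__main__":
    expanded = expand_escapes(r"[\u0000-\U10FFFF]")
    print([hex(ord(c)) for c in expanded])
```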

\N{name} works with character names and also with named
character sequences.
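
Name lookup of this kind can be sketched in Python with
unicodedata.lookup, which raises KeyError for an unknown name.
According to the Python documentation, lookup also accepts name
aliases and named sequences since Python 3.3; the example below
sticks to plain character names. The function name expand_names
is made up.

```python
import re
import unicodedata

_NAMED = re.compile(r"\\N\{([^}]+)\}")


def expand_names(pattern):
    """Replace each \\N{name} with the character(s) that name denotes."""
    return _NAMED.sub(lambda m: unicodedata.lookup(m.group(1)), pattern)


if __name__ == "__main__":
    text = expand_names(r"\N{LATIN SMALL LETTER E}\N{COMBINING ACUTE ACCENT}")
    print([hex(ord(c)) for c in text])
```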

Another experimental part of my implementation is that
a pattern can only match if it starts and ends on a
grapheme cluster boundary. This prevents, for example,
the Hangul syllable \uAC00 from matching the first part
of \uAC01 which is composed of the same leading and vowel
jamos, but which also has a trailing jamo.
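
The effect of this rule can be sketched for plain substring
search in Python. Both strings are converted to NFD, and a hit
is kept only if the code point at its start and the one right
after its end do not continue a cluster. The boundary test is a
crude stand-in for UAX #29 (combining marks plus vowel and
trailing jamo only), and all names are hypothetical. Offsets
refer to the NFD form of the text.

```python
import unicodedata


def _extends_cluster(ch):
    """Simplified test: does ch attach to the code point before it?"""
    return (unicodedata.category(ch).startswith("M")
            or "\u1160" <= ch <= "\u11ff")  # vowel and trailing jamo


def find_whole_clusters(needle, haystack):
    """Return NFD offsets where needle matches on cluster boundaries."""
    n = unicodedata.normalize("NFD", needle)
    h = unicodedata.normalize("NFD", haystack)
    if not n:
        return []
    hits = []
    start = h.find(n)
    while start != -1:
        end = start + len(n)
        starts_ok = start == 0 or not _extends_cluster(h[start])
        ends_ok = end == len(h) or not _extends_cluster(h[end])
        if starts_ok and ends_ok:
            hits.append(start)
        start = h.find(n, start + 1)
    return hits


if __name__ == "__main__":
    # U+AC00 is a prefix of U+AC01 in NFD, but the match is rejected.
    print(find_whole_clusters("\uac00", "\uac01"))
    print(find_whole_clusters("\uac00", "\uac00\uac01"))
```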

If anybody thinks that any of this is bad design, I'd
be happy to hear suggestions for improvement!

Mike


This archive was generated by hypermail 2.1.5 : Fri Sep 21 2007 - 17:11:11 CDT