Re: Derived age regexp

From: Markus Scherer (markus.icu@gmail.com)
Date: Fri Oct 15 2010 - 18:04:36 CDT

Next message: Eric Muller: "Re: Derived age regexp"

Previous message: Saqqara: "Re: OpenType update for Unicode 5.2/6.0?"
In reply to: Tim Greenwood: "Derived age regexp"
Next in thread: Eric Muller: "Re: Derived age regexp"

On Fri, Oct 15, 2010 at 3:19 PM, Tim Greenwood <timothy@greenwood.name> wrote:

> Is there any regular expression - in perl, or elsewhere, that enables
> searching on the derived age? I want to find all characters in a file added
> since Unicode 4.1.
> I could write it all by processing against the derived age file, but it
> would be nice if it is ready to go.
>

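You could use an ICU UnicodeSet or an ICU regular expression. For example, the
set [[:^Cn:]&[:^age=4.1:]] (assigned code points that are not in age 4.1)
lists everything added after Unicode 4.1:
http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[[:^Cn:]%26[:^age%3D4.1:]]&abb=on&g=

The ICU User Guide covers both approaches:
http://userguide.icu-project.org/strings/unicodeset
http://userguide.icu-project.org/strings/regexp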

A (frozen) UnicodeSet with its span() or spanUTF8() method might suffice,
depending on what you need.

We also have a dedicated API (UCharacter.java/uchar.h) for the non-Unihan
properties.

Note what UTS #18 <http://www.unicode.org/reports/tr18/> says about [:age:]
or \p{age} (which ICU implements):

Age
Caution: The DerivedAge <http://www.unicode.org/Public/UNIDATA/DerivedAge.txt>
data file in the UCD provides the deltas between versions, for compactness.
However, when using the property all characters included in that version are
included. Thus \p{age=3.0} includes the letter *a*, which was included in
Unicode 1.0. To get characters that are new in a particular version,
subtract off the previous version as described in 1.3 Subtraction and
Intersection <http://www.unicode.org/reports/tr18/#Subtraction_and_Intersection>.
For example: [\p{age=3.1} -- \p{age=3.0}]
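
If you do end up processing DerivedAge.txt yourself, the difference between
the file's deltas and the cumulative property could be handled roughly as
follows. This is a minimal Python sketch, not ICU code: the function names
(parse_derived_age, delta_age, in_age, newer_than) and the SAMPLE excerpt are
assumptions made for illustration, and a real run would read the full file
from the UCD instead of the handful of lines shown here.

```python
import re

# A few lines in the DerivedAge.txt layout: "range ; version # comment".
# Illustrative excerpt only (an assumption for this sketch); a real run
# would read the complete file from the UCD.
SAMPLE = """\
# comment lines and blank lines are ignored
0061..007A    ; 1.1 #  [26] LATIN SMALL LETTER A..LATIN SMALL LETTER Z
0241          ; 4.1 #       LATIN CAPITAL LETTER GLOTTAL STOP
0242..024F    ; 5.0 #  [14] LATIN SMALL LETTER GLOTTAL STOP..LATIN SMALL LETTER Y WITH STROKE
20B9          ; 6.0 #       INDIAN RUPEE SIGN
"""

# One code point or a first..last range, then the version after the semicolon.
LINE = re.compile(
    r"^([0-9A-Fa-f]{4,6})(?:\.\.([0-9A-Fa-f]{4,6}))?\s*;\s*(\d+)\.(\d+)"
)


def parse_derived_age(text):
    # Returns a list of (first, last, (major, minor)) delta ranges.
    ranges = []
    for line in text.splitlines():
        m = LINE.match(line)
        if m:
            first = int(m.group(1), 16)
            last = int(m.group(2) or m.group(1), 16)
            ranges.append((first, last, (int(m.group(3)), int(m.group(4)))))
    return ranges


def delta_age(cp, ranges):
    # Version in which cp was first assigned, or None if no range covers it.
    for first, last, version in ranges:
        if first <= cp <= last:
            return version
    return None


def in_age(cp, version, ranges):
    # Cumulative test in the sense of \p{age=version}:
    # true if cp was assigned in that version or any earlier one.
    added = delta_age(cp, ranges)
    return added is not None and added <= version


def newer_than(text, version, ranges):
    # Characters of text that are assigned but not in \p{age=version},
    # i.e. the subtraction [assigned -- \p{age=version}].
    return [
        ch for ch in text
        if delta_age(ord(ch), ranges) is not None
        and not in_age(ord(ch), version, ranges)
    ]


if __name__ == "__main__":
    ranges = parse_derived_age(SAMPLE)
    found = newer_than("a\u0241\u0242\u20b9", (4, 1), ranges)
    print(["U+%04X" % ord(ch) for ch in found])
```

Versions are compared as (major, minor) tuples so that, say, 4.1 sorts before
5.0 without any string comparison surprises. Characters that no range in the
parsed data covers are treated as unassigned and skipped, which mirrors the
[:^Cn:] part of the UnicodeSet pattern.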

Best regards,
markus


This archive was generated by hypermail 2.1.5 : Fri Oct 15 2010 - 18:06:41 CDT