Rick McGowan <Rick_McGowan@next.com> wrote:
> A couple of questions for any Unix-heads out there...
> 1. Does anyone have a "freely distributable" regular expression parser and/or
> text finding routine with these characteristics:
> A. It eats Unicode text or UTF-8 on input
> B. It uses the typical Unixoid regular expression syntax
> 2. Has anyone given any serious thought to extensions of said Unixoid regular
> expression syntax to handle non-English alphabets used as "ranges" for pattern
> Thanks for any info,
Re: 1) Perl 5, the "freely distributable" mother of all regular
expression engines, now has a wide_char module that may suit your needs.
I haven't used it yet, though, so I'm not sure what its capabilities
are. You can find it at http://www.perl.com/perl in the CPAN archives.
(I think the name is something like widechar.pm). Also, there is a
Japanese Perl, if by some chance that is specifically what you are
Re: 2) The Plan 9 Team (now the "Inferno Team") at AT&T were working on
this. Perhaps someone else knows something about it, but I don't.
When I asked James Gosling at Sun about this, he said he was giving
"serious thought" to how to do it "right" in Java. My impression was
that the effort was unlikely to progress beyond "thought" any time soon.
;-) Another charter member of Sun's Java team later confided that they
(the Java team) "didn't have a clue" about how to go about handling all
the "incredibly arcane problems" involved in parsing Unicode, so it was
likely, he believed, that they would concentrate on getting the pure
Unicode foundation solid, then let third parties build sensible
Unicode-based regex libraries.
Also, the Java team members at Symantec said they were going to work on
a Perl-style regex library for (possible) eventual inclusion in Visual
Cafe, but I haven't heard anything about it since last fall.
In the meantime, there are a handful of Java regex libraries available
at http://www.gamelan.com, most with source code. You could presumably
modify them to do what you want to do specifically.
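As a rough illustration of question 2 (not taken from any of the libraries
mentioned above), here is a minimal sketch of a non-English alphabet written
as a "range" in an engine that matches on Unicode code points. Python's
standard re module is used purely as an assumption for illustration, and the
helper name find_greek_runs is hypothetical.

```python
import re

# The Greek and Coptic block, U+0370..U+03FF, written as an ordinary
# character-class range. The range is by code point, not by alphabetical
# or collation order, and the block also contains a few non-letters.
GREEK_RUN = re.compile("[\u0370-\u03FF]+")


def find_greek_runs(text):
    """Return every maximal run of Greek-block characters in text."""
    return GREEK_RUN.findall(text)


if __name__ == "__main__":
    # Illustrative input only.
    print(find_greek_runs("alpha is αλφα, omega is ωμεγα"))
```

A code-point range like this is the easy part. It says nothing about
combining marks, case folding, or locale-specific ordering, which is
presumably where the "incredibly arcane problems" come in.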
Hope that helps,
This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:33 EDT