Regex and arcane parsing

From: Kenneth Whistler (kenw@sybase.com)
Date: Wed Jan 29 1997 - 17:32:47 EST


> Another charter member of Sun's Java team later confided that they
> (the Java team) "didn't have a clue" about how to go about handling all
> the "incredibly arcane problems" involved in parsing Unicode

I hope that people who are implementing regular expression parsers
for Unicode, or other kinds of parsers (SQL expression parsers, for
example), are paying attention to the implementation guidelines in
the Unicode Standard, Version 2.0, in particular Section 5.14,
Identifiers, and the errata to that section posted on the unicode.org
web site:

http://www.unicode.org/unicode/uni2errata/UnicodeDatabaseErrata.html

Identifier parsing is relatively straightforward in Unicode, given
properly defined classes. The same principles can be applied to
regular expression parsers to specify the classes of characters to
be matched by various wildcard characters.
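
To make the idea concrete, here is a minimal sketch in Python of
class-driven identifier scanning. It assumes the identifier classes
are expressed as sets of general categories (start: Lu, Ll, Lt, Lm,
Lo, Nl; extend: Mn, Mc, Nd, Pc, Cf), which should be checked against
Section 5.14 and its errata. The function names are hypothetical, and
Python's unicodedata module reflects a much later version of the
character database than 2.0.

```python
import unicodedata

# Assumed class definitions, written as sets of general categories.
# Verify against Section 5.14 and its errata before relying on them.
ID_START = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
ID_EXTEND = {"Mn", "Mc", "Nd", "Pc", "Cf"}


def is_id_start(ch):
    """True if ch may begin an identifier under the assumed classes."""
    return unicodedata.category(ch) in ID_START


def is_id_part(ch):
    """True if ch may continue an identifier under the assumed classes."""
    return is_id_start(ch) or unicodedata.category(ch) in ID_EXTEND


def scan_identifier(text, pos=0):
    """Return the end index of the identifier starting at pos.

    Returns pos itself when no identifier starts there.
    """
    if pos >= len(text) or not is_id_start(text[pos]):
        return pos
    end = pos + 1
    while end < len(text) and is_id_part(text[end]):
        end += 1
    return end


def matches_word_class(ch):
    """Hypothetical regex 'word character' test reusing the same classes."""
    return is_id_part(ch)
```

A regex engine could route its wildcard or word-class test through
the same predicate, so that the identifier scanner and the pattern
matcher agree on what matches.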

It would truly be a shame if every regex developer were to cobble
together a different solution for what does and does not match, when
a standard specification is available and machine-readable versions
of the data specifying the classes are available on the ftp site.
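
As a rough illustration of deriving classes from the machine-readable
data rather than hand-coding them, the Python sketch below reads
semicolon-separated records of the kind used by the character
database file, where the first field is the code point in hex and the
third is the general category. The sample records are illustrative
only, the function names are hypothetical, and the category set used
for the identifier-start class is an assumption to check against
Section 5.14.

```python
# Illustrative records in the semicolon-separated layout of the data
# file. Only the first three fields are used here.
SAMPLE = """\
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
0031;DIGIT ONE;Nd;0;EN;;1;1;1;N;;;;;
005F;LOW LINE;Pc;0;ON;;;;;N;SPACING UNDERSCORE;;;;
0020;SPACE;Zs;0;WS;;;;;N;;;;;
"""


def load_categories(lines):
    """Map code point -> general category from data-file style lines.

    Any range conventions the real file uses are not handled here.
    """
    table = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        fields = line.split(";")
        table[int(fields[0], 16)] = fields[2]
    return table


def build_class(table, categories):
    """Collect the code points whose category is in the given set."""
    return {cp for cp, cat in table.items() if cat in categories}


CATEGORIES = load_categories(SAMPLE.splitlines())
# Assumed identifier-start categories; verify against the standard.
ID_START_SET = build_class(CATEGORIES, {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"})
```

Generating the classes this way means two implementations that read
the same data file arrive at the same answer for any given character.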

--Ken Whistler


