Re: FW: Matching Unicode strings and combining characters [was: basic

From: Mark Leisher (mleisher@crl.nmsu.edu)
Date: Thu Sep 30 1999 - 14:59:59 EDT


    John> The point is that either "logi0xf1" or "login~" must be acceptable,
    John> and distinct from "login". Before combining characters, this wasn't
    John> a problem.

I'm surprised nobody has pointed this out yet, but the answer is simple.
Do not normalize. Change the pattern to a regular expression including the
variations that can occur (e.g. "logiñ:" changes to "logi(ñ|n~):").
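
A minimal sketch of that alternation in Python, assuming the precomposed
character is U+00F1 and that "n~" stands for "n" followed by the combining
tilde U+0303. The names PROMPT and is_prompt are made up for illustration;
they are not part of any real login or expect-style tool.

```python
import re

# U+00F1 is the precomposed n-with-tilde (the 0xf1 in John's example).
PRECOMPOSED = "\u00f1"
# Assumption: "n~" in the thread means "n" followed by COMBINING TILDE (U+0303).
DECOMPOSED = "n\u0303"

# Hypothetical prompt pattern: accept either spelling, followed by a colon.
# The incoming text is matched as received; nothing is normalized first.
PROMPT = re.compile("logi(?:" + PRECOMPOSED + "|" + DECOMPOSED + "):")


def is_prompt(text):
    """Return True if text contains the prompt in either spelling."""
    return PROMPT.search(text) is not None
```

Both spellings match, and plain "login:" does not, because the bare "n"
alternative requires the combining tilde right after it.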

Otherwise nothing can work unless the host and client agree on a normalization
level. In my opinion, the two options available are to extend protocols for
normalization level negotiation or use regular expressions. You guess which
is more tractable.

Read Daniel Yacob's paper on regular expressions and Unicode Amharic. It
brings up similar issues.

  ftp://ftp.cs.indiana.edu/pub/fidel/perl-unicode/regex992.ps.gz
-----------------------------------------------------------------------------
Mark Leisher
Computing Research Lab            The first virtue is to restrain the tongue;
New Mexico State University       he approaches nearest to the gods who knows
Box 30001, Dept. 3CRL             how to be silent, even though he is in the
Las Cruces, NM 88003              right.  -- Cato the Younger (95-46 B.C.E.)


