Re: Regular expressions in Unicode (Was: Ethiopic text)

Date: Thu Mar 12 1998 - 10:53:23 EST

>From: "Alain LaBonté" <>
>To: Unicode List <>
>Date: Thu, 12 Mar 1998 06:33:52 -0800 (PST)
>Subject: Re: Regular expressions in Unicode (Was: Ethiopic text)
>At 04:17 98-03-12 -0800, Jeroen Hellingman wrote:
>>his field of knowledge, ASCII can be overseen, but Unicode is too large
>>for most people to oversee the effects of a range selection.
>[Alain] :
>Even ASCII range is problematic in English...
>"A to Z" does not imply "a to z", does it ?
>One should not expect the end-user to know what is under the hood!
>And "A to z" leads to no hit in EBCDIC, while "a to Z" will lead to no hit
>in ASCII!
>Alain LaBonté

I believe that the POSIX approach of character class expressions
can shed some light in this area. The character class expressions
are based on examination of what end-users have traditionally intended
when expressions such as [a-z], [A-Z], [0-9], [a-zA-Z], etc. were
used. The examination concluded that generally the intent was

        [a-z] - lowercase
        [A-Z] - uppercase
        [0-9] - digits
        [a-zA-Z] - alphabetics

This led to the notation:

        [[:lower:]] - lowercase
        [[:upper:]] - uppercase
        [[:digit:]] - digits
        [[:alpha:]] - alphabetics

which allows end-users to obtain the necessary information without
having to "know what is under the hood."
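
To make the contrast concrete, here is a rough Python sketch. It is only an
illustration under stated assumptions: the helper names code_range and select
are made up for this example, cp037 is used as one representative EBCDIC code
page, and Python's str.islower/isupper/isdigit/isalpha stand in approximately
for the POSIX classes lower, upper, digit and alpha (Python's re module does
not itself accept the POSIX class notation, and real POSIX classes depend on
the locale).

```python
import re

def code_range(first, last, encoding):
    """Characters whose byte values run from `first` to `last` in a
    single-byte encoding; empty when `first` sorts after `last`."""
    lo = first.encode(encoding)[0]
    hi = last.encode(encoding)[0]
    return bytes(range(lo, hi + 1)).decode(encoding)

# Approximate stand-ins for the POSIX classes (an assumption of this sketch).
CLASSES = {
    "lower": str.islower,
    "upper": str.isupper,
    "digit": str.isdigit,
    "alpha": str.isalpha,
}

def select(text, cls):
    """Keep only the characters of `text` that belong to class `cls`."""
    return "".join(ch for ch in text if CLASSES[cls](ch))

if __name__ == "__main__":
    # A range depends on the code set underneath it.
    print(repr(code_range("A", "z", "ascii")))   # letters plus punctuation
    print(repr(code_range("a", "Z", "ascii")))   # empty: 'a' sorts after 'Z'
    print(repr(code_range("A", "z", "cp037")))   # empty: 'A' sorts after 'z'

    # A class states the intent, so accented letters are not lost.
    name = "LaBonté"
    print("".join(re.findall("[a-z]", name)))    # range misses the e-acute
    print(select(name, "lower"))                 # class keeps it
```

The range results change with the code set, while the class-based selection
only asks "is this a lowercase letter?" and leaves the code values under the
hood.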

   Gary W. Miller Internet -
   IBM JTMS/903 ZIP 9374 X/Open -
   11400 Burnet Road VNET - AUSTIN(GWM) / GWM at AUSTIN
   Austin, Texas 78758 SENDFILE - GWM at AUSVM6
   Phone: (512) 838-8297 Fax: (512) 838-0169

This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:39 EDT