Re: Regular expressions in Unicode (Was: Ethiopic text)

From: gwm@austin.ibm.com
Date: Thu Mar 12 1998 - 10:53:23 EST


>From: "Alain LaBonté" <alb@sct.gouv.qc.ca>
>Reply-To: unicode@unicode.org
>To: Unicode List <unicode@unicode.org>
>Date: Thu, 12 Mar 1998 06:33:52 -0800 (PST)
>Subject: Re: Regular expressions in Unicode (Was: Ethiopic text)
>
>At 04:17 98-03-12 -0800, Jeroen Hellingman wrote:
>>his field of knowledge, ASCII can be overseen, but Unicode is too large
>>for most people to oversee the effects of a range selection.
>
>[Alain] :
>Even ASCII range is problematic in English...
>
>"A to Z" does not imply "a to z", does it ?
>
>One should not expect the end-user to know what is under the hood!
>
>And "A to z" leads to no hit in EBCDIC, while "a to Z" leads to no hit
>in ASCII!
>
>Alain LaBonté
>Québec
>
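Alain's point about range order can be checked with a small sketch.
It assumes Python and uses code page 037 as one representative EBCDIC
variant; the helper names code() and range_is_empty() are made up for
this illustration and are not part of any regex library.

```python
# Compare where the letters fall in ASCII and in EBCDIC (code page 037,
# chosen here only as one representative EBCDIC variant).

def code(ch, encoding):
    """Return the single-byte code of ch in the given encoding."""
    return ch.encode(encoding)[0]

def range_is_empty(first, last, encoding):
    """A range first-last has no members when first sorts after last."""
    return code(first, encoding) > code(last, encoding)

for enc in ("ascii", "cp037"):
    print(enc,
          "| A-z empty:", range_is_empty("A", "z", enc),
          "| a-Z empty:", range_is_empty("a", "Z", enc))
```

In ASCII the uppercase letters sort before the lowercase ones, and in
EBCDIC it is the other way around, so each mixed-case range is empty
in exactly one of the two encodings.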

I believe that the POSIX approach of character class expressions
can shed some light on this area. The character class expressions
are based on an examination of what end-users have traditionally
intended when they wrote expressions such as [a-z], [A-Z], [0-9],
and [a-zA-Z]. The examination concluded that the intent was generally:

        [a-z] - lowercase
        [A-Z] - uppercase
        [0-9] - digits
        [a-zA-Z] - alphabetics

This led to the notation:

        [:alnum:]
        [:alpha:]
        [:blank:]
        [:cntrl:]
        [:digit:]
        [:graph:]
        [:lower:]
        [:print:]
        [:punct:]
        [:space:]
        [:upper:]
        [:xdigit:]

which allows end-users to obtain the necessary information without
having to "know what is under the hood."
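
As a rough illustration of the difference between asking for a code
range and asking for a class, here is a minimal sketch in Python. The
CLASSES table and the in_class() and in_range() helpers are
hypothetical names for this example. The predicates follow Unicode
character properties rather than a POSIX locale, so they only
approximate the classes listed above (str.isdigit, for instance,
accepts more than 0-9).

```python
# Hypothetical mapping from a few POSIX class names to Python predicates.
# These follow Unicode character properties, not a POSIX locale, so the
# correspondence to the POSIX classes is approximate.
CLASSES = {
    "alnum": str.isalnum,
    "alpha": str.isalpha,
    "digit": str.isdigit,
    "lower": str.islower,
    "space": str.isspace,
    "upper": str.isupper,
}

def in_class(ch, name):
    """Test a character by property: what the user usually means."""
    return CLASSES[name](ch)

def in_range(ch, first, last):
    """Test a character by code order: what a range like a-z does."""
    return first <= ch <= last

for ch in "aZé5":
    print(repr(ch),
          "| in a-z:", in_range(ch, "a", "z"),
          "| lower:", in_class(ch, "lower"))
```

The accented letter falls outside the code range a-z but is still
reported as lowercase by the class test, which is the behaviour the
class notation is meant to give.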

-------------------------------------------------------------------------
   Gary W. Miller Internet - gwm@austin.ibm.com
   IBM JTMS/903 ZIP 9374 X/Open - g.miller@xopen.co.uk
   11400 Burnet Road VNET - AUSTIN(GWM) / GWM at AUSTIN
   Austin, Texas 78758 SENDFILE - GWM at AUSVM6
   Phone: (512) 838-8297 Fax: (512) 838-0169
-------------------------------------------------------------------------


