Re: regular expressions

From: Alain LaBonté (alb@sct.gouv.qc.ca)
Date: Tue Feb 04 1997 - 08:58:15 EST


At 19:01 97-02-03 -0700, Mark Leisher wrote:
>As a programmer, I don't look at regular expressions as patterns of letters,
>I look at them as patterns of character codes. Occasionally, these character
>codes happen to have graphic symbols associated which can be used as a
>shorthand.

That makes your program non-portable, but that is a choice I can't dispute.

>I further contend that no programmer experienced with character sets other
>than ASCII still believes that regexp's are patterns of letters and not
>character codes. Unfortunately, many programmers still believe the "letter"
>model.

Precisely, we are on the same wavelength. I would say most programmers,
though. And those who claim the contrary unconsciously behave like the
others most of the time. They know it, but that knowledge is not integrated
into their thinking (because it is misleading and not natural).

>I am looking at a document containing a mixture of Arabic, Chinese, English
>and Vietnamese (encoded in UCS2). When I say, "find all occurrences of [a-z]",
>I want the ability to get different behaviors depending on the task:
>
> 1. Skip the Arabic text, match all English lower case letters between "a"
> and "z" in the collation sense, match all the lower case QuanJiao
> (a.k.a. Zenkaku, Fullwidth, or Wide) letters between "a" and "z" in
> the Hanyu Pinyin collation sense, and match lower case Vietnamese
> consonants plus vowels with no tone or diacritic marks.

The "hanyu pinyin" I know (the canonical one, the one I use) also has
letters accented with tone marks, some of them carrying two marks as in
Vietnamese... just a remark in passing.

> 2. Everything the same as (1) except match *all* lower case Vietnamese
> letters between "a" and "z" in the collation sense, which includes all
> the vowels with tone and/or diacritic marks.
>
>Do I need different version of the UCS2 locale for each possible
>interpretation of "[a-z]" I need for Vietnamese?

That's the idea of a traditional locale. However, in ISO/IEC CD 14651 we
completely redefine the comparison operation (we introduce the concept of
equivalence at different levels of comparison), and that allows you to do all
of that with a single operation, without changing locales...

Hence "La Bonté" is equivalent to "labonte" at level 1 but not at level 2
or higher. It is equivalent to "labonté" at level 2 but not at level 3 or
higher (note that there is a space in the reference and none in the
comparand; that is intentional). It is equivalent to "LaBonté" at level 3
but not at higher levels. Finally, it is absolutely identical (that is,
equivalent at level 4) to "La Bonté". Note that the standard makes this
code-independent too!
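
To make the levels concrete, here is a minimal Python sketch. It is not the
ISO/IEC 14651 algorithm and uses no collation tables; it only assumes what
the example above implies, namely that level 1 compares base letters, level 2
adds accents, level 3 adds case, and level 4 adds spaces and other special
characters. The names level_key and equivalent are made up for illustration.

```python
import unicodedata


def level_key(text, level):
    """Build a key keeping only the differences significant at `level` (1-4).

    Assumed simplification: 1 = base letters, 2 = + accents,
    3 = + case, 4 = + spaces and other special characters.
    """
    # Decompose so accents become separate combining marks.
    s = unicodedata.normalize("NFD", text)
    if level < 4:
        # Below level 4, drop spaces and punctuation but keep combining marks.
        s = "".join(c for c in s if c.isalnum() or unicodedata.combining(c))
    if level < 3:
        # Below level 3, case is not significant.
        s = s.lower()
    if level < 2:
        # Below level 2, accents are not significant.
        s = "".join(c for c in s if not unicodedata.combining(c))
    # Recompose so precomposed and decomposed spellings yield the same key.
    return unicodedata.normalize("NFC", s)


def equivalent(reference, comparand, level):
    """True if the two strings are equivalent at the given level."""
    return level_key(reference, level) == level_key(comparand, level)


if __name__ == "__main__":
    for comparand in ("labonte", "labonté", "LaBonté", "La Bonté"):
        levels = [n for n in (1, 2, 3, 4) if equivalent("La Bonté", comparand, n)]
        print(comparand, "is equivalent at levels", levels)
```

A single function with a level argument stands in here for the idea of one
comparison operation replacing several locales; a real implementation would
weight characters from a table rather than filter them like this.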

I think that this information could be useful to internationalizers/localizers.

Alain LaBonté
Québec


