Re: Regular expressions in Unicode (Was: Ethiopic text)

From: Hallvard B Furuseth (h.b.furuseth@usit.uio.no)
Date: Thu Mar 12 1998 - 15:07:09 EST


Thank you all for helping me find out what I meant -- it was not what I
*thought* I meant:-)

- A somewhat programmable application which uses Unicode internally,
  and which runs in a (mostly) ASCII or 8-bit environment.

  E.g. a newsreader which supports regular expressions in kill-files,
  or Unicoded versions of sed, perl, or emacs.

- This year, not two decades from now when everything is wonderful and
  Unicodified.

- A user or programmer who knows his single-byte machine and charset,
  and knows and uses regular expressions, but who may not be too aware
  of Unicode.
    
  So I want character ranges to work more or less as they do in the
  8-bit world, yet I want the program to catch most uses of 8-bit
  character ranges which won't work as an 8-bit user expects.

Keld Jørn Simonsen wrote:

> I think there may be two expression specifiers, one that takes
> the range of characters according to the current locale, and then
> one dependent on the character set encoding.

Yes, that sounds simple. And I suspect I'm asking for *both* specifiers
to be different from what the user is used to (probably `-'), and for `-'
to be an error - maybe except with 7-bit ranges. This will be annoying,
but I don't see any other way to catch cases where the program can't
know which specifier the user meant.

I'm worried about a few things. First, regexps are often stored in
strings - which will then be converted to Unicode, and only later
translated to a regular expression. If the "locale" type of character
range is used, the program must know the source charset of the string in
order to know which characters to translate the range to.
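
A minimal Python sketch of why the source charset matters (Python and
the helper name expand_byte_range are assumptions for illustration;
nothing in this thread prescribes them). It expands a range of byte
values into the Unicode characters those bytes stand for in a given
8-bit charset:

```python
def expand_byte_range(first, last, charset):
    """Map each byte value in first..last to its Unicode character in `charset`.

    Hypothetical helper: a real engine would do this while compiling the range.
    """
    chars = set()
    for value in range(first, last + 1):
        try:
            chars.add(bytes([value]).decode(charset))
        except UnicodeDecodeError:
            pass  # this byte value is unassigned in the charset
    return chars

# The same byte range, read under two different source charsets.
latin1 = expand_byte_range(0xC0, 0xC5, "iso-8859-1")  # accented Latin capitals
greek = expand_byte_range(0xC0, 0xC5, "iso-8859-7")   # Greek letters
```

The two sets have no character in common, so a program which has
already lost track of the source charset cannot tell which set the
range was meant to denote.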

Secondly, many people will only test their programs in a Latin-1
environment. Their programs may `accidentally' work because Latin-1
shares character codes with ISO 10646. For example, [\177\200-\377]
catches all non-ASCII characters in a message. That won't work with
non-latin-1 if the "wrong" range type is used.
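
As a rough illustration (a Python sketch; the helper name
non_ascii_hits and the choice of ISO 8859-5 as the second charset are
assumptions, not part of the discussion), here is the same range read
as a range of Unicode code points and applied to text decoded from two
different 8-bit charsets:

```python
import re

# [\177\200-\377] interpreted as code points U+007F and U+0080..U+00FF.
CODEPOINT_RANGE = re.compile("[\x7f\x80-\xff]")

def non_ascii_hits(raw, charset):
    """Decode raw bytes from `charset`, then apply the code-point range."""
    text = raw.decode(charset)
    return CODEPOINT_RANGE.findall(text)

# Byte 0xC5 is a Latin letter in Latin-1 but a Cyrillic letter in ISO 8859-5.
raw = b"A\xc5"
latin1_hits = non_ascii_hits(raw, "iso-8859-1")
cyrillic_hits = non_ascii_hits(raw, "iso-8859-5")
```

With Latin-1 input the non-ASCII byte is caught, because its code
point is still below U+0100. With ISO 8859-5 input the same byte
decodes to a code point in the Cyrillic block and slips through.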

Also, I wonder if latin-1 \305 and Å amount to the same thing in a range
-- in particular regarding when to warn the user about doubtful regexps.
If \305 is converted to Å (as a string) before the regexp engine sees
it, this is not of much help. It would be a real pain to keep the
original spelling of the regexp around, in particular if it is
manipulated (e.g. "[A-Å]" -> "^[A-Å]*") before it is passed to the
regexp engine. Either way, this seems to dig into string parsing as
well.

Jeroen Hellingman wrote:

>> I won't expect programs to give all characters the correct collating
>> sequence in my language -- if nothing else, because a program often
>> can't know which language it is looking at. It only knows the charset.
>> Sometimes it can ask the user about the language, but not always.
>
> I _will_ expect exactly that. Most sensible Unicode applications will
> have to know the language anyway,

Ah, that's why we disagree. Maybe this will be true some day in the
future. But in the coming few years, there seems little hope for this.
For example, when you read an e-mail or news message or WWW page, how
does your reader/browser know which language it was written in? I
haven't seen many language tags in those. In my world, this information
is not available, so I can't expect [A-Å] to give all Norwegian letters.

  And I absolutely do not want programs that "must know" the language.
  (Or the charset for that matter, when they live in an 8-bit box). If
  that information is not available, something or someone would then
  have to guess the missing information, and the guess will sometimes be
  wrong. This is harmful compared to an honest attempt to handle
  `charset/language=unknown'. (I was about to say a lot more, but it
  seems unlikely to be a revolutionary idea in *this* forum:-) BTW,
  could someone point me to the last discussion/conclusions about it?)

> You are right that the user may not know about Unicode; he may even not
> know about any code at all, but think of the letters in the alphabetic
> order he knows it, in which case he will expect that order when he
> gives a range.

That would be nice. However, it doesn't work on any UNIX program I have
tried, so at least 8-bit <-> Unicode translation won't break anything in
this respect. I'm more worried about things that work *almost* as
expected in some cases; such errors can creep in very easily.

Alain LaBonté wrote:

> Although this kind of practice is, if we talk about general-purpose
> application, a very bad programming technique,

Which programming technique exactly? I thought what I have described
worked nicely on single-charset applications, which a lot of us grew up
with. (Or a lot of Western people, anyway.) Then multi-charset
applications and multibyte charsets became popular and broke it.

> as long as there won't be a firm international standard convention
> (unfortunately there is a defacto standard [quite "international"
> among computer specialists] in some programming languages to that
> effect, exactly what Halvard expects),

Well, it's not "standard" exactly. At least here in my little UNIX
world it's just a combination of features:

   Simple and cleanly collated ASCII ranges which work like you'd want,
 + 8-bit fonts -- which came later,
 + 8-bit-clean libraries -- so we could really use our 8-bit fonts,
 = ranges that are properly collated in the 7-bit part and more
   random in the 8-bit part.

So, we wanted to use ranges because they are useful, and we use what
happens to work.

> there is in the couple of standards projects that are ISO/IEC 14651
> and 14652

Where can I find them? I did know once, but...

> 14652 describes the form <character symbol 1>...<character symbol 2> to
> define a coded-character dependent ellipsis (well, what you call a "regular
> expresion",

I don't remember what 14652 says, but I guess this is what I call
"character range". A regular expression (or regexp or regex) describes
a deterministic finite automaton, with a syntax more or less like this:

        /ab?c|foo(-[A-F0-9])*/

which matches

        "a" followed by 0 or 1 "b"s followed by "c", *or*
        "foo" followed by 0 or more "-<hex digit>".

[A-F] and [0-9] are character ranges. [A-F0-9] is - well, a combination
of two ranges, I guess. "character class", maybe.
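
For anyone who wants to try the pattern, a small sketch using Python's
re module (Python is an assumption here; the sample strings are made
up for illustration):

```python
import re

# The pattern from above, written without the surrounding slashes.
pattern = re.compile(r"ab?c|foo(-[A-F0-9])*")

# fullmatch() requires the whole string to fit the pattern.
samples = ["ac", "abc", "foo", "foo-A-9", "abbc", "foo-G"]
results = {s: pattern.fullmatch(s) is not None for s in samples}
```

The first four samples fit the description; "abbc" has one "b" too
many, and "G" in "foo-G" is not a hex digit.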

> thanks to remind me this very ambiguous term which I had forgot
> the meaning

Very true. The above is what regexps are defined to mean (except the
syntax varies somewhat). However, there are common extensions DFAs
can't handle, like backreferences: In /a(b*)c\1/, \1 will match the same
string as the 1st () matched. This matches e.g. abbbcbbba, but not
abbbcba.
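
The backreference example can be checked with a short Python sketch
(Python's re module is my choice for illustration; it is one of the
implementations which support \1):

```python
import re

# \1 must repeat exactly what the first parenthesised group matched.
backref = re.compile(r"a(b*)c\1")

# search() looks for the pattern anywhere in the string.
hit = backref.search("abbbcbbba")   # group 1 is "bbb", repeated after "c"
miss = backref.search("abbbcba")    # only one "b" follows "c", so no match
```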

> "A to Z" does not imply "a to z", does it ?

That may depend on an ignore-case flag which is set outside the regular
expression; several common regexp implementations provide such a flag.
The old Unix implementations I have seen don't, though.

> One should not expect the end-user to know what is under the hood!

Then set the default value of ignore-case to true for the benefit of the
end user.

> And "A to z" leads to no hit in EBCDIC, while "a to Z" will lead to
> no hit in ASCII!

With ignore-case, [A-z] and [a-Z] should both translate to [a-zA-Z].
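
One way to read that translation, as a hedged Python sketch: the
function fold_letter_range is hypothetical, handles only ASCII
letters, and is not taken from any existing regexp implementation.

```python
def fold_letter_range(first, last):
    """Rewrite a letter range so it no longer depends on where the
    charset happens to place upper case relative to lower case.

    Hypothetical helper; both arguments are single ASCII letters.
    """
    for ch in (first, last):
        if not (len(ch) == 1 and ch.isascii() and ch.isalpha()):
            raise ValueError("this sketch handles single ASCII letters only")
    lo, hi = first.lower(), last.lower()
    if lo > hi:
        raise ValueError("range is empty even when case is ignored")
    # Emit the lower-case range followed by the matching upper-case range.
    return "[{0}-{1}{2}-{3}]".format(lo, hi, lo.upper(), hi.upper())

from_upper_first = fold_letter_range("A", "z")
from_lower_first = fold_letter_range("a", "Z")
```

Both calls produce the same class, whatever the relative order of the
two cases in ASCII or EBCDIC.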

-- 
Hallvard


