RE: FW: Matching Unicode strings and combining characters [was: basic

From: Marco.Cimarosti@icl.com
Date: Thu Sep 30 1999 - 14:16:24 EDT


Kevin was faster than John by a few seconds, so he "wins" my answer :-)

Unicode states that "logiñ" and "login~" are "canonically equivalent", that
is, they mean the same thing and should look exactly the same. Unicode also
says that this equivalence is so strong that you are allowed to convert all
"ñ"s to "n~"s or, vice versa, all "n~"s to "ñ"s, provided that you know what
you are doing.
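
In what Unicode normalization now calls NFD (decompose) and NFC (compose), the two conversions can be sketched as below. This is only an illustration, assuming Python's standard unicodedata module and writing the "~" as U+0303 COMBINING TILDE; the variable names are mine.

```python
import unicodedata

precomposed = "logi\u00f1"     # "logiñ": ñ as the single code point U+00F1
decomposed = "login\u0303"     # "login~": n followed by U+0303 COMBINING TILDE

# Converting all "ñ"s to "n~"s is canonical decomposition (NFD) ...
to_decomposed = unicodedata.normalize("NFD", precomposed)
# ... and converting all "n~"s to "ñ"s is canonical composition (NFC).
to_precomposed = unicodedata.normalize("NFC", decomposed)

print(to_decomposed == decomposed)
print(to_precomposed == precomposed)
```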

What our compliant Display module would do with the two strings is to
display them both in the same way. This implies that the *human* operator
will see the same thing and act the same way. (S)he was waiting for the word
"login", and there it is (well, with that funny ~ floating on it)! So far so
good.

Now the problem is with the scripting part. The user can see the glyphs, and
"ñ" and "n~" look the same. But the script only sees strings of 16-bit
values, and these look different. And if the script, just like the human, is
waiting for "login", it would do two different things depending on how an
identical-looking-and-meaning text was spelled. Too bad!
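
To make the script's point of view concrete, here is a small sketch of the 16-bit values behind the two spellings. It assumes UTF-16 code units and U+0303 as the combining tilde; utf16_units is a helper made up for this example.

```python
import struct

def utf16_units(text):
    """Return the 16-bit code units of `text` (UTF-16, no byte order mark)."""
    raw = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(raw) // 2), raw))

precomposed = "logi\u00f1"     # "logiñ"
decomposed = "login\u0303"     # "login~"

units_a = utf16_units(precomposed)
units_b = utf16_units(decomposed)
print([hex(u) for u in units_a])
print([hex(u) for u in units_b])

# A script comparing raw values sees two different strings, and a wait for
# "login" fires on one spelling only.
print(precomposed == decomposed)
print(precomposed.startswith("login"), decomposed.startswith("login"))
```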

There are, however, not one but two solutions:

#1
You state that your scripting facility is not intended to handle text. So
you pretend that, when the script looks for a string like "login", it is
actually handling a "string of binary 16-bit words with undefined meaning".
This is a nice political trick to say that you don't care at all what
Unicode says, because that piece of data that you are handling is *not*
Unicode text for you.

#2
You don't want to play tricks on your users, so you go for the *real*
solution, that Unicode *does* provide. The honest solution is
*Decomposition*. The Scripting module should normalize incoming text by
*decomposing* all precomposed characters that have to be processed by the
current script. If this is done, the "logiñ" string would become "login~",
so the program would behave the same way in the two cases (the wrong way,
maybe, but at least consistent).
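
A rough sketch of what that could look like, assuming Python's unicodedata module; the triggers function is a hypothetical stand-in for whatever "wait for this string" primitive the scripting module offers.

```python
import unicodedata

def triggers(expected, incoming):
    """Hypothetical script primitive: does `incoming` contain `expected`?

    Both sides are decomposed (NFD) first, so the precomposed and the
    combining spelling of the same text are treated alike.
    """
    return (unicodedata.normalize("NFD", expected)
            in unicodedata.normalize("NFD", incoming))

# Without decomposition the two spellings behave differently ...
print("login" in "logi\u00f1", "login" in "login\u0303")
# ... with it, both trigger (wrongly, maybe, but at least consistently).
print(triggers("login", "logi\u00f1"), triggers("login", "login\u0303"))
```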

To implement solution #1, you just need to change the documentation, not the
program...

About solution #2, I see two nice ways to implement it, and a bad one:

#2.a) The Explicit way: the scripting language has a built in function to
decompose a string (why not call it wcsxfrm()?) that the script
programmer may explicitly call in his compare statements.

#2.b) The Implicit way: the built in functions and/or operators to compare
strings automatically decompose their operands before the comparison. It
could be wise to allow the programmer to disable this feature, or even to
extend it (e.g. a case-insensitive option).

#2.c) The Bastard way: the text fed to the scripting language is *already*
decomposed. This is bad because it could fix a problem and cause another,
when the incoming binary data is actually not Unicode.
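
The two nice ways (#2.a and #2.b) might look roughly like this. It is a sketch only, assuming Python's unicodedata module; decompose, equal and the normalize/ignore_case options are names invented here, and the case-insensitive branch is a simplification of proper caseless matching.

```python
import unicodedata

def decompose(text):
    """#2.a: explicit helper that the script programmer calls by hand."""
    return unicodedata.normalize("NFD", text)

def equal(a, b, normalize=True, ignore_case=False):
    """#2.b: comparison that decomposes its operands unless told not to."""
    if ignore_case:
        # The optional extension: fold case before comparing.
        a, b = a.casefold(), b.casefold()
    if normalize:
        a, b = decompose(a), decompose(b)
    return a == b

# Explicit way: the programmer writes the call in the compare statement.
print(decompose("logi\u00f1") == decompose("login\u0303"))
# Implicit way: the comparison does it, and can be switched off or extended.
print(equal("logi\u00f1", "login\u0303"))
print(equal("logi\u00f1", "login\u0303", normalize=False))
print(equal("LOGI\u00d1", "login\u0303", ignore_case=True))
```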

Notice that the same result *cannot* be obtained with a *precomposed*
normalization. In fact, when the software receives "ñ", it can split it into
"n" + "~" in no time. Vice versa, if it receives an "n" it cannot wait
indefinitely to see if a "~" comes, so that it can generate an "ñ".
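
The asymmetry shows up as soon as the text is treated as a stream. The sketch below assumes Python's unicodedata module; both generator names are made up, and it ignores the canonical reordering of several combining marks on one base, which a full normalizer would also have to handle.

```python
import unicodedata

def decompose_stream(chars):
    """Emit decomposed output as soon as each input character arrives."""
    for ch in chars:
        # Everything needed to split `ch` is contained in `ch` itself.
        yield from unicodedata.normalize("NFD", ch)

def compose_stream(chars):
    """Emit composed output; has to hold text back until more input arrives."""
    pending = ""
    for ch in chars:
        if pending and unicodedata.combining(ch) == 0:
            # Only a new base character proves the pending cluster is complete.
            yield from unicodedata.normalize("NFC", pending)
            pending = ""
        pending += ch
    # The last cluster is released only at end of input; on a live line
    # that means waiting, possibly forever.
    if pending:
        yield from unicodedata.normalize("NFC", pending)

print("".join(decompose_stream("logi\u00f1")) == "login\u0303")
print("".join(compose_stream("login\u0303")) == "logi\u00f1")
```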

Also notice that if the Unicode Consortium went crazy and decided to put
combining characters before base characters just to please TTY developers,
that would not help at all. In fact, if the application receives a "~" it
cannot wait indefinitely to see if an "n" comes, so that it can generate an
"ñ"...

The only thing that could actually screw up this approach is if *new*
precomposed characters are added to Unicode. In this case, your program
doesn't know that the new character is equivalent to a sequence of old
characters, so it cannot decompose it.
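
With hindsight this can be reproduced, as far as I can tell, with Python's unicodedata module, which keeps a frozen copy of the Unicode 3.2 tables next to the current ones. The character picked here, U+1B06 BALINESE LETTER AKARA TEDUNG, is my own choice of example: it was added in Unicode 5.0 with a canonical decomposition to U+1B05 U+1B35.

```python
import unicodedata

newer = "\u1b06"                      # precomposed character added in Unicode 5.0
old_tables = unicodedata.ucd_3_2_0    # frozen Unicode 3.2 database

# A program built on the old tables does not know the character at all,
# so it passes it through undecomposed.
old_result = old_tables.normalize("NFD", newer)
# A program built on current tables splits it into the equivalent sequence.
new_result = unicodedata.normalize("NFD", newer)

print(unicodedata.unidata_version)
print(old_result == newer)
print(new_result == "\u1b05\u1b35")
```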

But, again, this is a very general problem! No one (apart from my bosses :-) can
ask a programmer to write an application that is compatible with *future*
versions of anything.

OK. However, I am not a member of Unicode and I have no interest at all in
defending Unicode's choices, so I hereby change my mind and join your
party...

*** Hey you, Unicode, put those bloody diacritics in front of the letters,
now! Or we'll do a net strike! ***

Regards :-)
        Marco

> -----Original Message-----
> From: Kevin Bracey [SMTP:kevin.bracey@pacemicro.com]
> Sent: 1999 September 30, Thursday 17.44
> To: Unicode List
> Subject: Re: FW: Matching Unicode strings and combining characters
> [was: basic
>
> In message <199909301519.IAA28612@unicode.org>
> Marco.Cimarosti@icl.com wrote:
>
> > What is the NEW problem brought by unicode or combining characters?
> >
> > Somebody says: if my application is waiting for "login", it will not
> > trigger if it receives "logiñ" (where ñ is a precomposed) but it would
> > trigger with "login~", (where ~ is a combining mark). That is true, so
> > what!?
>
> The problem is that Unicode states that those two strings are canonically
> equivalent - treating them differently potentially leads to a whole new
> can of worms.
>
> --
> Kevin Bracey, Senior Software Engineer
> Pace Micro Technology plc Tel: +44 (0) 1223 518566
> 645 Newmarket Road Fax: +44 (0) 1223 518526
> Cambridge, CB5 8PB, United Kingdom WWW: http://www.acorn.co.uk/


