Re: FDAM 18 to 10646-1: the real character searching issue

From: Alain LaBonté (alb@sct.gouv.qc.ca)
Date: Wed Mar 31 1999 - 10:49:26 EST


At 05:59 99-03-31 -0500, Johan van Wingen wrote:
>Dear Colleagues
>
>This is my last message from this E-mail address.
>The question left is really serious.
>
>FDAM 18 to 10646-1 is now under vote, ending 4 May 1999. It contains
>four letters "with comma below". The Netherlands NB has objected to
>inclusion of these before, and will continue to do so.
>We have investigated the consequences of having these characters in the
>standard, and concluded that they may cause considerable confusion with
>users. There is a warning against inconsiderate use of them in Annex P,
>but that could be overlooked easily. It also is unclear.
>
>1. Romanian or Turkish users will find in their newspapers often on the
>same page letters with comma below or with cedilla, depending on the
>font applied at a place. They will be in doubt whether to code these
>equally or differently.
>2. Users may code by mistake a letter with cedilla as one with comma
>below, in particular where a letter looks like that. This will cause
>problems in administrative systems where identical names are not being
>recognized as such. Our investigations showed that this effect occurs
>indeed in our systems with letters having much likeness to others.
>
>Thus I urge you, if you are in the position to influence your national
>position, to propose a NO vote on this FDAM. Standards are meant for
>removing problems, not for creating new ones.
>
>Best regards from J. W. van Wingen
>E-mail: PRECAL@rulmvs.LeidenUniv.nl
>P.O.Box 486, NL-2300 AL Leiden, Netherlands, phone +31 71 5 14 37 39

[Alain]
Point 2 proves one thing: searches and sorts should not be run blindly on
"non-normalized" strings (I use the quoted term "normalized" here not in
the Unicode sense, but I cannot find a better word).

My name, "La Bonté", written with one or more spaces (you will understand
why I omit the space most of the time -- in fact this intentional mistake
has become my trademark (; ), is never retrieved by dumb administrative
systems. This caused me innumerable problems in real life too, and the
problem is not related to the misuse of characters (in fact it is solved by
misuse, unfortunately!). In the same way, when the acute accent is omitted,
it creates trouble too, but to a lesser extent, although it also creates
"innumerable÷2" problems, let's say. The use or non-use of case also
creates "innumerable÷3" problems, but it creates problems too. (: (;

In the Québec government, we have tools to retrieve things based on the
"equivalence at level 1", to use the terminology of ISO/IEC 14651 (the
ordering standard), a standard that in practice converges with the Unicode
ordering algorithm.

Altavista search technology would also, given an "s cedilla" or an "s comma
below", store both letters under two indices: one as is, for exact matches,
and one unaccented and uncased, for matches such as the administrative
requirement Johan is talking about. This explains why, as an indexing
engine, this technology is superior, in my humble opinion.
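
A minimal sketch of the two-index idea described above, in Python. It is not
Altavista's actual implementation; the names loose_key and TwoLevelIndex are
hypothetical, and it assumes that stripping combining marks after canonical
decomposition is an acceptable stand-in for "unaccented, uncased":

```python
import unicodedata
from collections import defaultdict


def loose_key(text):
    """Build an unaccented, uncased key (an assumed approximation of a level-1 match)."""
    # Canonical decomposition splits e.g. U+0219 into "s" + U+0326 (comma below)
    # and U+015F into "s" + U+0327 (cedilla).
    decomposed = unicodedata.normalize("NFD", text)
    # Drop every combining mark, so both variants reduce to a plain "s".
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return base.casefold()


class TwoLevelIndex:
    """Hypothetical index keeping one exact table and one loose table."""

    def __init__(self):
        self.exact = defaultdict(set)
        self.loose = defaultdict(set)

    def add(self, doc_id, term):
        # The term is stored twice: as is, and under its loose key.
        self.exact[term].add(doc_id)
        self.loose[loose_key(term)].add(doc_id)

    def find_exact(self, term):
        return self.exact.get(term, set())

    def find_loose(self, term):
        return self.loose.get(loose_key(term), set())
```

With such an index, a name stored with S COMMA BELOW and the same name stored
with S CEDILLA stay distinct for exact lookups, yet both are returned by a
loose lookup, which is the behaviour an administrative system would need.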

So point 2 should not, imho, matter. In fact the glitch Johan signals could
help improve systems which are otherwise flawed for more profound reasons.

Searching also has to be flexible, and allow for direct matches, imprecise
matches, with or without normal spaces considered significant, idem for
case and "special character" (conditionally ignorable) processing, and so
on. At a higher level other requirements, less fundamental but useful,
could also be used (fuzzy matches, matches in other languages, and so on).
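
The flexibility described above can be sketched as a comparison whose
strictness is chosen per search. This is only an illustration in Python; the
function names and option names (match_key, names_match, ignore_accents,
ignore_case, ignore_spaces) are hypothetical, and it does not implement the
ISO/IEC 14651 levels themselves:

```python
import unicodedata


def match_key(text, ignore_accents=True, ignore_case=True, ignore_spaces=True):
    """Reduce a string to a comparison key according to the chosen options."""
    if ignore_accents:
        # Decompose, then drop combining marks (acute, cedilla, comma below...).
        decomposed = unicodedata.normalize("NFD", text)
        text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    else:
        # Still compose, so precomposed and decomposed spellings compare equal.
        text = unicodedata.normalize("NFC", text)
    if ignore_case:
        text = text.casefold()
    if ignore_spaces:
        # Remove all whitespace, so "La Bonté" and "LaBonté" coincide.
        text = "".join(text.split())
    return text


def names_match(a, b, **options):
    """Compare two strings under the same matching options."""
    return match_key(a, **options) == match_key(b, **options)
```

Under the default options, "La Bonté", "LaBonté" and "LABONTE" all match;
calling names_match with ignore_accents=False makes the accent significant
again, which corresponds to asking for a more exact match.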

That said, if young children are taught in Romania that an S COMMA is an
S COMMA (and that is actually the case, a profound reason for having this
letter coded for what it is, with also a name of its own), it is
unreasonable that this character cannot be considered different from S
CEDILLA, even if fonts in actual newspapers in Turkey and in Romania can
show (because of the supplied technology or for other reasons) both
characters, indifferently in appearance.

I hope this helps. These were my two cents. I urge people to vote in favour
of amendment 18 to ISO/IEC 10646 for the reasons I have given.

Sorry, Johan, we cannot agree on everything, even if I must admit that the
fact you report concerning newspapers in Turkey and in Romania is true. But
this should not be the only consideration. There are more fundamental ones.

Alain LaBonté
Québec



This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:45 EDT