Re: Matching Unicode strings and combining characters [was: basic question...]

From: Alain LaBonté (
Date: Thu Sep 30 1999 - 11:08:33 EDT

At 05:47 1999-09-30 -0700, Juliusz Chroboczek wrote:

>The notion of observation over ASCII is the one that we've learned to
>live with. I am quite willing to expect that if I search for
>``interaction'', I'll also match the beginning of ``interactions''.
>But in Unicode, the problem is compounded by canonical equivalence and
>by the combining characters. Thus, we end up with the situation that
>it is impossible to match ``voila'' without also matching the
>beginning of ``voilà'' (more formally, there is no observation
>containing all the strings containing ``voila'' without also containing
>at least some representations of ``voilà''). This is not what the
>user expects.
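
The effect described in the quoted text can be reproduced with a naive
code-point search. This is only an illustrative sketch: the helper name
contains_codepoints is made up here, and Python's unicodedata module is
assumed purely for producing the two canonically equivalent spellings.

```python
import unicodedata

def contains_codepoints(text: str, query: str) -> bool:
    """Naive search: compares raw code point sequences and
    ignores canonical equivalence entirely."""
    return query in text

# Two canonically equivalent spellings of the same word.
precomposed = unicodedata.normalize("NFC", "voil\u00e0")  # a-grave as one code point
decomposed = unicodedata.normalize("NFD", "voil\u00e0")   # 'a' followed by U+0300

print(contains_codepoints(precomposed, "voila"))  # False
print(contains_codepoints(decomposed, "voila"))   # True
```

The same query gives different answers on two representations that a user
would consider identical, which is the inconsistency the quoted text objects
to.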

[Alain] Well, it depends. In most cases a French user making a "fuzzy"
search on "voila" will expect to also retrieve "voilà". Apart from
searching degraded, unaccented, or incorrectly accented data alongside
correctly accented data, a better illustration is words whose accents vary
with inflection, as in "révèle" and "révélé", two forms of the verb
"révéler" (present indicative, first and third person singular, and past
participle, masculine singular). You might then prefer to search on
"revele" and retrieve both forms. (Incidentally, with "revele" you also
want to retrieve "RÉVÈLE" and "RÉVÉLÉ", so the limited technique you are
describing won't work for this.)

Altavista allows this. Its simple search rules say that a query with no
accents and only lower case means (in the Altavista context, of course)
that you want every occurrence of the word with the same letters: accented,
capitalized, and plain unaccented lower-case forms alike. A precise
Altavista search uses exact accented characters and exact case: "CLÉ" would
retrieve "CLÉ" but not the beginning of "clef".
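
The rule just described can be sketched roughly as follows. This is not
Altavista's actual implementation, only my reading of the rule as stated
above; the function names (strip_accents, is_plain_lowercase, find_matches)
are hypothetical, prefix matching on whole words is an assumption, and
"accent" is approximated by "Unicode combining mark after decomposition".

```python
import unicodedata

def strip_accents(s: str) -> str:
    """Drop combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def is_plain_lowercase(query: str) -> bool:
    """True when the query has no accents and no capitals."""
    return query == strip_accents(query) and query == query.lower()

def find_matches(query: str, words: list[str]) -> list[str]:
    """Fuzzy match for plain lower-case queries, exact match otherwise."""
    if is_plain_lowercase(query):
        # Fuzzy: ignore accents and case in the data.
        return [w for w in words
                if strip_accents(w).lower().startswith(query)]
    # Precise: exact accents and exact case; normalizing both sides to NFC
    # only makes the comparison independent of the representation.
    q = unicodedata.normalize("NFC", query)
    return [w for w in words
            if unicodedata.normalize("NFC", w).startswith(q)]

words = ["révèle", "révélé", "RÉVÈLE", "RÉVÉLÉ", "revele", "relevé"]
print(find_matches("revele", words))  # every form except "relevé"
print(find_matches("CLÉ", ["CLÉ", "clé", "clef", "CLE"]))  # ['CLÉ']
```

Note that the fuzzy branch never looks at how the accents are encoded; it
works on the letters the user typed.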

More elaborate schemes could be devised, depending on user requirements, to
meet the more refined expectations of French speakers.

This strongly suggests that the user requirement is orthogonal to coding.

ISO/IEC 14651 (ISO string ordering), the Unicode collation algorithm, and
CAN/CSA Z243.4.1 (all of them sort/comparison standards) deal with this
implicitly, and a solution that is meant to cater for actual user
requirements should not concern itself too much with coding.
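
To give an idea of what comparing independently of coding means, here is a
toy sketch of a multi-level comparison (letters first, then accents, then
case). It is not the algorithm of any of the standards named above: it uses
no weight tables and no language-specific tailoring, and the names
level_keys and equal_at are invented for the illustration.

```python
import unicodedata

def level_keys(s: str):
    """Split a string into three comparison levels:
    1. base letters, 2. accents per letter, 3. case per letter."""
    nfd = unicodedata.normalize("NFD", s)
    base = []
    accents = []
    for ch in nfd:
        if unicodedata.combining(ch):
            if accents:                 # attach the mark to the preceding letter
                accents[-1] += ch
        else:
            base.append(ch)
            accents.append("")
    primary = "".join(base).lower()
    secondary = tuple(accents)
    tertiary = tuple(ch.isupper() for ch in base)
    return primary, secondary, tertiary

def equal_at(a: str, b: str, strength: int) -> bool:
    """Compare only the first `strength` levels (1, 2 or 3)."""
    return level_keys(a)[:strength] == level_keys(b)[:strength]

print(equal_at("revele", "RÉVÈLE", 1))   # True: same letters
print(equal_at("révèle", "révélé", 2))   # False: accents differ
print(equal_at("révèle", "RÉVÈLE", 2))   # True: only case differs
print(equal_at("révèle", "RÉVÈLE", 3))   # False: case is compared
```

A fuzzy search then amounts to comparing at level 1 and a precise search to
comparing at level 3, whichever way the accented letters happen to be coded.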

>This problem does not occur if the combining characters
>are placed before, rather than after, the base character.

[Alain] It does occur... See above about capitalized forms.

Alain LaBonté
