Re: Unicode Search Engines

From: DougEwell2@cs.com
Date: Wed Jan 30 2002 - 01:51:20 EST


In a message dated 2002-01-28 7:37:48 Pacific Standard Time,
stefan.probst@opticom.v-nam.net writes:

> I would like to add:
> How do they handle normalization?
> In Vietnam, many characters can be represented in several different ways:
> (1) fully precomposed (NFC)
> (2) base character and modifier precomposed, tonal mark combining
> (3) base character, then modifier, then tonal mark
> (4) like (3), but modifier and tonal mark sorted (NFD)
> Do the search engines do any normalization, before indexing a page?
> Are queries normalized before running the search?

I'm not sure what sort of normalization might be performed by search engines,
but I want to examine the Vietnamese decomposition aspect for a moment.

If you have a Vietnamese vowel with both modifier and tone mark, say LATIN
CAPITAL LETTER A WITH CIRCUMFLEX AND ACUTE, then you can represent this in
Unicode in at least three ways:

(1) fully precomposed (NFC) -- that is, U+1EA4
(2) base character and modifier precomposed, tonal mark combining -- that is,
U+00C2 U+0301
(3) base character, then modifier, then tonal mark -- that is, U+0041 U+0302
U+0301
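
The three spellings above can be checked with Python's standard unicodedata module. This is only a sketch; the variable names and the codepoints() helper are made up for illustration:

```python
import unicodedata

# Three spellings of LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND ACUTE
precomposed = "\u1EA4"             # (1) fully precomposed
partial = "\u00C2\u0301"           # (2) A-circumflex + combining acute
decomposed = "\u0041\u0302\u0301"  # (3) A + combining circumflex + combining acute

forms = [precomposed, partial, decomposed]

def codepoints(s):
    """Hypothetical helper: render a string as U+XXXX notation."""
    return " ".join(f"U+{ord(c):04X}" for c in s)

# The three spellings are canonically equivalent, so each normalization
# form collapses them to a single string.
nfc_results = {unicodedata.normalize("NFC", s) for s in forms}
nfd_results = {unicodedata.normalize("NFD", s) for s in forms}

print("NFC:", [codepoints(s) for s in nfc_results])
print("NFD:", [codepoints(s) for s in nfd_results])
```

Each set should end up with exactly one member, which is the sense in which the three spellings are "the same" character.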

So far, so good. But then we have:

> (4) like (3), but modifier and tonal mark sorted (NFD)

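If "sorting" the diacritical marks in NFD results in rearranging the two
diacritical marks, then in terms of Vietnamese orthography, the NFD form may
not really be a legitimate way of representing the Vietnamese letter. That
does not actually happen for the circumflex-and-acute example above: both
marks have canonical combining class 230, so NFD leaves them in the order
U+0041 U+0302 U+0301. It does happen when the tone mark is the dot below.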

For example, U+1EAC LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND DOT BELOW is,
in Vietnamese, a circumflexed A to which a tone mark (dot below) has been
added. It is not a dotted-below A to which a circumflex has been added. Yet
because of the canonical combining classes of the two diacriticals (230 for
COMBINING CIRCUMFLEX ACCENT, 220 for COMBINING DOT BELOW), the latter is how
the character will be decomposed.
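
The reordering described above can be observed directly. A minimal sketch using Python's standard unicodedata module (the constant and variable names are illustrative only):

```python
import unicodedata

CIRCUMFLEX = "\u0302"  # COMBINING CIRCUMFLEX ACCENT
DOT_BELOW = "\u0323"   # COMBINING DOT BELOW

# Canonical combining classes of the two marks
ccc_circumflex = unicodedata.combining(CIRCUMFLEX)
ccc_dot_below = unicodedata.combining(DOT_BELOW)
print(ccc_circumflex, ccc_dot_below)

# NFD of U+1EAC: marks come out sorted by ascending combining class,
# so the dot below (tone mark) precedes the circumflex (modifier).
nfd_form = unicodedata.normalize("NFD", "\u1EAC")
print([f"U+{ord(c):04X}" for c in nfd_form])

# A sequence in orthographic order (modifier first, then tone mark)
# is reordered by NFD to the same string.
orthographic = "A" + CIRCUMFLEX + DOT_BELOW
reordered = unicodedata.normalize("NFD", orthographic)
print(reordered == nfd_form)
```

The second print shows the tone mark ahead of the modifier, the reverse of how the letter is built up in Vietnamese.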

In theory, there is actually a case 5: base character and tonal mark
precomposed, modifier combining. In terms of Vietnamese orthography, this is
just as illegitimate as case 4 (NFD), but most software that processes
Vietnamese text will probably never encounter it. The NFD case, on the other
hand, it will have to handle.
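
As a rough illustration of case 5, the sketch below (Python, standard unicodedata module; the nfc() helper and variable names are hypothetical) composes a tone-precomposed base with a combining circumflex. One hedged observation: whether such a sequence is even canonically equivalent to the fully precomposed letter appears to depend on the tone mark's combining class.

```python
import unicodedata

def nfc(s):
    """Hypothetical shorthand for NFC normalization."""
    return unicodedata.normalize("NFC", s)

# Case 5 with dot below: A WITH DOT BELOW + combining circumflex.
# The marks have different combining classes (220 vs 230), so they can
# be reordered and this composes to U+1EAC.
case5_dot = "\u1EA0\u0302"
dot_matches = nfc(case5_dot) == "\u1EAC"

# Case 5 with acute: A WITH ACUTE + combining circumflex.
# Both marks have class 230, so their order is significant and this
# sequence is not equivalent to U+1EA4.
case5_acute = "\u00C1\u0302"
acute_matches = nfc(case5_acute) == "\u1EA4"

print(dot_matches, acute_matches)
```

So normalization alone would fold the dot-below variant of case 5 into the precomposed letter, but not the acute variant.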

If I were on some other mailing lists I could think of, I would claim that
this is a fatal flaw in the design of Unicode Normalization Form D. It's
not, but it is a sticky problem that has to be handled when processing
Vietnamese text.

-Doug Ewell
 Fullerton, California


