Re: Unicode Search Engines

From: Stefan Probst (stefan.probst@opticom.v-nam.net)
Date: Wed Jan 30 2002 - 04:31:18 EST


Hello Doug,

judging from how well you understood the issue (including your case 5),
one could think you were Vietnamese ;)

It is exactly the "dot below" which causes the most problems, since its
combining class (220) is lower than that of some of the modifiers (230).
And unfortunately the other tonal marks have the same combining class as
those modifiers (230), and therefore their relative order seems not to be
specified at all!

To have the information together:

The modifiers, which change the base character to form a new character:
  breve        U+0306   combining class: 230
  circumflex   U+0302   combining class: 230
  horn         U+031B   combining class: 216

The tonal marks, which have only a very loose connection with the character
(i.e. in handwriting they are often even placed above two adjacent vowels):
  grave        U+0300   combining class: 230
  hook above   U+0309   combining class: 230
  tilde        U+0303   combining class: 230
  acute        U+0301   combining class: 230
  dot below    U+0323   combining class: 220
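
The classes listed above can be checked against the Unicode Character
Database. The following is only a minimal sketch using Python's standard
unicodedata module; the MARKS table and the combining_classes() helper are
names made up for this illustration.

```python
import unicodedata

# The eight Vietnamese combining marks listed above (names are informal).
MARKS = {
    "breve": "\u0306",
    "circumflex": "\u0302",
    "horn": "\u031B",
    "grave": "\u0300",
    "hook above": "\u0309",
    "tilde": "\u0303",
    "acute": "\u0301",
    "dot below": "\u0323",
}


def combining_classes():
    """Return {informal name: canonical combining class} for each mark."""
    return {name: unicodedata.combining(ch) for name, ch in MARKS.items()}


for name, ccc in combining_classes().items():
    print(f"{name:<11} U+{ord(MARKS[name]):04X}  combining class: {ccc}")
```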

I have already made test pages, e.g. the one at
http://www.isoc-vn.org/www/standard/normalizationtest13.html

The issue goes even a bit further:

(1) Sorting
It is said that in sorting, all combining marks should be disregarded.
While in Vietnamese this is OK for the (combining) tone marks, it is
absolutely not OK for the (combining) modifiers. In Vietnamese, e.g. an "a"
with "circumflex" is a completely different character from an "a" alone.
This is why some circles in Vietnam prefer what I call "VN-combined": base
character and modifier pre-composed, tone mark combining.

(2) Converting
Inside Vietnam, there were in the past mainly two different encodings in use:
- "TCVN-ABC": Fully pre-composed, but with a separate font for some upper case
characters
- "VNI": Mainly using combining characters
When converting old documents (office and web) to Unicode, the question
will be whether the tools do any normalization (especially in the case of
VNI), or merely re-map [combining] character by [combining] character.
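
To make the "VN-combined" idea and the sorting point concrete, here is a
rough sketch in Python. It is an assumption-laden illustration, not an
existing tool: to_vn_combined(), sort_key() and VN_ALPHABET are hypothetical
names, the alphabet order is the usual 29-letter Vietnamese alphabet, and the
ordering of tones among otherwise equal words is ignored entirely. Note that
"VN-combined" is not one of the Unicode normalization forms, so applying NFC
afterwards would recompose the tone mark again.

```python
import unicodedata

# Tone marks: grave, hook above, tilde, acute, dot below.
TONES = {"\u0300", "\u0309", "\u0303", "\u0301", "\u0323"}

# Assumed alphabet order for this sketch; NFC makes the literals precomposed.
VN_ALPHABET = unicodedata.normalize("NFC", "aăâbcdđeêghiklmnoôơpqrstuưvxy")
RANK = {ch: i for i, ch in enumerate(VN_ALPHABET)}


def to_vn_combined(text):
    """Base + modifier precomposed, tone marks left as combining marks."""
    out = []
    base, tones = "", ""
    for ch in unicodedata.normalize("NFD", text):
        if ch in TONES:
            tones += ch                  # keep the tone mark separate
        elif unicodedata.combining(ch):
            base += ch                   # modifier stays with its base
        else:
            # New base character: emit the previous cluster first.
            out.append(unicodedata.normalize("NFC", base) + tones)
            base, tones = ch, ""
    out.append(unicodedata.normalize("NFC", base) + tones)
    return "".join(out)


def sort_key(word):
    """Ignore tone marks, but treat modified letters as separate letters."""
    letters = [c for c in to_vn_combined(word.lower()) if c not in TONES]
    # Characters outside the alphabet sort after it, by code point.
    return [RANK.get(c, len(VN_ALPHABET) + ord(c)) for c in letters]


words = ["\u00E2n", "b\u00E0", "an", "\u0103n"]   # ân, bà, an, ăn
print(sorted(words, key=sort_key))
```

With this key, "a" with circumflex sorts as its own letter after "a" with
breve, while the grave on "bà" has no influence on the order.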

And to make things worse, it seems that MS prefers the combining way,
saying that their sorting, spell check, word wrap etc. work that way....

Vietnam plans to make Unicode compulsory for state offices by the middle of
2002. I have been asked to advise, and volunteered to take care mainly of
Internet issues.

Right now, in Vietnam they are still discussing whether they should
require a specific normalization and, if so, which one of the four possible
candidates.

According to W3C's draft at http://www.w3.org/TR/charmod/#sec-Normalization
it seems that all Web applications (and that might include search
engines?) should reject (to be precise: MUST NOT handle) everything which
is not NFC. This could mean that search engines MUST NOT index pages which
are not in NFC and must reject queries which are not in NFC. If they do:
fine. If not: then we probably have quite some problems...
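
What a more forgiving search engine could do instead of rejecting such input
is to normalize both the indexed text and the query to NFC. The sketch below
only illustrates that idea in Python; build_index(), search() and the sample
pages are invented for the example and do not describe how any real search
engine works.

```python
import unicodedata


def nfc(text):
    return unicodedata.normalize("NFC", text)


def is_nfc(text):
    """True if the text is already in NFC."""
    return nfc(text) == text


def build_index(pages):
    """Map page id -> NFC-normalized page text."""
    return {pid: nfc(body) for pid, body in pages.items()}


def search(index, query):
    """Return the ids of all pages containing the NFC-normalized query."""
    q = nfc(query)
    return sorted(pid for pid, body in index.items() if q in body)


# The same word written in three different ways.
pages = {
    "precomposed": "Vi\u1EC7t Nam",         # e with circumflex and dot below
    "vn-combined": "Vi\u00EA\u0323t Nam",   # e-circumflex + combining dot below
    "decomposed": "Vie\u0302\u0323t Nam",   # e + circumflex + dot below
}

index = build_index(pages)
# The query itself is typed in yet another (NFD) spelling.
print(search(index, "Vie\u0323\u0302t"))
```

Without the two nfc() calls, the query would match none of the three pages,
since no page contains exactly that code point sequence.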

And since we are already on the subject of Vietnamese.... (to round things
off): I am not sure how, e.g. in the introduction to dictionaries or
Vietnamese language books, the tonal marks can be printed "alone". One
solution might be to combine them with a "space", but at present this does
not always work. And only some of the tonal marks seem to have a
"stand-alone version", e.g. U+02CB for the "grave".

Best Regards,
Stefan

At 01:29 30.01.2002 -0500, DougEwell2@cs.com wrote:
-------------------------
>In a message dated 2002-01-28 7:37:48 Pacific Standard Time,
>stefan.probst@opticom.v-nam.net writes:
>
> > I would like to add:
> > How do they handle normalization?
> > In Vietnam, many characters can be represented in several different ways:
> > (1) fully precomposed (NFC)
> > (2) base character and modifier precomposed, tonal mark combining
> > (3) base character, then modifier, then tonal mark
> > (4) like (3), but modifier and tonal mark sorted (NFD)
> > Do the search engines do any normalization, before indexing a page?
> > Are queries normalized before running the search?
>
>I'm not sure what sort of normalization might be performed by search engines,
>but I want to examine the Vietnamese decomposition aspect for a moment.
>
>If you have a Vietnamese vowel with both modifier and tone mark, say LATIN
>CAPITAL LETTER A WITH CIRCUMFLEX AND ACUTE, then you can represent this in
>Unicode in at least three ways:
>
>(1) fully precomposed (NFC) -- that is, U+1EA4
>(2) base character and modifier precomposed, tonal mark combining -- that is,
>U+00C2 U+0301
>(3) base character, then modifier, then tonal mark -- that is, U+0041 U+0302
>U+0301
>
>So far, so good. But then we have:
>
> > (4) like (3), but modifier and tonal mark sorted (NFD)
>
>If "sorting" the diacritical marks in NFD results in rearranging the two
>diacritical marks -- in this case, U+0041 U+0301 U+0302 -- then in terms of
>Vietnamese orthography, the NFD form may not really be a legitimate way of
>representing the Vietnamese letter.
>
>For example, U+1EAC LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND DOT BELOW is,
>in Vietnamese, a circumflexed A to which a tone mark (dot below) has been
>added. It is not a dotted-below A to which a circumflex has been added. Yet
>because of the canonical combining classes of the two diacriticals (230 for
>COMBINING CIRCUMFLEX ACCENT, 220 for COMBINING DOT BELOW), the latter is how
>the character will be decomposed.
>
>In theory, there is actually a case 5: base character and tonal mark
>precomposed, modifier combining. In terms of Vietnamese orthography, this is
>just as illegitimate as case 4 (NFD), but most software that processes
>Vietnamese text will probably never encounter it. But it will have to handle
>the NFD case.
>
>If I were on some other mailing lists I could think of, I would claim that
>this is a fatal flaw in the design of Unicode Normalization Form D. It's
>not, but it is a sticky problem that needs to be dealt with when dealing with
>Vietnamese text.
>
>-Doug Ewell
> Fullerton, California
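
The representations discussed in the quoted message can be compared directly.
The following Python sketch (codepoints(), nfd() and nfc() are helper names
chosen for this illustration) prints the decompositions: for marks of equal
combining class, such as circumflex and acute, normalization keeps the order
in which they were written, while the dot below (220) is moved in front of
the circumflex (230).

```python
import unicodedata


def codepoints(s):
    return " ".join(f"U+{ord(c):04X}" for c in s)


def nfd(s):
    return unicodedata.normalize("NFD", s)


def nfc(s):
    return unicodedata.normalize("NFC", s)


# Cases (1) to (3) for LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND ACUTE.
A_CIRCUMFLEX_ACUTE = ["\u1EA4", "\u00C2\u0301", "A\u0302\u0301"]

for form in A_CIRCUMFLEX_ACUTE:
    print(f"{codepoints(form):<22} NFD: {codepoints(nfd(form))}"
          f"   NFC: {codepoints(nfc(form))}")

# Circumflex and dot below: NFD puts the dot below (class 220) first.
print(codepoints(nfd("\u1EAC")))              # U+0041 U+0323 U+0302

# Acute written before circumflex is not canonically equivalent to U+1EA4.
print(codepoints(nfc("A\u0301\u0302")))       # U+00C1 U+0302
```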


