Re: Unicode Search Engines

From: Mark Davis (mark@macchiato.com)
Date: Wed Jan 30 2002 - 10:30:06 EST


It is not a 'fatal flaw'. NFD makes no pretensions to represent the
most 'natural' ordering for any given language. Out of all the
possible canonically equivalent sequences, it is simply a specific,
well-defined, unique representation that is fully decomposed.
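
As a rough illustration of "unique representation", here is a minimal
sketch that assumes Python's standard unicodedata module (my choice for
the example, not something used elsewhere in this thread). The helper
name nfd_codepoints is hypothetical. It checks that the three spellings
of LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND ACUTE listed in the quoted
message all reduce to one NFD sequence.

```python
import unicodedata


def nfd_codepoints(text):
    """Return the NFD form of text as a list of 'U+XXXX' strings."""
    return [f"U+{ord(ch):04X}" for ch in unicodedata.normalize("NFD", text)]


# The three spellings from the quoted message:
spellings = [
    "\u1EA4",              # (1) fully precomposed
    "\u00C2\u0301",        # (2) A WITH CIRCUMFLEX + COMBINING ACUTE
    "\u0041\u0302\u0301",  # (3) A + COMBINING CIRCUMFLEX + COMBINING ACUTE
]

# Collect the distinct NFD results; canonically equivalent inputs
# should collapse to a single entry.
nfd_forms = {tuple(nfd_codepoints(s)) for s in spellings}
print(nfd_forms)
```

A search engine that applies the same normalization form to both the
indexed pages and the queries would treat all three spellings as the
same string.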

The issue of canonical equivalence itself is that the circumflex
and dot-below can come in any order and have precisely the same
appearance, *and* that we could not predict the 'natural' order for
any given language.
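
The point about order can be checked with a small sketch, again
assuming Python's standard unicodedata module; the function name
canonically_equal is hypothetical. It compares the two possible orders
of circumflex and dot-below on a base A, and prints the canonical
combining classes that decide which order NFD picks.

```python
import unicodedata

CIRCUMFLEX = "\u0302"  # COMBINING CIRCUMFLEX ACCENT
DOT_BELOW = "\u0323"   # COMBINING DOT BELOW


def canonically_equal(a, b):
    """True if a and b have the same NFD form."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


circumflex_first = "A" + CIRCUMFLEX + DOT_BELOW
dot_first = "A" + DOT_BELOW + CIRCUMFLEX

# Canonical combining classes of the two marks.
print(unicodedata.combining(CIRCUMFLEX), unicodedata.combining(DOT_BELOW))

# Both orders are canonically equivalent to each other and to U+1EAC.
print(canonically_equal(circumflex_first, dot_first))
print(canonically_equal(circumflex_first, "\u1EAC"))

# NFD places the lower combining class (dot below) first.
print(unicodedata.normalize("NFD", circumflex_first) == dot_first)
```

Software that wants the circumflex-first order for display or editing
can still store or compare text in NFD and reorder on output, since the
two sequences are canonically equivalent.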

Mark
—————

Πόλλ’ ἠπίστατο ἔργα, κακῶς δ’ ἠπίστατο πάντα — Ὁμήρου Μαργίτῃ
[For transliteration, see http://oss.software.ibm.com/cgi-bin/icu/tr]

http://www.macchiato.com

----- Original Message -----
From: <DougEwell2@cs.com>
To: <unicode@unicode.org>
Cc: <stefan.probst@opticom.v-nam.net>
Sent: Tuesday, January 29, 2002 22:51
Subject: Re: Unicode Search Engines

> In a message dated 2002-01-28 7:37:48 Pacific Standard Time,
> stefan.probst@opticom.v-nam.net writes:
>
> > I would like to add:
> > How do they handle normalization?
> > In Vietnam, many characters can be represented in several
> > different ways:
> > (1) fully precomposed (NFC)
> > (2) base character and modifier precomposed, tonal mark combining
> > (3) base character, then modifier, then tonal mark
> > (4) like (3), but modifier and tonal mark sorted (NFD)
> > Do the search engines do any normalization, before indexing a
> > page?
> > Are queries normalized before running the search?
>
> I'm not sure what sort of normalization might be performed by search
> engines, but I want to examine the Vietnamese decomposition aspect
> for a moment.
>
> If you have a Vietnamese vowel with both modifier and tone mark, say
> LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND ACUTE, then you can
> represent this in Unicode in at least three ways:
>
> (1) fully precomposed (NFC) -- that is, U+1EA4
> (2) base character and modifier precomposed, tonal mark combining --
>     that is, U+00C2 U+0301
> (3) base character, then modifier, then tonal mark -- that is,
>     U+0041 U+0302 U+0301
>
> So far, so good. But then we have:
>
> > (4) like (3), but modifier and tonal mark sorted (NFD)
>
> If "sorting" the diacritical marks in NFD results in rearranging the
> two diacritical marks -- in this case, U+0041 U+0301 U+0302 -- then
> in terms of Vietnamese orthography, the NFD form may not really be a
> legitimate way of representing the Vietnamese letter.
>
> For example, U+1EAC LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND DOT
> BELOW is, in Vietnamese, a circumflexed A to which a tone mark (dot
> below) has been added. It is not a dotted-below A to which a
> circumflex has been added. Yet because of the canonical combining
> classes of the two diacriticals (230 for COMBINING CIRCUMFLEX ACCENT,
> 220 for COMBINING DOT BELOW), the latter is how the character will be
> decomposed.
>
> In theory, there is actually a case 5: base character and tonal mark
> precomposed, modifier combining. In terms of Vietnamese orthography,
> this is just as illegitimate as case 4 (NFD), but most software that
> processes Vietnamese text will probably never encounter it. But it
> will have to handle the NFD case.
>
> If I were on some other mailing lists I could think of, I would
> claim that this is a fatal flaw in the design of Unicode
> Normalization Form D. It's not, but it is a sticky problem that needs
> to be dealt with when dealing with Vietnamese text.
>
> -Doug Ewell
> Fullerton, California
>
>


