Re: Unicode Search Engines

From: Misha.Wolf@reuters.com
Date: Wed Jan 30 2002 - 10:48:10 EST


On 30/01/2002 15:30:06 Mark Davis wrote:
> It is not a 'fatal flaw'. NFD makes to pretensions to represent the

I imagine that "to" -> "no".

Misha

> most 'natural' ordering for any given language. Out of all the
> possible canonically equivalent sequences, it is simply a specific,
> well-defined, unique representation that is fully decomposed.
>
> The issue of canonical equivalence itself is that the circumflex
> and dot-below can come in any order and have precisely the same
> appearance, *and* that we could not predict the 'natural' order for
> any given language.
>
> Mark
> —————
>
> Πόλλ’ ἠπίστατο ἔργα, κακῶς δ’ ἠπίστατο πάντα — Ὁμήρου Μαργίτῃ
> [For transliteration, see http://oss.software.ibm.com/cgi-bin/icu/tr]
>
> http://www.macchiato.com
>
> ----- Original Message -----
> From: <DougEwell2@cs.com>
> To: <unicode@unicode.org>
> Cc: <stefan.probst@opticom.v-nam.net>
> Sent: Tuesday, January 29, 2002 22:51
> Subject: Re: Unicode Search Engines
>
>
> > In a message dated 2002-01-28 7:37:48 Pacific Standard Time,
> > stefan.probst@opticom.v-nam.net writes:
> >
> > > I would like to add:
> > > How do they handle normalization?
> > > In Vietnam, many characters can be represented in several
> > > different ways:
> > > (1) fully precomposed (NFC)
> > > (2) base character and modifier precomposed, tonal mark combining
> > > (3) base character, then modifier, then tonal mark
> > > (4) like (3), but modifier and tonal mark sorted (NFD)
> > > Do the search engines do any normalization, before indexing a
> > > page?
> > > Are queries normalized before running the search?
> >
> > I'm not sure what sort of normalization might be performed by search
> > engines, but I want to examine the Vietnamese decomposition aspect
> > for a moment.
> >
> > If you have a Vietnamese vowel with both modifier and tone mark, say
> > LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND ACUTE, then you can
> > represent this in Unicode in at least three ways:
> >
> > (1) fully precomposed (NFC) -- that is, U+1EA4
> > (2) base character and modifier precomposed, tonal mark combining --
> > that is, U+00C2 U+0301
> > (3) base character, then modifier, then tonal mark -- that is,
> > U+0041 U+0302 U+0301
> >
> > So far, so good. But then we have:
> >
> > > (4) like (3), but modifier and tonal mark sorted (NFD)
> >
> > If "sorting" the diacritical marks in NFD results in rearranging the
> > two diacritical marks -- in this case, U+0041 U+0301 U+0302 -- then
> > in terms of Vietnamese orthography, the NFD form may not really be a
> > legitimate way of representing the Vietnamese letter.
> >
> > For example, U+1EAC LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND DOT
> > BELOW is, in Vietnamese, a circumflexed A to which a tone mark (dot
> > below) has been added. It is not a dotted-below A to which a
> > circumflex has been added. Yet because of the canonical combining
> > classes of the two diacriticals (230 for COMBINING CIRCUMFLEX
> > ACCENT, 220 for COMBINING DOT BELOW), the latter is how the
> > character will be decomposed.
> >
> > In theory, there is actually a case 5: base character and tonal mark
> > precomposed, modifier combining. In terms of Vietnamese orthography,
> > this is just as illegitimate as case 4 (NFD), but most software that
> > processes Vietnamese text will probably never encounter it. But it
> > will have to handle the NFD case.
> >
> > If I were on some other mailing lists I could think of, I would
> > claim that this is a fatal flaw in the design of Unicode
> > Normalization Form D. It's not, but it is a sticky problem that
> > needs to be dealt with when dealing with Vietnamese text.
> >
> > -Doug Ewell
> > Fullerton, California
> >
> >
>
>
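
The normalization behaviour discussed in the quoted messages can be
checked with a short Python sketch. It uses only the standard
unicodedata module; the helper names search_key and code_points are
hypothetical, and nothing here is meant to describe what any real
search engine does. It only illustrates that normalizing both indexed
text and queries to one form makes the listed spellings compare equal,
and that NFD places the dot below ahead of the circumflex because of
their combining classes.

```python
import unicodedata

# Three spellings of LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND ACUTE
# listed in the thread.
spellings = [
    "\u1EA4",              # (1) fully precomposed
    "\u00C2\u0301",        # (2) A-circumflex + combining acute
    "\u0041\u0302\u0301",  # (3) A + combining circumflex + combining acute
]


def search_key(text):
    """Hypothetical helper: fold canonically equivalent text to NFC."""
    return unicodedata.normalize("NFC", text)


def code_points(text):
    """Hypothetical helper: list the code points of a string as U+XXXX."""
    return ["U+%04X" % ord(ch) for ch in text]


# If the index and the query both pass through search_key, all three
# spellings end up as the same key.
index_keys = {search_key(s) for s in spellings}

# Canonical combining classes of the two marks in U+1EAC.
circumflex_ccc = unicodedata.combining("\u0302")
dot_below_ccc = unicodedata.combining("\u0323")

# NFD orders marks by combining class, so the dot below comes first.
nfd_1eac = code_points(unicodedata.normalize("NFD", "\u1EAC"))

# The "orthographic" order (A, circumflex, dot below) is canonically
# equivalent, so it normalizes to the same sequence.
orthographic = "\u0041\u0302\u0323"
same_after_nfd = (
    unicodedata.normalize("NFD", orthographic)
    == unicodedata.normalize("NFD", "\u1EAC")
)

print(sorted(code_points(k) for k in index_keys))
print(circumflex_ccc, dot_below_ccc)
print(nfd_1eac, same_after_nfd)
```

NFC is used for the key here only as one reasonable choice; NFD would
work equally well, provided the same form is applied on both the
indexing and the query side.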



