Re: Text in composed normalized form is king, right? Does anyone generate text in decomposed normalized form?

From: Richard Wordingham <richard.wordingham_at_ntlworld.com>
Date: Wed, 6 Feb 2013 20:35:04 +0000

On Wed, 6 Feb 2013 10:18:33 +0100
Philippe Verdy <verdy_p_at_wanadoo.fr> wrote:

> 2013/2/5 Richard Wordingham <richard.wordingham_at_ntlworld.com>:
>> On Tue, 5 Feb 2013 12:16:47 +0100
>> Philippe Verdy <verdy_p_at_wanadoo.fr> wrote:

>>> A process can be FULLY conforming by preserving the canonical
>>> equivalence and treating ALL strings that are canonically
>>> equivalent, without having to normalize them in any recommended
>>> form,...

>> Try doing UCA collation with <U+0302 COMBINING CIRCUMFLEX ACCENT,
>> U+0067 LATIN SMALL LETTER G> being a collation element (with
>> arbitrary collation elements) without doing normalisation.
 
> <0302, 0067> is defective,...

So? The UCA default weighting necessarily has many 'defective'
collation elements - every character forms a collating element! There
are also a few 'defective' contractions, e.g. for <U+0F71 TIBETAN VOWEL
SIGN AA, U+0F72 TIBETAN VOWEL SIGN I>. The Burmese (my) collation in
the CLDR includes many 'defective' collating elements, such as
<U+102D MYANMAR VOWEL SIGN I, U+1000 MYANMAR LETTER KA, U+103A MYANMAR
SIGN ASAT>.

> ... and its normalisation is still <0302, 0067>, it is NOT canonically
> equivalent to <0067, 0302>

No one claimed that it was.
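
For anyone who wants to check both points mechanically, here is a
minimal sketch using Python's standard unicodedata module. The helper
name same_under is only an illustration, not part of any library.

```python
import unicodedata

defective = "\u0302g"   # <U+0302, U+0067>: combining mark first
ordinary = "g\u0302"    # <U+0067, U+0302>: base letter first

def same_under(form, a, b):
    """Illustrative helper: do a and b normalise to the same string?"""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

# The defective sequence is already in both NFC and NFD.
print(unicodedata.normalize("NFC", defective) == defective)  # True
print(unicodedata.normalize("NFD", defective) == defective)  # True

# The two orders are not canonically equivalent.
print(same_under("NFD", defective, ordinary))                # False

# Only the base-first order composes to U+011D.
print(unicodedata.normalize("NFC", ordinary) == "\u011d")    # True
```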

> I was not speaking about arbitrary collation elements containing
> defective sequences; is it a real case?

When you said 'a process', I assumed you meant all processes.
Trivially, normalisation is not required for copying!

So far as I am aware, this is not a real case. It was a thought
experiment with a scheme for transliterating Sumerian, where
final /b/, /d/ and /g/ seem to have become silent. It is not
inconceivable that one might want to mark the silence of a consonant on
the preceding vowel, and that could have implications for sorting.
Another hypothetical possibility is a system of tone marking using
normal Latin letters (e.g. Yi) supplemented by accents for register or
length. If the accents were promoted from the secondary level to the
primary level, such collation elements (diacritic plus consonant, in
that order) could arise.

>> Consider how you
>> would handle <U+011D LATIN SMALL LETTER G WITH CIRCUMFLEX, U+011D,
>> U+011D>!
 
> With which collation rule set? Including defective collation
> elements?

Just add <U+0302, U+0067> to the default set of collating elements.
The NFC string <U+011D, U+011D, U+011D> then consists of the four
collating elements <U+0067>, <U+0302, U+0067>, <U+0302, U+0067>,
<U+0302>, and it is impossible to handle the NFC string without using
NFD normalisation.
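
The segmentation above can be reproduced with a rough sketch in Python.
It is not a UCA implementation: it assumes a tailoring whose only
contraction is <U+0302, U+0067>, treats every other character as its
own collating element, uses simple contiguous longest-match, and
assigns no weights. The names CONTRACTIONS, collating_elements and
show are made up for the illustration.

```python
import unicodedata

# Assumed tailoring: default single-character elements plus the one
# contraction <U+0302, U+0067> discussed in this message.
CONTRACTIONS = {("\u0302", "g")}

def collating_elements(text):
    """Split text into collating elements after NFD, longest match first."""
    nfd = unicodedata.normalize("NFD", text)
    elements = []
    i = 0
    while i < len(nfd):
        pair = tuple(nfd[i:i + 2])
        if pair in CONTRACTIONS:
            elements.append(pair)       # two-character contraction
            i += 2
        else:
            elements.append((nfd[i],))  # ordinary single character
            i += 1
    return elements

def show(elements):
    """Render each collating element as a run of U+XXXX code points."""
    return [" ".join(f"U+{ord(c):04X}" for c in e) for e in elements]

nfc = "\u011d" * 3   # <U+011D, U+011D, U+011D>
print(show(collating_elements(nfc)))
# ['U+0067', 'U+0302 U+0067', 'U+0302 U+0067', 'U+0302']
```

Two of the four elements straddle the boundary between adjacent U+011D
characters, so no element boundary can be found by looking at the
precomposed characters one at a time.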

Richard.
Received on Wed Feb 06 2013 - 14:41:20 CST
