Re: Text in composed normalized form is king, right? Does anyone generate text in decomposed normalized form?

From: Richard Wordingham <richard.wordingham_at_ntlworld.com>
Date: Tue, 5 Feb 2013 20:19:01 +0000

On Tue, 5 Feb 2013 12:16:47 +0100
Philippe Verdy <verdy_p_at_wanadoo.fr> wrote:

> A process can be FULLY conforming by preserving the canonical
> equivalence and treating ALL strings that are canonically equivalent,
> without having to normalize them in any recommended form,...

Try doing UCA collation with <U+0302 COMBINING CIRCUMFLEX ACCENT,
U+0067 LATIN SMALL LETTER G> being a collation element (with arbitrary
collation elements) without doing normalisation. Consider how you
would handle <U+011D LATIN SMALL LETTER G WITH CIRCUMFLEX, U+011D,
U+011D>!
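
A minimal sketch of why this case is awkward, assuming Python's standard
unicodedata module. The contraction <U+0302, U+0067> is the hypothetical
collation element from the paragraph above; the names CONTRACTION and
code_points are illustrative only. The contraction never appears in the
precomposed string, but it appears twice, straddling character boundaries,
once the string is decomposed.

```python
import unicodedata

# Hypothetical contraction from the message: <U+0302, U+0067>.
CONTRACTION = "\u0302g"

def code_points(s):
    """Render a string as a list of U+XXXX labels (ASCII-only output)."""
    return " ".join(f"U+{ord(c):04X}" for c in s)

# Three precomposed LATIN SMALL LETTER G WITH CIRCUMFLEX.
composed = "\u011d" * 3
# Canonically equivalent decomposed form: g, U+0302, g, U+0302, g, U+0302.
decomposed = unicodedata.normalize("NFD", composed)

# A matcher working on raw code points sees no contraction in the
# composed string, yet finds it twice in the equivalent decomposed one.
hits_composed = composed.count(CONTRACTION)
hits_decomposed = decomposed.count(CONTRACTION)

print(code_points(composed))
print(code_points(decomposed))
print(hits_composed, hits_decomposed)
```

A process that skips normalisation would have to recognise that each U+011D
ends in a circumflex that can combine with the start of the next U+011D,
which amounts to decomposing on the fly.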

> For example, typically when a web browser has a plain-text search tool
> to look for some text present in the displayed page, it just needs to
> perform collation with a level 1 strength.

Perhaps, but I note that Firefox does at least level 2 matching for
Thai, and therefore will be vulnerable to vowels below following tone
marks, which are equivalent to vowels below preceding tone marks. The
former may be regarded as invalid by processes that are not Unicode
compliant (or are not processing Unicode text).
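
The Thai equivalence can be sketched as follows, again assuming Python's
unicodedata module. The choice of KO KAI, SARA U and MAI EK is only an
illustration; any vowel below and tone mark pair should behave the same
way. This says nothing about how Firefox actually matches text; it only
shows that a raw code point comparison treats the two orders as different.

```python
import unicodedata

# Illustrative characters: a consonant, a vowel below, and a tone mark.
KO_KAI, SARA_U, MAI_EK = "\u0e01", "\u0e38", "\u0e48"

vowel_first = KO_KAI + SARA_U + MAI_EK  # vowel below, then tone mark
tone_first = KO_KAI + MAI_EK + SARA_U   # tone mark, then vowel below

# Both marks have non-zero, distinct canonical combining classes,
# so canonical reordering may swap them.
ccc_vowel = unicodedata.combining(SARA_U)
ccc_tone = unicodedata.combining(MAI_EK)

# Comparing raw code points misses the equivalence.
raw_match = vowel_first == tone_first

# Comparing after normalisation finds it.
normalized_match = (
    unicodedata.normalize("NFD", vowel_first)
    == unicodedata.normalize("NFD", tone_first)
)

print(ccc_vowel, ccc_tone, raw_match, normalized_match)
```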

> Collation at level 1 does
> not require ANY normalization, and can be performed by a simple 1-to-1
> mapping, where canonically decomposable characters are mapped to a
> single simpler form and a simple 1-to-1 case folding, and where all
> combining diacritics are then filtered out as ignorable (if this is
> the rule for level 1 collation in the searched language).

Under the UCA defaults, Tibetan script vowels need some form of
normalisation for level 1 collation - length and quality indications
can be interchanged while preserving canonical equivalence, and both
contribute level 1 differences. These differences should be real for
Sanskrit.
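
A rough sketch of the Tibetan case, assuming Python's unicodedata module
and taking U+0F71 (vowel sign AA) as the length mark and U+0F72 (vowel
sign I) as the quality mark. The function naive_key is a hypothetical
stand-in for a 1-to-1 per-character mapping; it is not the UCA algorithm
and does not use real collation weights.

```python
import unicodedata

KA, AA, I = "\u0f40", "\u0f71", "\u0f72"

length_first = KA + AA + I   # length mark, then quality mark
quality_first = KA + I + AA  # quality mark, then length mark

def naive_key(s):
    """Hypothetical 1-to-1 mapping: one key per code point, no reordering."""
    return [ord(c) for c in s]

# The two marks have different non-zero combining classes, so the
# two sequences are canonically equivalent.
ccc_length = unicodedata.combining(AA)
ccc_quality = unicodedata.combining(I)

# Without normalisation the per-character keys differ.
keys_differ = naive_key(length_first) != naive_key(quality_first)

# With normalisation applied first, the keys agree.
keys_agree_after_nfd = (
    naive_key(unicodedata.normalize("NFD", length_first))
    == naive_key(unicodedata.normalize("NFD", quality_first))
)

print(ccc_length, ccc_quality, keys_differ, keys_agree_after_nfd)
```

Since both marks would carry primary weights under the UCA defaults, the
difference in order shows up at level 1 unless the input is normalised
first.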

Richard.
Received on Tue Feb 05 2013 - 14:23:04 CST
