Re: Text in composed normalized form is king, right? Does anyone generate text in decomposed normalized form?

From: Philippe Verdy <>
Date: Tue, 5 Feb 2013 12:16:47 +0100

2013/2/5 Richard Wordingham <>:
> Philippe Verdy <> wrote:
>> But if the W3C needs to update
>> something, it's to say that ALL forms that are canonically equivalent
>> should be treated equally. This means that it is to the recipient of
>> encoded documents to perform their own normalization.
> The problem comes with applications that ignore canonical
> normalisation. The stability of Unicode normalisation is guaranteed so
> that application can ignore the normalisation process!

I was not talking here about any recommended normalization, but ONLY
about canonical equivalence, which should be preserved by conforming
processes.
A process can be FULLY conforming by preserving canonical equivalence
and treating ALL canonically equivalent strings identically, without
having to normalize them to any recommended form or to perform any
reordering in its backing store. Alternatively, it can choose to
normalize to any form that is convenient for that process (NFC, NFD,
or something else).
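
To illustrate the point, here is a minimal sketch in Python of a
process that leaves its backing store untouched and still treats
canonically equivalent strings identically. The names
canonically_equal, stored and lookup are hypothetical, and the choice
of NFD as the internal comparison form is only an assumption made for
convenience; any canonical form would do.

```python
import unicodedata

def canonically_equal(a: str, b: str) -> bool:
    # Compare in a private canonical form (NFD here, chosen arbitrarily).
    # The caller's strings are never modified.
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

# Hypothetical backing store: strings are kept exactly as received,
# here in decomposed form and with marks in a fixed order.
stored = ["re\u0301sume\u0301", "q\u0323\u0307"]

def lookup(query: str) -> list:
    # Return stored strings canonically equivalent to the query,
    # in their original (unnormalized) encoding.
    return [s for s in stored if canonically_equal(s, query)]
```

A precomposed query such as "r\u00e9sum\u00e9", or one with the
combining marks in the other order such as "q\u0307\u0323", finds the
stored entry, and the entry comes back byte for byte as it was stored.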

For example, when a web browser has a plain-text search tool to look
for some text in the displayed page, it typically just needs to
perform collation at level 1 strength. Collation at level 1 does not
require ANY normalization, and can be performed with a simple 1-to-1
mapping, where canonically decomposable characters are mapped to a
single simpler form with a simple 1-to-1 case folding, and where all
combining diacritics are then filtered out as ignorable (if this is
the rule for level 1 collation in the searched language). Such a
process will be FULLY conformant and will not depend on ANY
normalization form being used in the input web page. In addition, it
is easy to implement and very efficient: the simple 1-to-1 mapping
table needed for such strength 1 collation is very small, and you
could implement it with a binary search or with a small hash table
with very few collisions, or even no collisions at all depending on
the hash function used.
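
A rough sketch of such a search key in Python follows. It is only an
illustration under stated assumptions, not a real level 1 collation:
the names base_char, primary_key and contains are hypothetical, the
mapping is derived from the canonical decomposition data in the
standard unicodedata module, and str.casefold() is used for brevity
even though it is a full case folding rather than a strict 1-to-1
one. It also assumes that diacritics are ignorable at level 1 in the
searched language, and it keeps only the first code point of each
decomposition, so it is lossy for the few characters whose canonical
decomposition contains more than one starter.

```python
import unicodedata

def base_char(ch: str) -> str:
    # Follow canonical decomposition mappings down to the leading code point.
    while True:
        decomp = unicodedata.decomposition(ch)
        # Empty: no mapping. "<tag> ...": compatibility mapping, left alone.
        if not decomp or decomp.startswith("<"):
            return ch
        ch = chr(int(decomp.split()[0], 16))

def primary_key(text: str) -> str:
    # Per-character mapping; the input is never normalized as a whole.
    out = []
    for ch in text:
        if unicodedata.combining(ch):
            continue  # combining mark (non-zero combining class): ignorable
        base = base_char(ch)
        if unicodedata.combining(base):
            continue
        out.append(base.casefold())  # assumption: full folding, not strictly 1-to-1
    return "".join(out)

def contains(page_text: str, needle: str) -> bool:
    # Plain-text search that does not care how either side is encoded.
    return primary_key(needle) in primary_key(page_text)
```

With this sketch, contains("Un cafe\u0301 noir", "CAF\u00c9") holds
whether the page uses precomposed or decomposed characters. In a real
implementation the per-character results would be precomputed into
the small table described above instead of being recomputed on each
call.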
Received on Tue Feb 05 2013 - 05:20:41 CST
