Re: Text in composed normalized form is king, right? Does anyone generate text in decomposed normalized form?

From: Philippe Verdy <>
Date: Sun, 10 Feb 2013 12:21:05 +0100

2013/2/7 Richard Wordingham <>:
> You said, on 5 February,
> "A process can be FULLY conforming by preserving the canonical
> equivalence and treating ALL strings that are canonically equivalent,
> without having to normalize them in any recommended form, or
> performing any reordering in its backing store, or it can choose to
> normalize to any other form that is convenient for that process (so it
> could be NFC or NFD, or something else)"
> There's no qualification there disqualifying collation at the secondary
> level from being a 'process' which may or may not be conforming.

In the email you cite, the restriction to the primary level was stated
before this sentence, and was implied in it. You just did not quote it
along with the sentence. Be careful about taking sentences out of their
context, when the whole thread started by speaking about the primary
level only, for basic searches.
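
To make the quoted statement concrete, here is a minimal Python sketch of a
process that treats canonically equivalent strings alike without changing
what it stores. The function name canonically_equivalent is hypothetical,
and NFD is only one possible internal form for the comparison (NFC would
work equally well here); this is an illustration, not a statement of how
any particular implementation does it.

```python
import unicodedata

def canonically_equivalent(a: str, b: str) -> bool:
    # Hypothetical helper: compare under a private normalization (NFD).
    # The inputs are never modified, so the backing store keeps
    # whatever form the text arrived in.
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

composed = "\u00e9"            # e-acute as a single code point
decomposed = "e\u0301"         # e + COMBINING ACUTE ACCENT
marks_1 = "a\u0323\u0301"      # a + dot below (ccc 220) + acute (ccc 230)
marks_2 = "a\u0301\u0323"      # same marks in the other order

print(composed == decomposed)                         # code point comparison
print(canonically_equivalent(composed, decomposed))   # equivalence-aware
print(canonically_equivalent(marks_1, marks_2))       # reordering is ignored
print(canonically_equivalent("e", composed))          # different text
```

The two strings with reordered marks differ as code point sequences, yet
they compare as equivalent because canonical reordering sorts the marks by
combining class before the comparison.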

OK, there are some pathological cases, but they are really constructed
and not made for modern languages (except a few Indic ones, as you
noted), and none of them concern the Latin script (your <TILDE+V>
example collating like <N> is not a true example: it is fully
constructed and not found in the CLDR).

If you just consider the initial question, having to decompose letters
to "recompose" them in defective ways just to create rare single
collation elements remains a very borderline case for applications
like browsers that just perform plain-text search at primary level on
a web page. Even if the implementation really uses a full
decomposition, I doubt it even has any implemented tailoring that
would recognize those defective collation elements.
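
As a rough illustration of that kind of search, here is a Python sketch.
It is only a crude stand-in for primary-level matching, not the UCA: it
decomposes to NFD, drops nonspacing marks (category Mn) and case-folds,
which is reasonable for the Latin examples discussed here but wrong for
scripts where such marks are significant at the primary level. The names
primary_key and primary_contains are hypothetical.

```python
import unicodedata

def primary_key(text: str) -> str:
    # Assumption: approximate a primary-level key by full canonical
    # decomposition, removal of nonspacing marks, then case folding.
    decomposed = unicodedata.normalize("NFD", text)
    base = "".join(ch for ch in decomposed
                   if unicodedata.category(ch) != "Mn")
    return base.casefold()

def primary_contains(haystack: str, needle: str) -> bool:
    # Plain substring search on the approximate keys.
    return primary_key(needle) in primary_key(haystack)

page = "Le cafe\u0301 est ferm\u00e9"   # mixes decomposed and composed forms

print(primary_contains(page, "CAFÉ"))    # matches regardless of form or case
print(primary_contains(page, "ferme"))   # accents ignored at this level
print(primary_key("v\u0303"))            # v + COMBINING TILDE
```

Note that the key for v + combining tilde is just "v": without a specific
tailoring, nothing in such a search would treat that sequence as a single
collation element sorting like <N>.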

Such elements may be used, for example, in old medieval texts, where
tildes serve as abbreviation marks with some unclear meaning anyway, and
would be more safely interpreted like the abbreviation dot we more
commonly see today. There are other notations for abbreviations that
even a full UCA implementation will not recognize, notably superscripts
or subscripts that are not encoded with superscript or subscript
characters but with standard baseline characters and styling (HTML
sub/sup elements, or spans with CSS styles), with no other encoded
invisible control to denote the meaning of this superscript as an
abbreviation. Similar issues occur with other styles such as strokes,
underlines and overlines (i.e. text-decoration in CSS), which a
plain-text-only search will not recognize (and certainly not if it's
working only at collation level
Received on Sun Feb 10 2013 - 05:28:18 CST
