From: verdy_p (email@example.com)
Date: Tue Sep 08 2009 - 01:06:05 CDT
"Asmus Freytag" wrote:
> On 9/7/2009 6:13 PM, verdy_p wrote:
> > "Peter Constable"
> > wrote:
> >> To: "firstname.lastname@example.org"
> >> Cc:
> >> Subject: RE: Visarga, ardhavisarga and anusvara -- combining marks or not?
> >> From: email@example.com [mailto:firstname.lastname@example.org] On Behalf Of Shriramana Sharma
> >>> To my mind, a combining mark is *usually* (though not always) something that
> >>> qualifies what is represented by a base character.
> >> Nothing in Unicode dictates what function in relation to reading or linguistic
> >> interpretation a combining mark should have.
> > Yes, but I still think that the main justification of the classification of a character as a combining mark
> > (or not) must be looked for within collation, i.e. the top level of analysis of text, where some differences
> > are considered less essential and are then not given primary weights for searches and sorting.
> There are plenty of languages where there's no primary difference
> between some otherwise ordinary letters when it comes to sorting.
I did not say the opposite. It is sufficient to know that there are widely used linguistic or notational
conventions where these primary differences are essential; I know perfectly well that collation almost always
needs to be tailored (locally, in a part of the default collation table) for almost all languages or notations.
This includes English, because the DUCET is not made specifically for it, but rather to reduce the number of
changes needed to tailor it for most languages: English mandates nothing beyond the Latin part of the repertoire
and generic considerations about general punctuation, and it cannot manage other specific notations, punctuation,
letters, or digits used only within other scripts, which it will simply sort out of sequence with generic rules.
This makes sense within the context of the default collation as described in UTS #10 (UCA), but also given common
practices when handling foreign languages and notations, and because the collation test is, or should be, the
primary test to validate many text transforms:
How do they manage these differences, or how do they fold them to produce consistent results? All gc=Mc
characters that a transform does not know could be handled like ignorables, but not gc=Lo letters. And the results
of the transforms should be consistent for the ignorable characters, if the transform is trying to preserve the
semantics of the text.
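As a rough sketch (nothing specified by Unicode itself — the function name and the "known" set are my own
invention for illustration), such a folding rule can be expressed with Python's standard unicodedata module:

```python
import unicodedata

def fold_unknown(text, known):
    """Sketch: drop unknown combining marks (gc=Mn/Mc/Me) as if they
    were ignorable, but keep unknown gc=Lo letters, which carry a
    primary collation weight and must never be dropped."""
    out = []
    for ch in text:
        if ch in known:
            out.append(ch)
        elif unicodedata.category(ch) in ("Mn", "Mc", "Me"):
            continue  # treat unknown combining marks as ignorable
        else:
            out.append(ch)  # gc=Lo and other base characters are preserved
    return "".join(out)

# U+0301 COMBINING ACUTE ACCENT is gc=Mn; U+05D0 HEBREW LETTER ALEF is gc=Lo
print(fold_unknown("e\u0301\u05d0", known=set("e")))  # -> "e" + U+05D0
```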
This is a more restrictive test than just the standard process conformance test (preservation of canonical
equivalences only), and it should guide the implementers of these transform algorithms if they want to keep most
of the initial text semantics without dropping too much information. You can then repeat the test for all the
locales for which you want to test the process, using each locale's tailored collation just for the characters
that are significant to the language; the other characters will still be processed using the default collation
rules if they are not covered by the tailoring.
Unfortunately, the compatibility decomposition mappings, the compatibility normalizations (NFKC/NFKD), and some
other basic transforms (such as the simple case mappings in the main UCD file) all fail this collation test (even
when using just the default collation order without additional tailoring). That is why they should not be used at
all, except as a last-chance fallback (for example, when rendering, if it is impossible to render the text
otherwise), but not for transforms from Unicode text to other Unicode text.
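The loss is easy to observe with Python's standard unicodedata module (a quick illustration of the point, not a
full collation test):

```python
import unicodedata

# NFKC folds compatibility characters onto other letters and digits,
# changing which primary units the text is made of:
print(unicodedata.normalize("NFKC", "\ufb01"))  # U+FB01 LATIN SMALL LIGATURE FI -> "fi"
print(unicodedata.normalize("NFKC", "\u00b2"))  # U+00B2 SUPERSCRIPT TWO -> "2"

# By contrast, NFC applies only canonical equivalence:
# "e" + U+0301 composes to the canonically equivalent U+00E9.
assert unicodedata.normalize("NFC", "e\u0301") == "\u00e9"
```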
Some other processes beyond text transforms also make sense in terms of collation: notably cluster breaking, the
line-breaking properties, the word-breaking properties... There are "simplified" algorithms that are supposed to
use rules not depending on collation, but even these algorithms then need to be tailored themselves. There is
still no conformance test checking that these extra tailorings of such algorithms are coherent with the collation
tailoring used for the same languages or notations.
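For instance, a minimal and deliberately naive sketch of one cluster-breaking rule (never break before a
combining mark) fits in a few lines of Python; the real UAX #29 rules, and their language tailorings, are much
richer than this:

```python
import unicodedata

def clusters(text):
    """Naive cluster breaking: attach every combining mark
    (gc=Mn/Mc/Me) to the preceding base character. This is only one
    of the UAX #29 rules, shown here as a sketch."""
    out = []
    for ch in text:
        if out and unicodedata.category(ch) in ("Mn", "Mc", "Me"):
            out[-1] += ch  # no break before a combining mark
        else:
            out.append(ch)
    return out

print(clusters("e\u0301x"))  # -> ["e" + U+0301, "x"]
```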
And I think that these tests could be built and automated, to make sure that the various algorithms are correctly
set up and do not forget important cases needed in some languages. Unfortunately there is nothing in TUS about
the various tailorings needed for specific languages (except in a few cases, such as the complex case mappings,
which have not been consistently tested against the UCA), because all UCA tailorings are out of the scope of TUS
and only in the scope of CLDR.
All that can be done in TUS is to make sure that the few algorithms described with generic character properties
(like the normative gc property) succeed in generating consistent results under the collation test with the
DUCET. The extra tests for specific languages with their own UCA tailorings should be made outside TUS (in CLDR,
for example), and should reveal where the generic algorithms specified in TUS or one of its annexes forget
important cases where tailoring should also be possible and described, to maintain the coherence of all
collations defined by reference to the default collation.
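Such an automated coherence test could be sketched as follows in Python, using a toy primary-level key (NFD with
combining marks removed) as a stand-in for real DUCET weights; the function names are mine, and this toy key does
not reproduce the DUCET's actual treatment of compatibility characters:

```python
import unicodedata

def primary_key(text):
    """Toy stand-in for a UCA primary-level sort key: decompose
    canonically (NFD), then drop combining marks, which carry no
    primary weight. Real DUCET weights are more nuanced."""
    nfd = unicodedata.normalize("NFD", text)
    return "".join(c for c in nfd if not unicodedata.combining(c))

def collation_coherent(transform, samples):
    """Coherence test sketch: a transform passes if it never changes
    the primary-level key of any sample string."""
    return all(primary_key(transform(s)) == primary_key(s) for s in samples)

samples = ["caf\u00e9", "e\u0301", "fi"]
nfc = lambda s: unicodedata.normalize("NFC", s)
nfkc = lambda s: unicodedata.normalize("NFKC", s)
print(collation_coherent(nfc, samples))                # True: NFC is canonical only
print(collation_coherent(nfkc, samples + ["\ufb01"]))  # False under this toy key:
                                                       # the fi-ligature folds to "fi"
```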
If an existing standardized algorithm cannot comply with both the process conformance test and the collation
coherence test, plus all the other conformance tests described specifically for each algorithm, then there is a
bug or hole in the standard annex describing the algorithm, or it may reveal that new distinct characters are
needed for correct handling of some languages (i.e. disunification and, in our context, possible re-encoding of
a character with a different gc, or the addition of new properties or property values, or modifications of the
standard algorithms to take these additional properties or property values into account).
I've not even made any reference to any actual languages in what I wrote initially.
This archive was generated by hypermail 2.1.5 : Tue Sep 08 2009 - 01:08:24 CDT