Re: Visarga, ardhavisarga and anusvara -- combining marks or not?

From: verdy_p (
Date: Tue Sep 08 2009 - 01:06:05 CDT

    "Asmus Freytag" wrote:
    > On 9/7/2009 6:13 PM, verdy_p wrote:
    > > "Peter Constable" wrote:
    > >
    > >> From: [] On Behalf Of Shriramana Sharma
    > >>
    > >>
    > >>> To my mind, a combining mark is *usually* (though not always) something that
    > >>> qualifies what is represented by a base character.
    > >>>
    > >> Nothing in Unicode dictates what function in relation to reading or linguistic
    > >> interpretation a combining mark should have.
    > >>
    > >
    > > Yes, but I still think that the main justification for classifying a character as a combining mark
    > > (or not) must be sought within collation, i.e. the top level of text analysis, where some differences are
    > > considered less essential and are therefore not given primary weights for searches, sorting,
    > There are plenty of languages where there's no primary difference
    > between some otherwise ordinary letters when it comes to sorting.

    I did not say the opposite. It is sufficient to know that there are widely used linguistic or notational
    conventions where these primary differences are essential; I know perfectly well that collation almost always
    needs to be tailored (locally, in part of the default collation table) for almost all languages and notations.

    (This includes English, because the DUCET is not made specifically for it; it is only designed to reduce the
    number of changes needed to tailor it for most languages. English mandates nothing beyond the Latin part of the
    repertoire and generic conventions for general punctuation; the specific notations, punctuation, letters and
    digits used only within other scripts fall outside its sequence and are sorted by generic rules alone.)

    This makes sense within the context of the default collation as described in UTS #10, but also given common
    practice when handling foreign languages and notations, and because the collation test is, or should be, the
    primary test for validating many text transforms:

    How do they manage these differences, or how do they fold them to produce consistent results? For gc=Mc
    characters that they don't know, they could handle them as ignorables, but not for gc=Lo letters. And the results
    of the transforms should be consistent for the ignorable characters if the transform is trying to preserve the
    primary differences.

    This is a more restrictive test than the standard process-conformance test (which requires only the preservation
    of canonical equivalences), and it should guide the implementers of these transform algorithms if they want to
    keep most of the initial text semantics without dropping too much information. The test can then be repeated for
    every locale for which the process is to be validated, using that locale's tailored collation for the characters
    that are significant to the language; the other characters are still processed under the default collation rules.

    Unfortunately, the compatibility decomposition mappings, the compatibility normalizations (NFKC/NFKD) and some
    other basic transforms (such as the simple case mappings in the main UCD file) all fail this collation test (even
    when using just the default collation order without additional tailoring), and that is why they should not be
    used at all, except as a last-chance fallback (for example, during rendering, when the text cannot be rendered
    without them), but never for transforms from Unicode text to other Unicode text.

    Some other processes (beyond text transforms) also make sense in terms of collation: notably grapheme cluster
    breaking, the line-breaking properties, the word-breaking properties... There are "simplified" algorithms that
    are supposed to use rules independent of collation, but even those algorithms then need to be tailored
    themselves. There is still no conformance test checking that these extra tailorings are coherent with the
    collation tailoring used for the same languages or notations.

    And I think that these tests could be written and automated, to make sure that the various algorithms are
    correctly set up and do not forget important cases needed in some languages. Unfortunately there is nothing in
    TUS about the various tailorings needed for specific languages (except in a few cases, such as the complex case
    mappings, which have not been consistently tested against the UCA), because all UCA tailorings are out of the
    scope of TUS and belong instead to CLDR.

    All that can be done in TUS is to make sure that the few algorithms described with generic character properties
    (like the normative gc property) generate consistent results under the collation test with the DUCET. The extra
    tests for specific languages, with their own UCA tailorings, should be performed outside TUS (in CLDR, for
    example); they should reveal where the generic algorithms specified in TUS or its annexes forget important cases
    where tailoring should also be possible and described, so as to maintain the coherence of all collations defined
    by reference to the default collation.

    If an existing standardized algorithm cannot pass both the process-conformance test and the collation-coherence
    test, plus all the other conformance tests described specifically for each algorithm, then there is a bug or a
    hole in the standard annex describing the algorithm; or it may reveal that new distinct characters are needed for
    the correct handling of some languages (i.e. disunification and, in our context, possible re-encoding of a
    character with a different gc, or the addition of new properties or property values, or modifications of the
    standard algorithms to take these additional properties or property values into account).

    Note that I made no reference to any actual language in what I wrote initially.

    This archive was generated by hypermail 2.1.5 : Tue Sep 08 2009 - 01:08:24 CDT