Re: Visarga, ardhavisarga and anusvara -- combining marks or not?

From: verdy_p (
Date: Mon Sep 07 2009 - 17:07:27 CDT

    "Asmus Freytag"
    > The second is the radical solution: reclassify every single
    > character from Mc to Lo where there isn't any compelling
    > reason (in rendering or processing) to consider that
    > character actually "combining" in function, not just in name.
    > The advantage of this approach is that it would be very
    > visible and direct. Treating an "Lo" character by using the
    > support for graphically combining characters in a
    > renderer is obviously wrong, so you might expect a
    > pressure on *all* implementations to get that corrected.
    > The downside, of course, is that it's impossible to predict
    > what uses the gc=Mc classification has been put to by
    > actual implementations, outside of simple rendering issues.
    > You are correct in calling such an approach destabilizing,
    > no matter how appealing it would be, otherwise. For
    > the same reason, UTC is correct to continue to be
    > consistent with past practice in assigning Mc to any new
    > characters that are analogues to existing Mc characters.

    This solution would be much too radical. Effectively, if you are speaking about rendering gc=Mc characters, they
    should be rendered like other gc=Lo characters and handled with the same simpler model (which does not have to focus
    on combining marks and the ill-named "non-spacing" or "spacing" dichotomy between all combining marks, but would
    focus only on possible ligatures and/or conjuncts, i.e. the preferred ligated forms).
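
    To make the classification under discussion concrete, here is a minimal sketch that simply reads the General
    Category from the Unicode Character Database through Python's standard unicodedata module. The choice of sample
    characters is mine, for illustration only; the values printed depend on the Unicode version shipped with the
    interpreter.

```python
import unicodedata

# Illustrative sample (my own selection): a base letter and several
# dependent signs from the Devanagari block.
SAMPLES = ["\u0915", "\u0902", "\u0903", "\u093E", "\u094D"]


def general_categories(chars):
    """Map each character to its General Category (gc) value."""
    return {ch: unicodedata.category(ch) for ch in chars}


if __name__ == "__main__":
    for ch, gc in general_categories(SAMPLES).items():
        # Print code point, gc and character name side by side.
        print(f"U+{ord(ch):04X}  {gc}  {unicodedata.name(ch)}")
```

    The letter KA comes out as Lo, the visarga and the vowel sign AA as Mc, and the anusvara and the virama as Mn,
    which is the split that a reclassification to Lo would erase for the Mc members.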

    But the main problem you'll have is that it would change how many other uses, outside just rendering, are
    implemented. Notably for handling full-text searches: the gc=Mc classification makes a clear split in the order of
    importance from gc=Lo letters, which are considered much more important and absolutely needed for every search at
    the primary level, as soon as you are trying to cope with variable orthographies. The gc=Mc marks are not always
    present in all texts, or not presented to users in all styles (so there are cases where they are not rendered at
    all, even if they are encoded, to accommodate these presentation styles, for example in titling and monumental
    inscriptions, or in summaries and book indexes, and even in dictionaries or diaries for the general classification
    of words).
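
    As a rough illustration of a search that leans on the gc, here is a sketch of a "primary level" key that drops
    every character whose gc starts with "M". This is deliberately crude and is not how the UCA or the DUCET assign
    weights; the names primary_key, loose_contains and mc_as_lo are hypothetical helpers introduced only for this
    example. The mc_as_lo function simulates the proposed reclassification so that the effect on the key can be seen.

```python
import unicodedata


def primary_key(text, category=unicodedata.category):
    """Crude primary-level key: drop every character whose gc is M* (Mn, Mc, Me).

    `category` is injectable so that a hypothetical reclassification
    can be simulated without touching the real character database.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not category(ch).startswith("M"))


def loose_contains(haystack, needle, category=unicodedata.category):
    """Mark-insensitive substring test built on primary_key."""
    return primary_key(needle, category) in primary_key(haystack, category)


def mc_as_lo(ch):
    """Hypothetical gc lookup in which every Mc character has become Lo."""
    gc = unicodedata.category(ch)
    return "Lo" if gc == "Mc" else gc


if __name__ == "__main__":
    # DA + VOWEL SIGN U + VISARGA + KHA
    word = "\u0926\u0941\u0903\u0916"
    print([f"U+{ord(c):04X}" for c in primary_key(word)])
    print([f"U+{ord(c):04X}" for c in primary_key(word, mc_as_lo)])
```

    With the current gc the key keeps only the two consonants. With the simulated reclassification the visarga
    survives in the key, so a text written without it no longer matches: the same code silently changes behaviour.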

    You may argue that a well-behaved collation algorithm should not depend on gc classification, and that collation
    still needs to be tailored for a lot of languages. But the reality is that even the default collation table, used as
    the root for all tailoring, needs to be maintained and built up from the ground by first looking at the gc
    classification. If you change the gc massively, you will break a lot of existing collation algorithms, unless they
    are built on top of a full copy of the DUCET. You will also have difficulties, at Unicode, to maintain the DUCET for
    the future, because the primary or secondary level of "importance" of characters is not tracked anywhere else in the

    My opinion is that this radical change is absolutely not needed. The standard just needs to say that gc=Mc
    characters should be treated like gc=Lo for rendering, and ONLY for rendering purposes, and ONLY if those characters
    are effectively rendered, because there do exist contexts within which they will not be rendered with the rest of
    the text when some styles are applied. From my point of view, I'm not saying that the gc=Mc characters are not
    useful orthographically, but just that they have a secondary role, and they can be used as optional, notational-like
    additions on top of a simpler text, in the same way that you can analyse, in a multi-level approach, fully
    pointed Hebrew or Arabic texts, or epigraphic Greek, where a lot of additional marks were written, sometimes with
    very strict orthographic or stylistic rules, to complement the primary level of text.

    It also happens that some gc=Mc marks have changed their role over history, between being considered as
    plain letters and being just additional optional marks. This role may also vary between several distinct languages,
    including in modern use, where some may have disappeared from the usual orthography, and some others have been
    promoted to being used almost systematically to disambiguate some words or the oral spelling.

    The gc=Mc reclassification as gc=Lo would just SUPPOSEDLY simplify the rendering. But in my opinion it is not
    needed: designers of fonts and renderers just have to be prompted to treat these characters like base letters when
    they have to render them, so they must not, for example, use the dotted circle for sequences they don't recognize
    with their linguistic rules: it's not up to the renderer or font to work with linguistic issues, unless it is
    otherwise impossible to get the correct rendering needed and expected for specific languages.

    Anyway, I don't think that existing implementations exhibit the major rendering problems that you propose to
    solve with such a radical change. The problems do exist, but they are at another level, one that does not just
    involve the gc=Mc characters, but clusters of letters (such as Indic consonants, with multiple forms: full, half,
    subjoined, post-joined, halant-below, and ligatures/conjuncts, when some of the letters are used with virama and
    sometimes with additional (dis)joiner controls). In fact, if you change gc=Mc to gc=Lo, you will add even more
    complication to the algorithms that have to handle the Indic variable forms, because the first thing they have to
    manage is the identification of base letters and of the boundaries of the letter clusters...
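
    The dependency can be sketched with a toy cluster splitter. This is NOT the UAX #29 segmentation and not what
    any real shaping engine does; it is a simplification under my own assumptions: a character with gc=M* attaches
    to the preceding cluster, and a letter directly following the Devanagari virama stays in the same cluster. The
    names clusters and mc_as_lo are hypothetical, and mc_as_lo only simulates the proposed reclassification.

```python
import unicodedata

VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA


def clusters(text, category=unicodedata.category):
    """Toy cluster splitter driven only by the gc and the virama."""
    result = []
    for ch in text:
        is_mark = category(ch).startswith("M")
        # A letter right after a virama continues the conjunct.
        joins_conjunct = bool(result) and result[-1].endswith(VIRAMA)
        if result and (is_mark or joins_conjunct):
            result[-1] += ch
        else:
            result.append(ch)
    return result


def mc_as_lo(ch):
    """Hypothetical gc lookup in which every Mc character has become Lo."""
    gc = unicodedata.category(ch)
    return "Lo" if gc == "Mc" else gc


if __name__ == "__main__":
    # KA VIRAMA SSA, TA VIRAMA RA VOWEL-SIGN-I, YA VISARGA
    word = "\u0915\u094D\u0937\u0924\u094D\u0930\u093F\u092F\u0903"
    print(len(clusters(word)))            # clusters with the current gc
    print(len(clusters(word, mc_as_lo)))  # clusters after the simulated change
```

    With the current gc the word splits into three clusters. Under the simulated reclassification the vowel sign I
    and the visarga each start a cluster of their own, giving five, so the splitter would need extra per-character
    rules to recover the boundaries it previously obtained from the gc alone.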

    This archive was generated by hypermail 2.1.5 : Mon Sep 07 2009 - 17:10:51 CDT