Re: Combining sequences (was: Unicode Public Review Issues update)

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Thu Jul 03 2003 - 08:28:55 EDT

    On Thursday, July 03, 2003 12:41 PM, Kent Karlsson <kentk@cs.chalmers.se> wrote:

    > > > There is no point in having a soft-dotted property for the capital
    > > > letter...
    > >
    > > No effectively, but the Soft_Dotted property interacts with the case
    > > conversions, so using or removing an addition dot before another
    > > combining diacritic must be explained.
    >
    > I don't know what you are after here (please don't explain). But
    > the soft-dotted property is for letters that typographically have
    > a dot (or two!) above, that "goes away" when another diacritic above
    > is applied. That excludes all known uppercase letters.

    Don't misread what I am saying here: I am not saying that the uppercase
    or titlecase versions of digraphs containing an I or J should be marked
    Soft_Dotted, as it is evident (?) that they should not have a dot above
    (there may be exceptions in some decorative fonts, created mostly
    for English, which typically does not use any accents, and where the
    presence or absence of a dot above these letters may be regarded as
    "decorative" or stylistic, as part of the font design).

    > > And what about the other case-related <LJ>, <Lj>, <lj> digraphs?
    > > As well as the other case-related <NJ>, <Nj>, <nj> digraphs?
    > > (look at U+01C7 to U+01CC).
    >
    > Except for the uppercase ones, these were on my initial list (with a
    > question mark). But I'm quite content not to push for those. There
    > is no point. Those characters should not be used under any
    > circumstances anyway (though they are not formally deprecated).
    > I don't expect any system to handle combining characters applied to
    > those characters in anything but a very crude way.

    Unicode characters can be called "deprecated" or strongly discouraged;
    however, they are still valid, so it is best to describe what their
    correct behavior should be. My question was asked only for completeness,
    something that the Public Review Issues are supposed to improve and
    document officially, even for "deprecated" characters.

    > > Is the dot removal correctly explained for a i with a non-above
    > > diacritic (for example U+1E2D: LATIN SMALL LETTER I WITH
    > > TILDE BELOW, which is decomposable as <U+0069,U+0330>):
    > > this letter keeps its dot above in both cases...
    >
    > ("both cases"?)
    >
    > The (typographic) dot(s) above should be removed if there is a
    > combining character of class 230 [centred above] in a combining
    > sequence starting with a soft-dotted character. The file
    > UCD-4.0.0.html only says "An accent placed on these characters...";
    > but the "on" here should be interpreted as "class 230". That could
    > be clarified.

    Thanks for acknowledging that the current description can easily be
    misread as meaning "any diacritic". With such a misreading, a simple font
    renderer may just check for the presence of a first diacritic and switch
    to a dotless glyph, even if that diacritic is not of class 230. My
    <tilde-below> example shows a problem that can arise from an NFD
    decomposition, even though the Public Review issue lists the precomposed
    characters to which the Soft_Dotted property is or is not added.
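
    To make the difference concrete, here is a minimal Python sketch of the
    two readings. It assumes the "class 230" interpretation given above, and
    it stops scanning at the next code point of class 0, which is my own
    choice and not something the current text states. SOFT_DOTTED_SAMPLE is
    a hypothetical two-letter stand-in for the real property list, and both
    function names are invented for illustration.

```python
import unicodedata

# Assumption: a tiny hand-picked stand-in for the Soft_Dotted property.
# The real list lives in the Unicode Character Database (PropList.txt).
SOFT_DOTTED_SAMPLE = {"i", "j"}


def loses_dot(sequence: str) -> bool:
    """Class-230 reading: the dot goes away only for an 'above' mark."""
    nfd = unicodedata.normalize("NFD", sequence)
    if not nfd or nfd[0] not in SOFT_DOTTED_SAMPLE:
        return False
    for ch in nfd[1:]:
        ccc = unicodedata.combining(ch)
        if ccc == 0:
            break  # assumed end of the combining sequence
        if ccc == 230:
            return True
    return False


def naive_loses_dot(sequence: str) -> bool:
    """The misreading: any combining mark right after the base removes the dot."""
    nfd = unicodedata.normalize("NFD", sequence)
    return (
        len(nfd) > 1
        and nfd[0] in SOFT_DOTTED_SAMPLE
        and unicodedata.combining(nfd[1]) != 0
    )


# U+1E2D decomposes to <U+0069, U+0330>; U+0330 has combining class 220.
print(loses_dot("\u1E2D"), naive_loses_dot("\u1E2D"))
# <U+0069, U+0301>: U+0301 has combining class 230.
print(loses_dot("i\u0301"), naive_loses_dot("i\u0301"))
```

    The two functions agree on i + acute but disagree on U+1E2D, which is
    exactly the case where the naive renderer would wrongly drop the dot.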

    I would like an exact statement of what "on" means: does it refer *only*
    to class 230? What is the effect of format controls inserted
    in a combining sequence, which are currently identified by the
    "Default_Ignorable_Code_Point" derived property?
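
    One data point for the second question, as a hedged sketch: the code
    points below are a hand-picked sample of default-ignorable characters
    (the sample table is my own, not taken from any Unicode data file), and
    Python's unicodedata module reports combining class 0 for each of them.
    So a scan that keys purely on combining classes would treat them like
    starters. Whether a renderer should do so is precisely what remains to
    be clarified.

```python
import unicodedata

# Hypothetical sample of code points with Default_Ignorable_Code_Point=Yes.
SAMPLE_IGNORABLES = {
    "\u00AD": "SOFT HYPHEN",
    "\u034F": "COMBINING GRAPHEME JOINER",
    "\u200C": "ZERO WIDTH NON-JOINER",
    "\u200D": "ZERO WIDTH JOINER",
}


def describe(ch: str) -> tuple:
    """Return (general category, canonical combining class) for one code point."""
    return unicodedata.category(ch), unicodedata.combining(ch)


for ch, name in SAMPLE_IGNORABLES.items():
    category, ccc = describe(ch)
    print(f"U+{ord(ch):04X} {name}: gc={category}, ccc={ccc}")
```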

    Maybe there should be some clarification in the current text defining
    what a combining sequence is (in terms of code points, and evidently
    not in terms of grapheme clusters, which are a font issue that Unicode
    cannot and should not handle, including for ligatures).

    This includes checking the definition for some scripts that have a
    strong structure:

    - There is not much difficulty with the alphabetic scripts (those
    written mostly with codepoints below U+0800), except for the
    currently discussed issue with Traditional Hebrew (and possibly
    Syriac, which has been standardized using the scholarly convention
    and script type, possibly overlooking local conventions that were
    formally unified with the Western scholarly type?), or the new
    discussion about the Modern Greek Prosgegrammeni (iota adscript)
    and the Classical Greek Ypogegrammeni (iota subscript)...

    - Hangul syllables are very well defined

    - Brahmic scripts would be better served by a formal definition of
    the syllable boundaries, which de facto create combining
    sequences that go beyond the strict definition of combining classes and
    include nuktas, consonants with implied vowels, vowel modifiers
    and viramas,

    - Tibetan stacks are less well documented

    - We have little info about Ethiopic and Canadian Syllabics, and about
    their possible extension with diacritics.
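
    As an illustration of the gap for Brahmic scripts, here is a minimal
    Python sketch of a purely code-point-level splitter. It assumes that a
    combining sequence is a code point followed by any run of characters
    whose general category is a mark (Mn, Mc, Me); that is a simplification
    of mine, not the wording of the standard, and the function name is
    invented.

```python
import unicodedata


def combining_sequences(text: str) -> list:
    """Split text so that each mark (gc = M*) attaches to the preceding code point."""
    sequences = []
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")
        if is_mark and sequences:
            sequences[-1] += ch
        else:
            sequences.append(ch)
    return sequences


# Latin: e + COMBINING ACUTE ACCENT, then a.
print(combining_sequences("e\u0301a"))
# Devanagari conjunct: KA + VIRAMA + SSA (a single written syllable).
print(combining_sequences("\u0915\u094D\u0937"))
```

    Under this definition the Devanagari conjunct is cut into two pieces,
    <KA, VIRAMA> and <SSA>, although it is written and read as one unit.
    That is the kind of syllable boundary a formal per-script description
    would have to settle.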

    Unicode 4.0 has probably been released as far as the list of supported
    codepoints is concerned, but the lack of prereleased chapters describing
    these issues in better wording than what is found in Unicode 3.0 is
    something that should not be neglected. Is there still time to propose
    a few changes of wording for the next prereleased version
    of Unicode 4.0? Or will they be included in a future 4.1 update?

    Currently, codepoints are not really used as a way to represent
    actual abstract characters, but as a way to unify them through some
    common properties, without multiplying the number of codepoints
    with too many precomposed abstract characters. But a more
    formal description of the encoding sequences used for each script
    would be useful: the current description is really precise only for
    general scripts, Hangul, and Hiragana/Katakana.

    Some things are still not very precisely defined for ideographic
    scripts, notably the ideographic description characters:
    are they intended to encode ideographs that have not yet been
    encoded in Unihan? Are they meant to allow creating indexing
    dictionaries in which characters can be searched by
    radicals (using a dictionary-specific equivalence system, so
    that these decomposed ideographs more or less define
    a sort of equivalence system, even if it is not used with the NF forms)?
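
    For readers unfamiliar with these characters, here is a minimal Python
    sketch of the prefix syntax they use: each ideographic description
    character (U+2FF0 to U+2FFB) is an operator followed by two operands,
    or three for U+2FF2 and U+2FF3. The sketch assumes that any other code
    point counts as an operand and ignores any length limits, so it only
    checks the shape of a description sequence, not whether the sequence
    makes sense. The function names are invented.

```python
# Operators taking three operands; the other ten in U+2FF0..U+2FFB take two.
TERNARY = {"\u2FF2", "\u2FF3"}
BINARY = {chr(cp) for cp in range(0x2FF0, 0x2FFC)} - TERNARY


def _parse(text: str, pos: int) -> int:
    """Parse one description at pos; return the end index, or -1 on failure."""
    if pos >= len(text):
        return -1
    ch = text[pos]
    arity = 3 if ch in TERNARY else 2 if ch in BINARY else 0
    pos += 1
    for _ in range(arity):
        pos = _parse(text, pos)
        if pos < 0:
            return -1
    return pos


def is_well_formed_ids(text: str) -> bool:
    """True if text is exactly one complete description sequence."""
    return _parse(text, 0) == len(text)


# U+2FF0 (left to right) + U+6C35 + U+6BCF, a description of U+6D77.
print(is_well_formed_ids("\u2FF0\u6C35\u6BCF"))
# The same operator with only one operand.
print(is_well_formed_ids("\u2FF0\u6C35"))
```

    Nothing in this syntax says that the described sequence is equivalent
    to U+6D77 itself, which is the point of the questions above.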

    These questions have an impact on what constitutes a valid sequence
    to encode an actual character, what constitutes a
    "starter" code point, and where actual character boundaries
    occur in each script.


