Re: Combining sequences (was: Unicode Public Review Issues update)

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Thu Jul 03 2003 - 08:28:55 EDT

Next message: Andrew C. West: "Re: Documents needed for proposal"

Previous message: Anto'nio Martins-Tuva'lkin: "Documents needed for proposal"
Next in thread: Kent Karlsson: "RE: Combining sequences (was: Unicode Public Review Issues update)"
Reply: Kent Karlsson: "RE: Combining sequences (was: Unicode Public Review Issues update)"

On Thursday, July 03, 2003 12:41 PM, Kent Karlsson <kentk@cs.chalmers.se> wrote:

> > > There is no point in having a soft-dotted property for the capital
> > > letter...
> >
> > No effectively, but the Soft_Dotted property interacts with the case
> > conversions, so using or removing an addition dot before another
> > combining diacritic must be explained.
>
> I don't know what you are after here (please don't explain). But
> the soft-dotted property is for letters that typographically have
> a dot (or two!) above, that "goes away" when another diacritic above
> is applied. That excludes all known uppercase letters.

Don't misread what I mean here: I am not saying that the uppercase or
titlecase versions of digraphs containing an I or J should be marked
Soft_Dotted, as it is evident (?) that they should not have a dot above
(there may be exceptions in some decorative fonts, created mostly
for English, which typically uses no accents, and where the
presence or absence of a dot above these letters may be regarded as
"decorative" or stylistic, as part of the font design).

> > And what about the other case-related <LJ>, <Lj>, <lj> digraphs?
> > As well as the other case-related <NJ>, <Nj>, <nj> digraphs?
> > (look at U+01C7 to U+01CC).
>
> Except for the uppercase ones, these were on my initial list (with a
> question mark). But I'm quite content not to push for those. There
> is no point. Those characters should not be used under any
> circumstances anyway (though they are not formally deprecated).
> I don't expect any system to handle combining characters applied to
> those characters in anything but a very crude way.

Unicode characters may be called "deprecated" or strongly discouraged;
however, they are still valid, so it is best to describe what their
correct behavior should be. My question was only there for completeness,
something that the Public Review Issues are supposed to improve and
document officially, even for "deprecated" characters.

> > Is the dot removal correctly explained for a i with a non-above
> > diacritic (for example U+1E2D: LATIN SMALL LETTER I WITH
> > TILDE BELOW, which is decomposable as <U+0069,U+0330>:
> > this letter keeps its dot above in both cases...
>
> ("both cases"?)
>
> The (typographic) dot(s) above should be removed if there is a
> combining character of class 230 [centred above] in a combining
> sequence starting with a soft-dotted character. The file
> UCD-4.0.0.html only says "An accent placed on these characters...";
> but the "on" here should be interpreted as "class 230". That could
> be clarified.

Thanks for admitting that the current description may easily be misread
as meaning "any diacritic". With such a misreading, a simple font renderer
may just check for the presence of a first diacritic and use a dotless
glyph, even if that diacritic is not of class 230 (my example with
<tilde-below> shows a possible problem that may arise from an NFD
decomposition, even if the Public Review lists the precomposed characters
for which the Soft_Dotted property is or is not added).
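
To make the "class 230 only" reading concrete, here is a minimal sketch
in Python. It is only an illustration under assumptions: the SOFT_DOTTED
set is a hypothetical, partial stand-in for the real property (Python's
unicodedata module does not expose Soft_Dotted), loses_dot is a name
invented for this example, and stopping at the first code point of
combining class 0 is just one possible treatment of controls inserted
in the sequence, not a settled answer.

```python
import unicodedata

# Hypothetical, partial stand-in for the Soft_Dotted property: the
# unicodedata module does not expose it, so only i and j are listed.
SOFT_DOTTED = {"\u0069", "\u006A"}

def loses_dot(sequence: str) -> bool:
    """Return True if the leading soft-dotted letter should drop its dot.

    Rule sketched here: the dot goes away only if the combining sequence
    contains a mark of combining class 230 (above), not for any mark.
    """
    decomposed = unicodedata.normalize("NFD", sequence)
    if not decomposed or decomposed[0] not in SOFT_DOTTED:
        return False
    for ch in decomposed[1:]:
        ccc = unicodedata.combining(ch)
        if ccc == 0:
            # Assumption: a class-0 code point ends the sequence.
            break
        if ccc == 230:
            return True
    return False
```

With this rule, loses_dot("\u1E2D") is False because the tilde below has
class 220, while loses_dot("i\u0303") is True. A renderer that only
checks "is there a diacritic?" would wrongly treat both the same way.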

I would like an exact statement of what "on" means: does it refer *only*
to class 230? What is the impact of format controls inserted
in a combining sequence, which are currently identified by the
"Default_Ignorable_Code_Point" derived property?

Maybe there should be some clarification in the current text defining
what a combining sequence is (in terms of code points, and evidently
not in terms of grapheme clusters, which is a font issue that Unicode
cannot and should not handle, including for ligatures).
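
As a rough illustration of a definition stated purely in terms of code
points, the sketch below splits a string into combining sequences. It
assumes, for the sake of the example, that a sequence is one non-mark
code point followed by any code points of general category M (Mn, Mc,
Me); the function name is invented here, and the treatment of format
controls is deliberately left out.

```python
import unicodedata

def combining_sequences(text: str) -> list[str]:
    """Split text into runs of one base code point plus following marks."""
    sequences: list[str] = []
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")
        if is_mark and sequences:
            # Attach the mark to the sequence currently being built.
            sequences[-1] += ch
        else:
            # A non-mark (or a leading, defective mark) starts a new one.
            sequences.append(ch)
    return sequences
```

Note that this says nothing about glyphs or ligatures: it only groups
code points, which is the level at which the definition should operate.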

This includes checking the definition for some scripts that have a
strong structure:

- There is not much difficulty with the alphabetic scripts (those
written mostly with code points below U+0800), except for the
currently discussed issue with Traditional Hebrew (and possibly
Syriac, which has been standardized using the scholarly convention
and script type, possibly overlooking local conventions that were
formally unified with the Western scholarly type?), or the new
discussion about Modern Greek Prosgegrammeni (iota adscript)
and Classical Greek Ypogegrammeni (iota subscript)...

- Hangul syllables are very well defined

- Brahmic scripts would be better served by a formal definition of
the syllable boundaries, which de facto create combining
sequences that extend beyond the strict definition of combining
classes and include nuktas, consonants with implied vowels, vowel
modifiers and viramas,

- Tibetan stacks are less well documented

- We have little information about Ethiopic and Canadian Syllabics,
or about their possible extension with diacritics.
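
To show what "very well defined" means in the Hangul case from the list
above, here is a short sketch of the arithmetic decomposition of a
precomposed syllable into its jamos. The constants are the ones used by
the standard's Hangul algorithm; the function name is invented for this
example, and the code covers decomposition only.

```python
# Constants of the Hangul syllable arithmetic.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
N_COUNT = V_COUNT * T_COUNT  # 588 syllables per leading consonant
S_COUNT = L_COUNT * N_COUNT  # 11172 precomposed syllables

def decompose_hangul(syllable: str) -> str:
    """Return the jamo sequence for one precomposed Hangul syllable."""
    s_index = ord(syllable) - S_BASE
    if not 0 <= s_index < S_COUNT:
        return syllable  # not a precomposed syllable: leave unchanged
    lead = L_BASE + s_index // N_COUNT
    vowel = V_BASE + (s_index % N_COUNT) // T_COUNT
    trail = T_BASE + s_index % T_COUNT
    jamos = chr(lead) + chr(vowel)
    if trail != T_BASE:
        # T_BASE itself means "no trailing consonant".
        jamos += chr(trail)
    return jamos
```

For example, decompose_hangul("\uD55C") gives "\u1112\u1161\u11AB". No
other script in this list has its sequences pinned down this precisely.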

Unicode 4.0 has probably been released as far as the list of supported
code points is concerned, but the lack of prereleased chapters
describing these issues in better wording than what is found in
Unicode 3.0 should not be neglected. Is there still time to propose
a few wording changes for the upcoming prereleased version
of Unicode 4.0? Or will they be included in a future 4.1 update?

Currently, code points are used not really as a way to represent
actual abstract characters, but as a way to unify them through some
common properties, without multiplying the number of code points
with too many precomposed abstract characters. But a more
formal description of the encoding sequences used for each script
would be useful: the current description is really precise only for
general scripts, Hangul, and Hiragana/Katakana.

Some things are still not very precisely defined for ideographic
scripts, notably the ideographic description characters:
are they intended to encode ideographs that have not yet been
encoded in Unihan? Are they meant to allow creating indexing
dictionaries in which characters can be searched by
radical (using a dictionary-specific equivalence system, so
that these decomposed ideographs more or less define
a sort of equivalence system, even if it is not used with NF forms)?

These questions affect what constitutes a valid sequence
to encode an actual character, and what constitutes a
"starter" code point, or where actual character boundaries
occur in each script.

This archive was generated by hypermail 2.1.5 : Thu Jul 03 2003 - 09:20:08 EDT