Re: Combining latin small letters with diacritics from Philippe Verdy on 2012-03-11 (Unicode Mail List Archive)

From: Philippe Verdy <verdy_p_at_wanadoo.fr>
Date: Mon, 12 Mar 2012 01:09:09 +0100

Note also that you have already agreed to encode characters like
COMBINING LATIN SMALL LETTER U WITH DIAERESIS.

The bad thing is that if it is used without any separator, nothing
clearly separates it from the orthographic level, so orthographic
checkers will choke on it. There's no clean way to indicate that to
spell checkers, or to collators (which will then produce incorrect sorts).
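
A minimal sketch of that failure, not a real spell checker: the
dictionary, the word and the placement of the mark are all illustrative
assumptions, and U+1DF4 is assumed to be the code point of COMBINING
LATIN SMALL LETTER U WITH DIAERESIS.

```python
# Toy dictionary; the word and the annotation are illustrative only.
DICTIONARY = {"gut"}

def is_known(word):
    """Naive lookup, standing in for an orthographic checker."""
    return word in DICTIONARY

plain = "gut"
annotated = "gu\u1df4t"  # combining letter written above the u, no separator

print(is_known(plain))      # True
print(is_known(annotated))  # False
```

Nothing in the annotated string tells the checker that the extra mark
belongs to another level of the text.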

So IF one wanted to indicate that precisely in texts, the special
combining character(s) I propose would solve the problem (without much
effort needed to get a reasonable visual representation from text
renderers). In that case, the encoded COMBINING LATIN SMALL LETTER U
WITH DIAERESIS would be part of the subset (3,4) of the sequence I
give below.
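
As a rough sketch of how a checker could use such a character, assuming
a private-use code point (U+E000) as a purely hypothetical stand-in for
the proposed separator, and treating every combining mark that follows
it as annotation:

```python
import unicodedata

# Hypothetical stand-in for the proposed separator (not an encoded character).
SEP = "\ue000"

def orthographic_part(text, sep=SEP):
    """Drop the separator and every combining mark that follows it."""
    kept = []
    skipping = False
    for ch in text:
        if ch == sep:
            skipping = True  # start of the annotation, parts (2, 3, 4)
        elif skipping and unicodedata.category(ch).startswith("M"):
            continue  # combining mark inside the annotation (CGJ included)
        else:
            skipping = False  # next base character ends the annotation
            kept.append(ch)
    return "".join(kept)

print(orthographic_part("gu" + SEP + "\u1df4" + "t"))         # gut
print(orthographic_part("gu" + SEP + "\u0367\u0308" + "t"))   # gut
```

Diacritics placed before the separator stay with the base letter, so
part (1) is checked exactly as written.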

This also means that the sequence will NOT be canonically equivalent
when COMBINING LATIN SMALL LETTER U WITH DIAERESIS (encoded in position
(3) below, after the special character I propose in position (2)) is
replaced by:
- (3) COMBINING LATIN SMALL LETTER U
- (4) COMBINING DIAERESIS
even though it will likely render identically. We can live with that:
when using the special character I propose, it would simply be better
to state that the existing "precombined" form you want to encode now
is the preferred one.
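
The lack of canonical equivalence can be checked with Python's standard
unicodedata module. This is only a sketch: it assumes U+1DF4 for the
"precombined" combining letter, and the base letter "a" is arbitrary.

```python
import unicodedata

PRECOMBINED = "\u1df4"       # COMBINING LATIN SMALL LETTER U WITH DIAERESIS
DECOMPOSED = "\u0367\u0308"  # COMBINING LATIN SMALL LETTER U + COMBINING DIAERESIS

def canonically_equivalent(a, b):
    """Two strings are canonically equivalent if their NFD forms match."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

# The combining letter has no canonical decomposition, so the two differ.
print(canonically_equivalent("a" + PRECOMBINED, "a" + DECOMPOSED))  # False

# For contrast, an ordinary precomposed letter is equivalent to its parts.
print(canonically_equivalent("\u00fc", "u\u0308"))  # True
```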

But many other new "precombined" combining letters with diacritics
won't need to be encoded if they are intended not for orthographic
usage but only for epigraphic usage.
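
To see why a decomposed annotation needs a character of combining class
0 in front of it, here is a small sketch with the C plus CEDILLA case
from the example quoted below. The base letter "s" is an arbitrary
choice, and U+E000 is again only a hypothetical stand-in for the
proposed character (any code point with combining class 0 behaves the
same way under normalization).

```python
import unicodedata

SEP = "\ue000"     # hypothetical stand-in for the proposed separator (ccc 0)
LETTER_C = "\u0368"  # COMBINING LATIN SMALL LETTER C (ccc 230)
CEDILLA = "\u0327"   # COMBINING CEDILLA (ccc 202)
CGJ = "\u034f"       # COMBINING GRAPHEME JOINER (ccc 0)

def nfc(text):
    return unicodedata.normalize("NFC", text)

# No separator: the cedilla is reordered next to the base and composes with it.
print(nfc("s" + LETTER_C + CEDILLA) == "\u015f" + LETTER_C)  # True

# With the separator, the base letter is left untouched; the two marks
# after it are still reordered among themselves.
print(nfc("s" + SEP + LETTER_C + CEDILLA) == "s" + SEP + CEDILLA + LETTER_C)  # True

# A CGJ between (3) and (4) keeps the written order as well.
seq = "s" + SEP + LETTER_C + CGJ + CEDILLA
print(nfc(seq) == seq)  # True
```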

On 12 March 2012 00:56, Philippe Verdy <verdy_p_at_wanadoo.fr> wrote:
> One example: say you want to encode an epigraphic C with CEDILLA
> appearing as a letter above another one. You would encode:
>
> - (1) the orthographic base letter (with its standard diacritics,
> including CGJ if needed)
> - (2) the new special combining character with combining class 0 that I propose.
> - (3) the existing combining letter C
> - (4) the existing combining CEDILLA (or other existing diacritics,
> including CGJ if needed to avoid reorderings by normalizers).
>
> Renderers have hints given by the character (2) that they must not
> reorder/mix/compose randomly the characters between parts (1) and (3,
> 4). But they also have the hint that they can precompose safely the
> characters in (3, 4) without breaking anything. And they don't have to
> represent the character (2) itself (they could do it, still, using
> some other layout mechanisms).
>
> Semantic analysers know how to interpret the characters in (3, 4) together,
> at the semantic level they associate with the special character
> (2).
>
> Orthographic checkers know that characters (2,3,4) are to be ignored:
> they'll only check the characters in (1), ignoring the rest as indicated
> by the character (2), to which they don't attach any orthographic
> meaning.
>
> Sorters continue to work (characters (2, 3, 4) can be given a non-zero
> weight only at higher collation levels).
>
> On 12 March 2012 00:44, Philippe Verdy <verdy_p_at_wanadoo.fr> wrote:
>> Also, I do think that this proposal would avoid having to encode many
>> new "precomposed" diacritics made of a combining letter and a
>> diacritic applying to it. We would just encode them with such a
>> separator first, followed by the encoded combining letter and the
>> standard combining diacritics.
>>
>> With this tool, immediately, we can cover all scripts at once, for all
>> languages and all usages.
>>
>> On 12 March 2012 00:36, Philippe Verdy <verdy_p_at_wanadoo.fr> wrote:
>>> In other words, that circumflex is an epigraphic notation. This means
>>> three distinct levels of analysis of the text: one for Chi, one for
>>> the small letter above it noting something about the Chi, and another
>>> for the circumflex noting something about the Chi itself.
>>>
>>> This causes a major problem: how to separate those levels of
>>> representation cleanly when diacritics are NOT supposed to modify a
>>> letter orthographically?
>>>
>>> 1) use an upper layer protocol (this is the position constantly
>>> adopted, but it has its limits).
>>>
>>> 2) use a special invisible combining character as a prefix (with
>>> combining class 0 to avoid reorderings and other ambiguous combined
>>> forms caused by normalization) to separate and provide an unspecified
>>> additional semantic to the standard diacritics encoded after it.
>>>
>>> 3) Or possibly several such special invisible combining characters
>>> in a coherent set (we could have 16 of them, encoded at once in one
>>> column in the special plane, each one with a numeric property which
>>> does not designate how it will be used in actual texts, in a way
>>> similar to the multiple variation selectors or the multiple PUAs, which
>>> are not very well suited to combining characters), if it is needed to
>>> make semantic distinctions between these multiple (but optional)
>>> epigraphic levels.
>>>
>>>
>>> On 11 March 2012 14:06, Michael Everson <everson_at_evertype.com> wrote:
>>>> On 11 Mar 2012, at 12:05, Denis Jacquerye wrote:
>>>>
>>>>> Stacked letters are also found in some Greek manuscripts.
>>>>> See the page http://www.archive.org/stream/revuearchologi27pariuoft#page/156/mode/1up
>>>>> with some examples: Nu, omicron, omicron and Greek circumflex (tilde),
>>>>> chi and Greek circumflex.
>>>>> Would these also have to be represented by combining characters?
>>>>
>>>> Yes, but in this case I don't think that circumflex is part of the superscript letter per se. It's a base letter with a combining letter, and the whole thing has a mark over it to show it's an abbreviation. (There is obviously no chi-circumflex in Greek orthography.)
>>>>
>>>> Michael Everson * http://www.evertype.com/
>>>>
>>>>
>>>>
Received on Sun Mar 11 2012 - 19:11:07 CDT