Umlaut and Tréma, was: Variation selectors and vowel marks

From: Peter Kirk (peterkirk@qaya.org)
Date: Tue Jul 13 2004 - 13:02:15 CDT

I was surprised to see that WG2 has accepted a proposal made by the US
National Body (is this not more or less the same as the UTC?) to use CGJ
(U+034F COMBINING GRAPHEME JOINER) to distinguish between Umlaut and
Tréma in German bibliographic data.
See http://std.dkuug.dk/jtc1/sc2/wg2/docs/n2819.pdf for the proposal;
see also http://std.dkuug.dk/jtc1/sc2/wg2/docs/N2754.pdf resolution
M45.13 for the minutes of its acceptance, and
http://std.dkuug.dk/jtc1/sc2/wg2/docs/n2766r.pdf for further background
and an earlier, rejected proposal.

Back in April this year there was a long discussion on this list of
possible extensions of the variation selector mechanism to apply to
combining marks. But there was a strong feeling among UTC members etc
that this was unacceptable because of its effect on normalisation
stability. For example:

On 25/04/2004 00:30, Asmus Freytag wrote:

> At 04:00 PM 4/24/2004, Peter Kirk wrote:
>
>>> There are tons of problems once one adds in other combining marks
>>> being applied to the character as well, because then under
>>> normalization,
>>> unless the mark you were applying the variation selector to is of
>>> combining class 0, you can't assure that the variation selector will
>>> stay with the mark. Having the existing Variation Selectors behave
>>> in that way would break the normalization stability guarantee, ...
>>
>>
>> This is untrue. Normalisation stability does not apply when the text
>> is changed, and inserting a variation selector is a change to the
>> text. I have never suggested changing the combining class or other
>> normalisation properties of existing VSs. The way to ensure that a VS
>> stays with the mark it applies to is to ensure that in the part of
>> the combining character sequence before the VS all combining
>> characters are already in canonical order. Well, I can see that there
>> are potential problems where there are canonical decompositions
>> (which are not composition exclusions), but that does not apply to
>> the cases I am interested in.
>
>
> Because of normalization stability, the combining class of all
> existing variation selectors must remain at 0. A character of class 0
> interrupts canonical reordering (so that, for example, accent marks
> inside and outside an enclosing mark don't switch places).
>
> Unnormalized data is perfectly legal in Unicode and *must* be treated
> as just as equivalent to normalized data as the composed and decomposed
> normalized forms are to each other. [The rules for that are in
> chapter 3].
>
> Therefore, any scheme that only works if data is always normalized is
> not feasible.
>
> You can dream of new types of characters, which have different
> combining classes, but then, by your own admission in another part of
> this thread, you would be forced to add new characters.
>
> We have purposefully added a large number of variation selectors so
> that software can be built today that robustly covers all those
> processes where the variation selectors can be ignored. As I pointed
> out in my last message, it's a defining characteristic of variation
> selectors that there are many processes for which they should be ignored.
>
> Because of that, it would be *much* easier to even add 6 1/2 dozen new
> combining characters than a single 'specialized' new type of variation
> selector.
>
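To make the reordering point in the quoted exchange concrete, here is a
minimal sketch using Python's standard unicodedata module. It assumes a
purely hypothetical use of VARIATION SELECTOR-1 (U+FE00) directly after
COMBINING DIAERESIS; no such variation sequence is defined, and the
variable names are illustrative only. It checks whether two spellings
that are canonically equivalent without the selector remain equivalent
once the selector is attached to the diaeresis in each.

```python
import unicodedata

VS1 = "\uFE00"        # VARIATION SELECTOR-1 (combining class 0)
DIAERESIS = "\u0308"  # COMBINING DIAERESIS (combining class 230)
DOT_BELOW = "\u0323"  # COMBINING DOT BELOW (combining class 220)


def nfd(text: str) -> str:
    """Return the canonical decomposition (NFD) of text."""
    return unicodedata.normalize("NFD", text)


# Two spellings of the same two marks on "a". Canonical reordering sorts
# the marks by combining class, so both spellings normalize identically.
above_first = "a" + DIAERESIS + DOT_BELOW
below_first = "a" + DOT_BELOW + DIAERESIS
equivalent_before = nfd(above_first) == nfd(below_first)

# Hypothetical: attach VS1 directly after the diaeresis in each spelling.
# VS1 has combining class 0, so marks cannot be reordered across it.
above_first_vs = "a" + DIAERESIS + VS1 + DOT_BELOW
below_first_vs = "a" + DOT_BELOW + DIAERESIS + VS1
equivalent_after = nfd(above_first_vs) == nfd(below_first_vs)

print("combining class of VS1:", unicodedata.combining(VS1))
print("equivalent without VS1:", equivalent_before)
print("equivalent with VS1:   ", equivalent_after)
```

With the selector present the two spellings no longer normalize to the
same string, which is the situation both sides of the exchange are
arguing about: whether that counts as a stability problem or simply as
a consequence of having changed the text.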
But now it seems that WG2, and apparently also the UTC, has decided to
accept an encoding using CGJ as a pseudo-variation selector applied to a
combining mark (although positioned before it instead of after it),
despite it having all of the effects of confusing normalisation which
Asmus describes so clearly above - which are even worse in this case
because of canonical equivalences. (In practice the new combination for
tréma may be used very rarely in combination with other combining marks,
but that argument didn't wash before.) The encoding using CGJ also seems
to be overloading this character which is intended for something quite
different.
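
As a rough illustration of how such an encoding interacts with
normalisation, the sketch below compares "a" + COMBINING DIAERESIS with
"a" + CGJ + COMBINING DIAERESIS. It assumes only what is described
above, namely that CGJ is placed before the combining mark; the details
of the proposal itself are in the WG2 documents, and the helper names
here are illustrative.

```python
import unicodedata

CGJ = "\u034F"        # COMBINING GRAPHEME JOINER
DIAERESIS = "\u0308"  # COMBINING DIAERESIS


def nfc(text: str) -> str:
    """Return the canonical composition (NFC) of text."""
    return unicodedata.normalize("NFC", text)


def code_points(text: str) -> str:
    """Format a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


plain = "a" + DIAERESIS         # ordinary a + diaeresis
marked = "a" + CGJ + DIAERESIS  # CGJ inserted before the mark

plain_nfc = nfc(plain)    # base and mark are adjacent, so they compose
marked_nfc = nfc(marked)  # CGJ sits between base and mark

print("combining class of CGJ:", unicodedata.combining(CGJ))
print("plain  NFC:", code_points(plain_nfc))
print("marked NFC:", code_points(marked_nfc))
```

In this sketch the plain sequence composes to U+00E4 while the sequence
with CGJ is left as three code points, so the two stay distinct under
NFC. Because CGJ has combining class 0, it also acts as a barrier to
canonical reordering of any marks on either side of it.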

It seems to me that the UTC should bite the bullet and accept that there
is a need for variation sequences for combining marks, and either adjust
the definitions of existing variation selectors or encode new
specialised variation selectors for them. The adjusted or new variation
selectors can then be used for Hebrew as well as for German - see my
posting on this subject to the Hebrew list.

-- 
Peter Kirk
peter@qaya.org (personal)
peterkirk@qaya.org (work)
http://www.qaya.org/

This archive was generated by hypermail 2.1.5 : Tue Jul 13 2004 - 13:03:17 CDT