Named character sequences and canonical equivalence, was: Cyrillic - accented/acuted vowels

From: Peter Kirk (peterkirk@qaya.org)
Date: Fri May 06 2005 - 20:11:51 CDT

Next message: Kenneth Whistler: "Re: Named character sequences and canonical equivalence, was: Cyrillic - accented/acuted vowels"

Previous message: Rick McGowan: "UTS #6 SCSU Update released"
In reply to: Kenneth Whistler: "Re: Cyrillic - accented/acuted vowels"
Next in thread: Kenneth Whistler: "Re: Named character sequences and canonical equivalence, was: Cyrillic - accented/acuted vowels"
Maybe reply: Kenneth Whistler: "Re: Named character sequences and canonical equivalence, was: Cyrillic - accented/acuted vowels"

On 07/05/2005 01:02, Kenneth Whistler wrote:

>>On 06/05/2005 21:35, Philippe Verdy wrote:
>>
>>
>>
>>>...
>>>
>>>So a good question is:
>>>Can a "Unicode Named Character Sequence" be recognized as a single
>>>entity, when there are other combining characters in the middle of the
>>>sequence,
>>>
>>>
>
>No. The specification should be clear.
>
>A Unicode Named Character Sequence is a specific sequence of
>code points associated with a name.
>
>
>
Does it have to be a sequence which is stable under all kinds of
canonical transformation? Or just under normalisation? Can it ever be a
sequence of a base character and a combining character (of combining
class greater than 1)? If it can, then there is always the possibility
that a combining character of lower combining class is also combined
with the same base character, which means that the sequence is not
stable under normalisation. But several of the examples given in UAX #34
are such sequences, which are not stable under normalisation. This is
the issue which Philippe was trying to address, as I understood it.
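
The instability described above can be seen in a minimal sketch, assuming Python's standard unicodedata module. The base letter and marks here are hypothetical stand-ins chosen only for their combining classes; they are not taken from UAX #34.

```python
import unicodedata

# Hypothetical example: a base letter plus a mark of combining class 230.
base_plus_acute = "a\u0301"   # a + COMBINING ACUTE ACCENT
dot_below = "\u0323"          # COMBINING DOT BELOW, a lower combining class

print(unicodedata.combining("\u0301"))  # 230
print(unicodedata.combining("\u0323"))  # 220

# Append the lower-class mark after the sequence, then normalise.
text = base_plus_acute + dot_below
nfd = unicodedata.normalize("NFD", text)

# Canonical reordering moves the class-220 mark in front of the class-230
# mark, so the original two-character sequence is no longer contiguous.
print([f"U+{ord(c):04X}" for c in nfd])
print(base_plus_acute in nfd)  # False
```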

>It is not a maximal set of canonically equivalent sequences of
>code points associated with a name.
>
>
>
>>>and when moving those extra combining characters at end of
>>>the named sequence is still canonically equivalent? My opinion is that
>>>such named sequence should still be recognized (due to the canonical
>>>equivalence), to help for interoperability.
>>>
>>>
>>>
>>I agree,
>>
>>
>
>And I disagree, because this is not the problem that
>Unicode Named Character Sequences were aimed at.
>
>
>
This appears to be the problem that these sequences were aimed at,
quoting UAX #34:

> Having a conventional notation for sequences of Unicode code points
> treated as a unit is useful in a number of circumstances. For example,
> other standards may need to refer to entities which are represented in
> Unicode by sequences of characters. Mapping tables may map single
> characters in other standards to sequences of Unicode characters. And
> listings of repertoire coverage for fonts or keyboards may need to
> reference entities which do not correspond to single Unicode code points.

The issue which I was considering (I'm not sure about Philippe) was that
of mapping tables between Unicode and existing de facto standards for
fonts and keyboards, in which a single entity in the de facto standard
corresponds to a character sequence in Unicode which is not stable
under normalisation. In fact precisely this is true of the
example named sequence LATIN SMALL LETTER I WITH MACRON AND GRAVE, which
is not stable under normalisation when followed by a combining character
of lower combining class than that of MACRON and GRAVE. This may not be
a practical issue in Livonian, but it is a practical issue in some other
languages (and one which has been complicated by Unicode's choice of
combining classes).
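
As a rough check of this claim, the sketch below assumes that the named sequence is <U+012B, U+0300> and uses COMBINING DOT BELOW as a hypothetical following mark of lower combining class. It relies only on Python's standard unicodedata module.

```python
import unicodedata

# Assumed code points for LATIN SMALL LETTER I WITH MACRON AND GRAVE.
NAMED_SEQ = "\u012B\u0300"
# Hypothetical continuation: COMBINING DOT BELOW (combining class 220).
text = NAMED_SEQ + "\u0323"

def codepoints(s):
    """Format a string as a list of U+XXXX code points."""
    return " ".join(f"U+{ord(c):04X}" for c in s)

for form in ("NFC", "NFD"):
    norm_text = unicodedata.normalize(form, text)
    norm_seq = unicodedata.normalize(form, NAMED_SEQ)
    # In both forms the dot below ends up next to the base letter, so the
    # normalised named sequence is not a substring of the normalised text.
    print(form, codepoints(norm_text), norm_seq in norm_text)
```

In both normalisation forms a plain substring search for the named sequence fails, although the text is canonically equivalent to the sequence followed by the extra mark.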

The problem is of course found when converting from Unicode to one of
the "other standards" referred to above. A converter may expect the
character sequence to be an uninterrupted sequence, and may fail to
recognise a sequence if interrupted because of canonical reordering. If
it recognises a sequence in one order but not in a canonically
equivalent form, then it is not doing what it is supposed to do. Of
course this is a problem whether or not the sequence is formally a
Unicode named character sequence.
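
One way a converter might recognise such a sequence regardless of canonical reordering is sketched below. This is only an illustration under stated assumptions: the function names are hypothetical, the sequence is assumed to consist of one base character followed by combining marks, and the code points used in the usage lines are the assumed ones for LATIN SMALL LETTER I WITH MACRON AND GRAVE.

```python
import unicodedata

def clusters(text):
    """Split text (after NFD) into (starter, marks) pairs."""
    result = []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch) == 0 or not result:
            result.append((ch, ""))
        else:
            starter, marks = result[-1]
            result[-1] = (starter, marks + ch)
    return result

def cluster_has_sequence(cluster, sequence):
    """True if the cluster is canonically equivalent to the sequence
    followed by zero or more further combining marks."""
    starter, marks = cluster
    target = unicodedata.normalize("NFD", sequence)
    if target[0] != starter:
        return False
    # Remove the marks of the sequence from the cluster, one by one.
    leftover = marks
    for mark in target[1:]:
        pos = leftover.find(mark)
        if pos < 0:
            return False
        leftover = leftover[:pos] + leftover[pos + 1:]
    # Verify: the sequence plus the leftover marks must normalise
    # back to exactly the cluster we started from.
    return unicodedata.normalize("NFD", target + leftover) == starter + marks

def contains_sequence(text, sequence):
    return any(cluster_has_sequence(c, sequence) for c in clusters(text))

SEQ = "\u012B\u0300"                                # assumed code points
print(contains_sequence(SEQ + "\u0323", SEQ))       # True: interrupted by reordering
print(contains_sequence("i\u0300\u0304", SEQ))      # False: marks in a different order
```

The final renormalisation step is what keeps the match honest: marks of the same combining class are not reordered, so grave followed by macron is correctly rejected.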

> ...
>
>
>
>>e.g. for such
>>meaningful concepts as HEBREW LETTER SIN WITH DOT and HEBREW LETTER SHIN
>>WITH DOT, because these are commonly combined with other combining
>>characters of lower combining class than SIN DOT and SHIN DOT.
>>
>>
>
>Such textual elements are already represented using the
>standard, as either:
>
><U+05E9, U+05C2>
>
>or as:
>
>U+FB2B HEBREW LETTER SHIN WITH SIN DOT
>
>-- which two are canonically equivalent sequences.
>
>Creating a name for the first sequence would be pointless, since
>there already *is* a character name for a canonically equivalent
>encoded character. ...
>

I agree that it would be redundant to create a named sequence in this
case. When I wrote before I had temporarily forgotten about this
deprecated presentation form. But it does seem strange that, for a
character table in which the Unicode side of the mapping must be either
a single character or a named sequence, you are now proposing that a
presentation form should be used although use of these forms has been
deprecated.

If presentation forms are used in this way, converters from Unicode to
legacy standards will need to be aware of their decompositions, because
normalised Unicode input will never contain the presentation form
characters: normalisation decomposes them and never composes them
again. In practice I suspect that such a converter will need to operate
on NFD and work with decomposed forms of all characters with canonical
decompositions, which for the purposes of the converter will be
equivalent to character sequences.
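
A small sketch of this point, using Python's standard unicodedata module. The legacy byte values and the table itself are hypothetical, invented only for illustration; the lookup handles whole entries only and does not address sequences interrupted by other marks.

```python
import unicodedata

def codepoints(s):
    """Format a string as a list of U+XXXX code points."""
    return " ".join(f"U+{ord(c):04X}" for c in s)

# U+FB2B HEBREW LETTER SHIN WITH SIN DOT decomposes in both forms and
# is not recomposed by NFC.
for form in ("NFC", "NFD"):
    print(form, codepoints(unicodedata.normalize(form, "\uFB2B")))

# Hypothetical legacy table whose Unicode side uses presentation forms.
LEGACY_TO_UNICODE = {
    0xA1: "\uFB2B",  # shin with sin dot
    0xA2: "\uFB2A",  # shin with shin dot
}

# Key the reverse table on the NFD form of each entry.
UNICODE_TO_LEGACY = {
    unicodedata.normalize("NFD", u): b for b, u in LEGACY_TO_UNICODE.items()
}

def to_legacy(unicode_text):
    """Look up one entity; returns None if it is not in the table."""
    return UNICODE_TO_LEGACY.get(unicodedata.normalize("NFD", unicode_text))

print(hex(to_legacy("\u05E9\u05C2")))  # decomposed input still matches
print(hex(to_legacy("\uFB2B")))        # presentation form matches too
```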

>... And besides, nobody is requiring formal names to
>be given for every character sequence that might be used -- particularly
>when you start considering for Hebrew all the potential sequences
>that could be involved in Biblical text representation.
>
>

Nobody has even suggested this. There is a rather small set of Hebrew
character sequences concerning which "Having a conventional notation for
sequences of Unicode code points treated as a unit is useful in a number
of circumstances" such as those defined in UAX #34. Almost all of these
character sequences are already canonically equivalent to presentation
forms, and as such there is no need for new named sequences. There are
other sequences which might merit being defined because they are often
treated as units for typographical and keyboarding purposes, e.g. FINAL
MEM WITH DAGESH and FINAL NUN WITH DAGESH are used occasionally (see
examples in
http://www.qaya.org/academic/hebrew/Ketiv-Qere-difficult.pdf, bottom of
p.3 and top of p.4).

>Trying to invent some "meaningful concept" for HEBREW LETTER
>SIN WITH DOT which is different from one of the two representations
>above in some way ...
>

I am not trying to do this.

>... is just a recipe for *non*-interoperability
>with the standard and implementations of it, rather than
>helping any.
>
>Or perhaps what you really have in mind is:
>
>HEBREW LETTER SIN WITH DOT BECAUSE THE UTC SCREWED UP THE
> CANONICAL CLASS ASSIGNMENT OF HEBREW COMBINING MARKS
>
>Would that suffice?
>
>

No, Ken, this is not what I have in mind. Why do you assume that I am
trying to stir up trouble on this issue? It looks to me as if there may
be a real problem over named character sequences and canonical
stability. The one example I picked was not the best because the
canonically equivalent presentation form does exist. But a case could be
made for defining HEBREW LETTER FINAL MEM WITH DAGESH as a named
character sequence (or for that matter for defining it as a new
presentation form, filling in the hole at U+FB3D), and if this decision
were made there would then be a problem of normalisation stability when
this sequence is further combined with QAMETS - which is a combination
actually found in the Hebrew Bible.
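
The final mem case can be checked with a short sketch using Python's standard unicodedata module. The variable names are illustrative only, and no named sequence for this combination is assumed to exist.

```python
import unicodedata

FINAL_MEM = "\u05DD"  # HEBREW LETTER FINAL MEM
DAGESH = "\u05BC"     # HEBREW POINT DAGESH OR MAPIQ
QAMATS = "\u05B8"     # HEBREW POINT QAMATS

# The candidate sequence, then the same sequence further combined with qamats.
seq = FINAL_MEM + DAGESH
text = seq + QAMATS

for ch in (DAGESH, QAMATS):
    print(unicodedata.name(ch), unicodedata.combining(ch))

# Qamats has the lower combining class, so normalisation moves it
# between the final mem and the dagesh.
nfc = unicodedata.normalize("NFC", text)
print(" ".join(f"U+{ord(c):04X}" for c in nfc))
print(seq in nfc)  # False
```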

-- 
Peter Kirk
peter@qaya.org (personal)
peterkirk@qaya.org (work)
http://www.qaya.org/


This archive was generated by hypermail 2.1.5 : Fri May 06 2005 - 20:12:53 CDT