Named character sequences and canonical equivalence, was: Cyrillic - accented/acuted vowels

From: Peter Kirk (peterkirk@qaya.org)
Date: Fri May 06 2005 - 20:11:51 CDT

    On 07/05/2005 01:02, Kenneth Whistler wrote:

    >>On 06/05/2005 21:35, Philippe Verdy wrote:
    >>
    >>
    >>
    >>>...
    >>>
    >>>So a good question is:
    >>>Can a "Unicode Named Character Sequence" be recognized as a single
    >>>entity, when there are other combining characters in the middle of the
    >>>sequence,
    >>>
    >>>
    >
    >No. The specification should be clear.
    >
    >A Unicode Named Character Sequence is a specific sequence of
    >code points associated with a name.
    >
    >
    >
    Does it have to be a sequence which is stable under all kinds of
    canonical transformation? Or just under normalisation? Can it ever be a
    sequence of a base character and a combining character (of combining
    class greater than 1)? If it can, then there is always the possibility
    that a combining character of lower combining class is also combined
    with the same base character. Canonical reordering then moves that
    character ahead of the one in the named sequence, so the sequence is
    not stable under normalisation. Yet several of the examples given in
    UAX #34 are exactly such sequences. This is the issue which Philippe
    was trying to address, as I understood it.

    >It is not a maximal set of canonically equivalent sequences of
    >code points associated with a name.
    >
    >
    >
    >>>and when moving those extra combining characters at end of
    >>>the named sequence is still canonically equivalent? My opinion is that
    >>>such named sequence should still be recognized (due to the canonical
    >>>equivalence), to help for interoperability.
    >>>
    >>>
    >>>
    >>I agree,
    >>
    >>
    >
    >And I disagree, because this is not the problem that
    >Unicode Named Character Sequences were aimed at.
    >
    >
    >
    This appears to be the problem that these sequences were aimed at,
    quoting UAX #34:

    > Having a conventional notation for sequences of Unicode code points
    > treated as a unit is useful in a number of circumstances. For example,
    > other standards may need to refer to entities which are represented in
    > Unicode by sequences of characters. Mapping tables may map single
    > characters in other standards to sequences of Unicode characters. And
    > listings of repertoire coverage for fonts or keyboards may need to
    > reference entities which do not correspond to single Unicode code points.

    The issue which I was considering (I'm not sure about Philippe) was
    cases of mapping tables between Unicode and existing de facto standards,
    for fonts and keyboards, in which a single entity in the de facto
    standard corresponds to a character sequence in Unicode, and one which
    is not stable under normalisation. In fact precisely this is true of the
    example named sequence LATIN SMALL LETTER I WITH MACRON AND GRAVE, which
    is not stable under normalisation when followed by a combining character
    of lower combining class than that of MACRON and GRAVE. This may not be
    a practical issue in Livonian, but it is a practical issue in some other
    languages (and one which has been complicated by Unicode's choice of
    combining classes).
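
    To make the instability concrete, here is a small sketch using
    Python's standard unicodedata module (a tool chosen only for
    illustration; nothing in UAX #34 prescribes it). It assumes the named
    sequence is <U+012B, U+0300> and uses U+0323 COMBINING DOT BELOW as
    an arbitrary example of a mark with a lower combining class.

```python
import unicodedata

# Assumed spelling of LATIN SMALL LETTER I WITH MACRON AND GRAVE:
# U+012B LATIN SMALL LETTER I WITH MACRON + U+0300 COMBINING GRAVE ACCENT.
NAMED_SEQUENCE = "\u012B\u0300"
# Example interrupting mark (an assumption for this sketch):
# U+0323 COMBINING DOT BELOW, combining class 220.
DOT_BELOW = "\u0323"


def code_points(text):
    """Render a string as a list of U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


text = NAMED_SEQUENCE + DOT_BELOW
nfd = unicodedata.normalize("NFD", text)
nfc = unicodedata.normalize("NFC", text)

# Combining classes of macron, grave and dot below.
classes = [unicodedata.combining(ch) for ch in "\u0304\u0300" + DOT_BELOW]
print("classes:", classes)
print("NFD:", code_points(nfd))
print("NFC:", code_points(nfc))
# A naive substring search no longer finds the named sequence.
print("found in NFC:", NAMED_SEQUENCE in nfc)
print("found in NFD:", unicodedata.normalize("NFD", NAMED_SEQUENCE) in nfd)
```

    In both normalisation forms the dot below ends up between the base
    letter and the macron, so a search for the contiguous named sequence
    fails even though the text is canonically equivalent to the named
    sequence followed by a dot below.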

    The problem is of course found when converting from Unicode to one of
    the "other standards" referred to above. A converter may expect the
    character sequence to be an uninterrupted sequence, and may fail to
    recognise a sequence if interrupted because of canonical reordering. If
    it recognises a sequence in one order but not in a canonically
    equivalent form, then it is not doing what it is supposed to do. Of
    course this is a problem whether or not the sequence is formally a
    Unicode named character sequence.
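
    One way a converter might cope is sketched below, again in Python
    with the standard unicodedata module. The function names
    split_clusters and match_sequence are hypothetical, and the sketch
    only handles targets made of one base character plus combining
    marks. It works on NFD and asks whether a cluster is canonically
    equivalent to the target sequence followed by some leftover marks.

```python
import unicodedata


def split_clusters(text):
    """Split NFD text into runs of one starter plus following non-starters."""
    clusters = []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch) == 0 or not clusters:
            clusters.append(ch)
        else:
            clusters[-1] += ch
    return clusters


def match_sequence(cluster, target):
    """Return the leftover marks if `cluster` is canonically equivalent to
    `target` followed by those marks; return None if there is no match.
    `cluster` must already be in NFD (as produced by split_clusters)."""
    wanted = unicodedata.normalize("NFD", target)
    if not cluster or cluster[0] != wanted[0]:
        return None
    rest = list(cluster[1:])
    for mark in wanted[1:]:
        if mark not in rest:
            return None
        rest.remove(mark)  # drop the first occurrence of this mark
    leftover = "".join(rest)
    # Pulling the target's marks to the front must not change the text.
    if unicodedata.normalize("NFD", wanted + leftover) != cluster:
        return None
    return leftover


I_MACRON_GRAVE = "\u012B\u0300"
interrupted = split_clusters("\u012B\u0300\u0323")[0]
swapped = split_clusters("i\u0300\u0304")[0]  # grave before macron

print(match_sequence(interrupted, I_MACRON_GRAVE) == "\u0323")
print(match_sequence(swapped, I_MACRON_GRAVE) is None)
```

    The final equivalence check matters: marks of the same combining
    class (here macron and grave) are not reordered by normalisation, so
    grave followed by macron is a different sequence and is rejected.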

    > ...
    >
    >
    >
    >>e.g. for such
    >>meaningful concepts as HEBREW LETTER SIN WITH DOT and HEBREW LETTER SHIN
    >>WITH DOT, because these are commonly combined with other combining
    >>characters of lower combining class than SIN DOT and SHIN DOT.
    >>
    >>
    >
    >Such textual elements are already represented using the
    >standard, as either:
    >
    ><U+05E9, U+05C2>
    >
    >or as:
    >
    >U+FB2B HEBREW LETTER SHIN WITH SIN DOT
    >
    >-- which two are canonically equivalent sequences.
    >
    >Creating a name for the first sequence would be pointless, since
    >there already *is* a character name for a canonically equivalent
    >encoded character. ...
    >

    I agree that it would be redundant to create a named sequence in this
    case. When I wrote before I had temporarily forgotten about this
    deprecated presentation form. But it does seem strange that, for a
    character table in which the Unicode side of the mapping must be either
    a single character or a named sequence, you are now proposing that a
    presentation form should be used although use of these forms has been
    deprecated.

    If presentation forms are used in this way, converters from Unicode to
    legacy standards will need to be aware of their decompositions, because
    normalised Unicode input is never composed into presentation form
    characters. In practice I suspect that such a converter will need to
    operate on NFD and work with decomposed forms of all characters with
    canonical decompositions, which for the purposes of the converter will
    be equivalent to character sequences.
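
    A minimal sketch of that approach follows, in Python with the
    standard unicodedata module. The table contents and the legacy code
    values 0x81 and 0x82 are invented for illustration, as are the names
    LEGACY_TABLE, to_nfd_keys and lookup. The table is written with a
    presentation form on the Unicode side, then re-keyed on NFD so that
    normalised input can still be found.

```python
import unicodedata

# Hypothetical legacy mapping: Unicode side -> legacy code (values invented).
LEGACY_TABLE = {
    "\uFB2B": 0x81,        # HEBREW LETTER SHIN WITH SIN DOT (presentation form)
    "\u05DD\u05BC": 0x82,  # FINAL MEM + DAGESH, a plain character sequence
}


def to_nfd_keys(table):
    """Re-key a mapping table on the NFD form of each Unicode entry."""
    return {unicodedata.normalize("NFD", key): value
            for key, value in table.items()}


NFD_TABLE = to_nfd_keys(LEGACY_TABLE)


def lookup(unit):
    """Look up one text unit, whatever canonically equivalent form it is in."""
    return NFD_TABLE.get(unicodedata.normalize("NFD", unit))


# NFC does not recompose the presentation form, so a table keyed on
# U+FB2B alone would never match NFC input.
nfc_of_form = unicodedata.normalize("NFC", "\uFB2B")
print(nfc_of_form == "\u05E9\u05C2")
print(lookup("\u05E9\u05C2") == lookup("\uFB2B") == 0x81)
```

    Once the table is keyed this way, a presentation form and a plain
    sequence are handled identically, which is the sense in which the
    two are equivalent for the converter.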

    >... And besides, nobody is requiring formal names to
    >be given for every character sequence that might be used -- particularly
    >when you start considering for Hebrew all the potential sequences
    >that could be involved in Biblical text representation.
    >
    >

    Nobody has even suggested this. There is a rather small set of Hebrew
    character sequences concerning which "Having a conventional notation for
    sequences of Unicode code points treated as a unit is useful in a number
    of circumstances" such as those defined in UAX #34. Almost all of these
    character sequences are already canonically equivalent to presentation
    forms, and as such there is no need for new named sequences. There are
    other sequences which might merit being defined because they are often
    treated as units for typographical and keyboarding purposes, e.g. FINAL
    MEM WITH DAGESH and FINAL NUN WITH DAGESH are used occasionally (see
    examples in
    http://www.qaya.org/academic/hebrew/Ketiv-Qere-difficult.pdf, bottom of
    p.3 and top of p.4).

    >Trying to invent some "meaningful concept" for HEBREW LETTER
    >SIN WITH DOT which is different from one of the two representations
    >above in some way ...
    >

    I am not trying to do this.

    >... is just a recipe for *non*-interoperability
    >with the standard and implementations of it, rather than
    >helping any.
    >
    >Or perhaps what you really have in mind is:
    >
    >HEBREW LETTER SIN WITH DOT BECAUSE THE UTC SCREWED UP THE
    > CANONICAL CLASS ASSIGNMENT OF HEBREW COMBINING MARKS
    >
    >Would that suffice?
    >
    >

    No, Ken, this is not what I have in mind. Why do you assume that I am
    trying to stir up trouble on this issue? It looks to me as if there may
    be a real problem over named character sequences and canonical
    stability. The one example I picked was not the best because the
    canonically equivalent presentation form does exist. But a case could be
    made for defining HEBREW LETTER FINAL MEM WITH DAGESH as a named
    character sequence (or for that matter for defining it as a new
    presentation form, filling in the hole at U+FB3D), and if this decision
    were made there would then be a problem of normalisation stability when
    this sequence is further combined with QAMETS - which is a combination
    actually found in the Hebrew Bible.
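
    The reordering in this case can be checked with a short Python
    sketch using the standard unicodedata module. The variable names are
    illustrative only, and the candidate named sequence is the
    hypothetical one discussed above, not anything defined in UAX #34.

```python
import unicodedata

FINAL_MEM = "\u05DD"  # HEBREW LETTER FINAL MEM
DAGESH = "\u05BC"     # HEBREW POINT DAGESH OR MAPIQ
QAMATS = "\u05B8"     # HEBREW POINT QAMATS (QAMETS in the text above)

# Hypothetical named sequence: FINAL MEM WITH DAGESH.
candidate = FINAL_MEM + DAGESH
# Text as it might be entered: the unit first, then the vowel point.
typed = candidate + QAMATS

normalized = unicodedata.normalize("NFC", typed)
survives = candidate in normalized

print("classes:", unicodedata.combining(DAGESH), unicodedata.combining(QAMATS))
print("NFC:", " ".join(f"U+{ord(ch):04X}" for ch in normalized))
print("sequence still contiguous:", survives)
```

    Because QAMATS has a lower combining class than DAGESH, normalisation
    places it between the final mem and the dagesh, and the candidate
    sequence is no longer contiguous.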

    -- 
    Peter Kirk
    peter@qaya.org (personal)
    peterkirk@qaya.org (work)
    http://www.qaya.org/

