Re: Named character sequences and canonical equivalence, was: Cyrillic - accented/acuted vowels

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Mon May 09 2005 - 08:39:10 CDT


    From: "Kenneth Whistler" <kenw@sybase.com>
    >> >No. The specification should be clear.
    >> >
    >> >A Unicode Named Character Sequence is a specific sequence of
    >> >code points associated with a name.
    >> >
    >> >
    >> >
    >> Does it have to be a sequence which is stable under all kinds of
    >> canonical transformation?
    >
    > No, it has to be what it says it has to be: a sequence.
    >
    >> Or just under normalisation? Can it ever be a
    >> sequence of a base character and a combining character (of combining
    >> class greater than 1)? If it can, then there is always the possibility
    >> that a combining character of lower combining class is also combined
    >> with the same base character, which means that the sequence is not
    >> stable under normalisation. But several of the examples given in UAX #34
    >> are such sequences, which are not stable under normalisation. This is
    >> the issue which Philippe was trying to address, as I understood it.
    >
    > No. Yes. So what. So what. So what. Respectively.
    >
    > Sorry to be glib here, but there is no reason for you and
    > Philippe to take a simple thing that is what it says it is --
    > a Unicode Named Character Sequence -- and start rerunning all
    > the nightmare scenarios on it yet one more time.
    >
    > A Unicode Named Character Sequence is not some Platonic abstraction
    > that needs to have some semantic identity associated with it under
    > all conceivable contortions with format characters and combining
    > marks in its vicinity.

    So why standardize named character sequences if they do not carry semantics
    of their own, when other related standards and mapping tables DO give them
    semantics?
    I am convinced that the existing standard named sequences have their own
    semantics, and that they are already more than just named sequences, i.e.
    they should be treated as a single unit in most processing.

    I am also convinced that these named sequences will avoid the addition of
    new compatibility characters (such as HEBREW LETTER SHIN WITH SHIN DOT),
    which also already have their own semantics and should be treated as a
    single unbreakable unit in most processing, but which also have a
    decomposition mapping that allows them to be normalized into sequences of
    code points.

    To make things clear: if Unicode considers named character sequences to be
    just sequences of code points without specific semantics, they are basically
    useless and unneeded in the standard (meaning that almost everybody will
    ignore them, notably because they are already not stable under
    normalization).
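
To make the instability concrete, here is a minimal Python sketch using the standard unicodedata module. The sequence chosen (CYRILLIC SMALL LETTER A followed by COMBINING ACUTE ACCENT) is only an illustration taken from the subject of this thread; it is an assumption, not a claim that this sequence is listed in UAX #34. Appending a mark of lower combining class causes canonical reordering to split the original code point sequence.

```python
import unicodedata

# Illustrative sequence only (assumption, not taken from UAX #34):
# U+0430 CYRILLIC SMALL LETTER A + U+0301 COMBINING ACUTE ACCENT (class 230).
SEQ = "\u0430\u0301"

# U+0323 COMBINING DOT BELOW has combining class 220, lower than 230.
DOT_BELOW = "\u0323"

text = SEQ + DOT_BELOW
nfd = unicodedata.normalize("NFD", text)

# Canonical ordering moves the dot below in front of the acute accent,
# so the original code point sequence is no longer contiguous.
print([f"U+{ord(c):04X}" for c in nfd])
print("contiguous before normalization:", SEQ in text)
print("contiguous after normalization: ", SEQ in nfd)
```

A plain substring search for the sequence therefore succeeds on the unnormalized text and fails on the normalized one, even though both strings are canonically equivalent.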

    Instead, I really view the addition of these named sequences as a convenient
    way to state that these sequences are the recommended interpretations and
    encodings for commonly used abstract entities which are encoded with more
    than a single code point in Unicode and ISO/IEC 10646. They should be
    recognized as such in ALL conforming processes that need to parse
    combining sequences containing them, even if additional combining characters
    are inserted in the middle (notably because of normalization): the
    interpretation of these sequences is still kept, and the additional
    combining characters in the middle or after them modify the abstract named
    sequence instead of creating completely new, unrelated entities.
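
One possible reading of this proposal can be sketched in Python as follows. The helper name contains_named_sequence is hypothetical, and the matching rule (same base, marks found in order, other marks allowed in between) is my own simplification, not an algorithm defined by Unicode or UAX #34. It treats any character of combining class 0 as the start of a new base, which is too coarse for scripts whose dependent signs have class 0.

```python
import unicodedata

def contains_named_sequence(cluster: str, named_seq: str) -> bool:
    """Hypothetical helper: True if the code points of named_seq occur,
    in order, in cluster after NFD, allowing other combining marks
    to be interleaved between them."""
    c = unicodedata.normalize("NFD", cluster)
    s = unicodedata.normalize("NFD", named_seq)
    if not c or not s or c[0] != s[0]:
        return False  # the base characters must match
    rest = iter(c[1:])
    for mark in s[1:]:
        for ch in rest:
            if ch == mark:
                break  # found this mark, look for the next one
            if unicodedata.combining(ch) == 0:
                return False  # reached a new base before finding the mark
        else:
            return False  # ran out of characters without finding the mark
    return True

# Illustrative sequence only (assumption): Cyrillic a + combining acute.
seq = "\u0430\u0301"
print(contains_named_sequence("\u0430\u0301\u0323", seq))  # dot below added
print(contains_named_sequence("\u0430\u0323", seq))        # acute missing
```

With such a test, the first string is still recognized as containing the sequence after normalization has moved the dot below into the middle, while the second is not.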

    This should have consequences for implementing collation too. If the rule
    is not applicable, it is because the inserted characters radically change
    the semantics, and so there should exist another standard named sequence
    documented to exhibit this change of interpretation. I think it is
    extremely important to make those interpretations stable across various
    systems, and interoperable (notably within "complex" scripts, such as
    Brahmic-based South and South-East Asian scripts with the semantics of AU
    vowel sequences, or Semitic scripts like Hebrew/Samaritan, Thaana or
    Ethiopic, or historic scripts like the cuneiform scripts, still not
    encoded, where named sequences are likely to be required to make the
    standardized encoding usable and interoperable in practice).

    The same remark should hold for modern alphabetic scripts as well (I
    include here the case of modern Vietnamese written with Latin characters
    and multiple diacritics, but also the case of polytonic Greek, even if most
    of the needed complex sequences are already mapped into Unicode and ISO/IEC
    10646 using compatibility characters, or sometimes with canonically
    decomposable and recomposable equivalents).



    This archive was generated by hypermail 2.1.5 : Mon May 09 2005 - 08:40:59 CDT