Re: Named character sequneces and canonical equivalence, was: Cyrillic - accented/acuted vowels

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Mon May 09 2005 - 08:39:10 CDT

In reply to: Kenneth Whistler: "Re: Named character sequneces and canonical equivalence, was: Cyrillic - accented/acuted vowels"

From: "Kenneth Whistler" <kenw@sybase.com>
>> >No. The specification should be clear.
>> >
>> >A Unicode Named Character Sequence is a specific sequence of
>> >code points associated with a name.
>> >
>> Does it have to be a sequence which is stable under all kinds of
>> canonical transformation?
>
> No, it has to be what it says it has to be: a sequence.
>
>> Or just under normalisation? Can it ever be a
>> sequence of a base character and a combining character (of combining
>> class greater than 1)? If it can, then there is always the possibility
>> that a combining character of lower combining class is also combined
>> with the same base character, which means that the sequence is not
>> stable under normalisation. But several of the examples given in UAX #34
>> are such sequences, which are not stable under normalisation. This is
>> the issue which Philippe was trying to address, as I understood it.
>
> No. Yes. So what. So what. So what. Respectively.
>
> Sorry to be glib here, but there is no reason for you and
> Philippe to take a simple thing that is what it says it is --
> a Unicode Named Character Sequence -- and start rerunning all
> the nightmare scenarios on it yet one more time.
>
> A Unicode Named Character Sequence is not some Platonic abstraction
> that needs to have some semantic identity associated with it under
> all conceivable contortions with format characters and combining
> marks in its vicinity.

So why standardize named character sequences if they have no semantics of
their own in the other related standards or mapping tables, where they DO
have semantics?
I am convinced that the existing standard named sequences have their own
semantics, and that they are already more than just named sequences, i.e.
they should be treated as single units in most processing.

I am also convinced that these named sequences will avoid the need to add
new compatibility characters (such as HEBREW LETTER SHIN WITH SHIN DOT),
which also already have their own semantics and should be treated as single
unbreakable units in most processing, but which also have a decomposition
mapping that allows them to be normalized into sequences of code points.
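
As an illustration of the decomposition mapping mentioned above, here is a
minimal Python sketch using the standard unicodedata module. It only shows
what Python's bundled Unicode data reports for U+FB2A; the variable names
are chosen for this example. Note that the mapping Python reports carries
no <compat> tag, i.e. it is a canonical decomposition, and that NFC does
not recompose the character because it is excluded from composition.

```python
import unicodedata

shin_with_dot = "\uFB2A"  # HEBREW LETTER SHIN WITH SHIN DOT

# Character name and raw decomposition mapping from the Unicode database.
name = unicodedata.name(shin_with_dot)
mapping = unicodedata.decomposition(shin_with_dot)

# Both normalization forms yield the two-code-point sequence.
nfd = unicodedata.normalize("NFD", shin_with_dot)
nfc = unicodedata.normalize("NFC", shin_with_dot)

print(name)
print(mapping)
print([f"U+{ord(c):04X}" for c in nfd])
print(nfc == nfd)
```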

To make things clear: if Unicode considers that named character sequences
are just sequences of code points without specific semantics, then they are
basically useless and unneeded in the standard (meaning that almost
everybody will ignore them, notably because they are already not stable
under normalization).
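
The instability under normalization can be reproduced with a short Python
sketch. The sequence used here (CYRILLIC SMALL LETTER A followed by
COMBINING ACUTE ACCENT) is only an assumption made for illustration, in
the spirit of the thread subject; it is not claimed to be an entry in the
named sequences list. Adding a mark of lower combining class (COMBINING
DOT BELOW) after it makes canonical reordering split the original pair.

```python
import unicodedata

# Illustrative sequence (an assumption for this example, not an official entry).
seq = "\u0430\u0301"   # CYRILLIC SMALL LETTER A + COMBINING ACUTE ACCENT
dot_below = "\u0323"   # COMBINING DOT BELOW

# Canonical combining classes drive the reordering.
ccc_acute = unicodedata.combining("\u0301")
ccc_dot = unicodedata.combining(dot_below)

# The sequence alone is unchanged by NFC ...
alone_is_stable = unicodedata.normalize("NFC", seq) == seq

# ... but with a following lower-class mark, the marks are reordered.
text = seq + dot_below
normalized = unicodedata.normalize("NFC", text)
still_contiguous = seq in normalized

print(ccc_acute, ccc_dot)
print([f"U+{ord(c):04X}" for c in normalized])
print(alone_is_stable, still_contiguous)
```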

Instead, I really view the addition of these named sequences as a convenient
way to state that these sequences are the recommended interpretations and
encodings for commonly used abstract entities which are encoded with more
than a single code point in Unicode and ISO/IEC 10646, and that they should
be recognized as such in ALL conforming processes that need to parse
combining sequences containing them, even if additional combining characters
are inserted in the middle (notably because of normalization): the
interpretation of these sequences is still kept, and the additional combining
characters in the middle or after them modify the abstract named sequence
instead of creating completely new, unrelated entities.
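
One possible reading of this proposal can be sketched in Python. This is
not an algorithm defined by Unicode or by UAX #34; the helper names
(clusters, is_subsequence, find_named_sequence) are hypothetical, and the
sketch assumes a sequence made of a single base character followed by
combining marks. Text is decomposed to NFD, split into base-plus-marks
clusters, and a cluster matches when its base equals the sequence's base
and the sequence's marks occur, in order, among the cluster's marks.

```python
import unicodedata

def clusters(text):
    """Split NFD text into (base, marks) pairs; marks have nonzero ccc."""
    out = []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch) == 0 or not out:
            out.append([ch, ""])   # a starter (or a leading mark) opens a cluster
        else:
            out[-1][1] += ch       # attach the mark to the current cluster
    return [(base, marks) for base, marks in out]

def is_subsequence(needle, haystack):
    """True if needle's characters occur in haystack in the same order."""
    it = iter(haystack)
    return all(ch in it for ch in needle)

def find_named_sequence(text, named_seq):
    """Indices of clusters in text that contain named_seq modulo extra marks."""
    seq_clusters = clusters(named_seq)
    if len(seq_clusters) != 1:
        raise ValueError("sketch only handles one base plus combining marks")
    base, marks = seq_clusters[0]
    return [i for i, (b, m) in enumerate(clusters(text))
            if b == base and is_subsequence(marks, m)]

# Hypothetical sequence: CYRILLIC SMALL LETTER A + COMBINING ACUTE ACCENT.
seq = "\u0430\u0301"
with_extra_mark = find_named_sequence("\u0431" + seq + "\u0323", seq)
without_acute = find_named_sequence("\u0430\u0323", seq)
print(with_extra_mark, without_acute)
```

The design choice here is to compare per cluster rather than per code
point, so that marks reordered or inserted by normalization do not hide
the sequence from the matcher.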

This should have consequences too when implementing collation, and if this
rule is not applicable, it is because the inserted characters radically
change the semantics, and so there should exist another standard named
sequence documented to exhibit this change of interpretation. I think it
is extremely important to make those interpretations stable across various
systems, and interoperable (notably within "complex" scripts, such as
Brahmic-based South and South-East Asian scripts with the semantics of AU
vowel sequences, or Semitic scripts like Hebrew/Samaritan, Thaana or
Ethiopic, or historic scripts like the cuneiform scripts, not yet encoded,
where named sequences are likely to be required to make the standardized
encoding usable and interoperable in practice).

The same remark should hold for modern alphabetic scripts as well (I
include here the case of modern Vietnamese written with Latin characters
and multiple diacritics, but also the case of polytonic Greek, even if most
of the needed complex sequences are already mapped into Unicode/ISO/IEC
10646 using compatibility characters or sometimes with canonically
decomposable and recomposable equivalents).

