Re: Identifiers

From: Kenneth Whistler (kenw@sybase.com)
Date: Mon Apr 16 2001 - 15:46:23 EDT


Florian Weimer responded:
 
> DougEwell2@cs.com writes:
>
> > > > In general, the problem is unsolvable. There are several look-alikes
> > > > among the Cyrillic, Greek, Latin and Cherokee blocks, among others.
> > >
> > > And those are not equivalent under normalization? That's a pity.
> >
> > As others have explained, Unicode does not specify (nor should it) any type
> > of "normalization" mechanism to equate similar-looking glyphs that belong to
> > different scripts.
>
> There should be a method to overcome the source separation rule which
> might have saved certain identical characters from unification.

There are a couple of errors here. The "source separation rule" is a
specific rule that had to do with the handling of source sets for
Han characters. It has no applicability to the rest of the Unicode
characters.

Second, this is once again mixing up characters and glyphs. Identical
*characters* are only encoded once (with the oft-noted but small set
of exceptions such as A-ring and Angstrom). Different *characters* that
happen to have identical or near-identical glyphs, but which derive
from separate scripts, are encoded separately.

It is not a goal of the Unicode encoding to prevent glyphic ambiguity
across scripts.

However, there is an attempt to prevent the multiple encoding of
punctuation that can be used in multiple scripts -- since the script
identity of such characters is not as obvious as that of the letters.
So there *is* an attempt to prevent the needless cloning of such
characters in multiple scripts -- which would clearly be another source
of visual ambiguity if it were to occur.

>
> > - U+0048 LATIN CAPITAL LETTER H
> > - U+0397 GREEK CAPITAL LETTER ETA
> > - U+041D CYRILLIC CAPITAL LETTER EN
> > - U+13BB CHEROKEE LETTER MI
>
> If these were Han glyphs, they would have been unified, wouldn't they? ;-)

No, because this is a meaningless counterfactual.

You determine ahead of time that Han *characters* used in China, Japan,
Korea, and Vietnam are all the same *script*. Then you apply unification
within that script, so as not to needlessly duplicate the same characters
over and over because of different national encodings for different
subsets of Han characters.

But we determined long ago that for the purposes of computer character
encoding, Latin, Greek, Cyrillic, and Cherokee are distinct scripts.
Unifications are *not* applied across scripts just because letters
happen to look alike in particular instances.
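This is easy to check directly. A short sketch using Python's standard
unicodedata module (the character list is the one quoted above) shows that
no normalization form folds the four look-alikes together, while a genuine
duplicate such as ANGSTROM SIGN does normalize away:

```python
import unicodedata

# Four distinct characters with nearly identical glyphs,
# one from each of four scripts:
lookalikes = ["\u0048",   # LATIN CAPITAL LETTER H
              "\u0397",   # GREEK CAPITAL LETTER ETA
              "\u041D",   # CYRILLIC CAPITAL LETTER EN
              "\u13BB"]   # CHEROKEE LETTER MI

# No normalization form unifies them -- all four stay distinct:
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    assert len({unicodedata.normalize(form, c) for c in lookalikes}) == 4

# By contrast, ANGSTROM SIGN really is a duplicate encoding of
# A-ring, and canonical normalization maps it away:
assert unicodedata.normalize("NFC", "\u212B") == "\u00C5"
```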

>
> I don't think it's a general Unicode problem, but you have to know
> about these issues in order to design protocols which permit a large
> Unicode subset in identifiers and can nevertheless be used
> successfully.

I agree with this. If people don't realize that they are going to
come up against glyph ambiguity problems when dealing with Unicode,
then they should be told so in no uncertain terms.
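One thing a protocol designer can do is flag identifiers that mix scripts.
Python's unicodedata module exposes no Script property, so the sketch below
(the helper name rough_scripts is my own) approximates the script from the
first word of each character's Unicode name -- crude, but enough to
illustrate the point:

```python
import unicodedata

def rough_scripts(ident):
    """Approximate the set of scripts used in an identifier.

    Uses the first word of each character's Unicode name
    (LATIN, GREEK, CYRILLIC, CHEROKEE, ...) as a rough
    stand-in for a real Script property.
    """
    return {unicodedata.name(ch).split()[0] for ch in ident if ch.isalpha()}

# "HEN" spelled with GREEK CAPITAL LETTER ETA in place of Latin H
# looks identical on screen but mixes two scripts:
assert rough_scripts("HEN") == {"LATIN"}
assert rough_scripts("\u0397EN") == {"GREEK", "LATIN"}
```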

But as you indicated, this is not really anything new to Unicode.
It is the same inherent problem that comes from trying to deal
with a collection of ISO 8859-n encodings together, which may
happen to mix Latin, Greek, and Cyrillic characters. Any protocol
that has a problem with such a mix is going to have the same
problem, on an even bigger scale when it moves to Unicode.
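The overlap is visible right in the byte tables: the same byte decodes to
different look-alike letters depending on which 8859 part you assume. A
small Python illustration:

```python
# One byte, different letters, depending on the assumed ISO 8859 part.
# 0xC7 is LATIN CAPITAL LETTER C WITH CEDILLA in 8859-1 but
# GREEK CAPITAL LETTER ETA (a Latin-H look-alike) in 8859-7:
assert b"\xc7".decode("latin-1") == "\u00C7"    # C with cedilla
assert b"\xc7".decode("iso8859-7") == "\u0397"  # Eta

# 0xBD in 8859-5 is CYRILLIC CAPITAL LETTER EN -- another H look-alike:
assert b"\xbd".decode("iso8859-5") == "\u041D"  # En
```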

--Ken



This archive was generated by hypermail 2.1.2 : Fri Jul 06 2001 - 00:17:16 EDT