L2/00-429

From: Helmut Richter [Helmut.Richter@lrz-muenchen.de]
Sent: Thursday, December 07, 2000 9:05 AM
Subject: Re: Errors in tables and data (Hebrew diacritics)

Dear Mr. Whistler,

Thank you very much for your swift response.

As far as combining classes are concerned, I first have to study in more
detail what they are needed for and, more importantly, what they are not
needed for. I do not use them in any of my programs or data anyway.

The main point is this:

> However, I should let you know some of the things that the UTC can and
> cannot do.
> 
> Please review the Unicode Consortium Policies:
> 
> http://www.unicode.org/unicode/standard/policies.html
> 
> These state the general principles the UTC follows to ensure
> character encoding stability.
> 
> In particular, with reference to item 1. in your document,
> "Error pertaining to the characters U+0598 and U+05AE", it isn't
> going to be possible to change the character names or to swap
> the interpretation of the characters, as you propose.
> 
> However infelicitous the names may be, we are stuck with them.
> 
> When this issue came up before, in response to review of
> Unicode 2.0, we added the "zinorit" note to U+0598, to clarify
> the intent of that character. In effect, U+0598 is what you
> identify as "Tsinnorit" (i.e. written on top), even though
> it is named HEBREW ACCENT ZARQA. And U+05AE is what you identify
> as "Zarqa, or Tsinnor" (i.e. postpositive), even though it
> is named HEBREW ACCENT ZINOR.
> 
> So the glyphs are correct and the combining classes are correct;
> only the names are misleading -- and we cannot change that.
> What we can do is add longer, discursive notes so that the
> interpretation is clear to everyone.

I think I understand that Unicode cannot change the names, nor the
interpretation if there were one. But there is none. Without the "zinorit"
note, merely the two pictures were swapped; with it, nothing fits together
any more (a 2:2 tie in the arguments for the two interpretations, see
below).

I have taken the relevant passages of the policy (quoted with * below) and
tried to interpret them for the situation at hand. I still find my
suggestion fully conformant with this policy, with the only exception that
a change in character names is not permitted. Well, I can live with
calling Tsinorit "ZINOR", at least more easily than with calling Zarqa
"ZINOR" because "ZARQA" is reserved for Tsinorit.

But here is the policy and its application:

* 1. Once a character is encoded, it will not be moved or removed.

The two characters continue to be treated as different characters, and not
as a case where the different usage leads to slightly different glyph
positionings. If it were well-defined which character is on which code
point, this association could not be changed, even if it is not satisfactory.
But it is not well-defined, see below.

* Note: Ordering of characters is handled via collation, not by moving
* characters to different codepoints.

The ordering argument I made (distinctive marks together; conjunctive marks
together) is by itself not sufficient to justify moving well-defined
characters. However, this is not what I suggested. Rather, I used the code
order for the disambiguation of the two ambiguously defined characters,
that is, as a means to tell which character is which despite their
contradictory names and properties (see below).

* 2. Once a character is encoded, its character name will not be changed.

The character 0598 will continue to carry the name ZARQA, and the
character 05AE will continue to carry the name ZINOR. My suggestion
must therefore be modified to no longer contain a change of names, but
only a more precise definition of their meaning.
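
For readers who want to see the names as they are recorded today, the
following Python sketch looks them up with the standard library's
`unicodedata` module. It assumes that the Unicode data bundled with the
interpreter still carries these names, which the stability policy quoted
above implies; the helper `accent_names` is a name made up for this
illustration.

```python
import unicodedata

# The two code points under discussion.
CODE_POINTS = (0x0598, 0x05AE)


def accent_names(code_points=CODE_POINTS):
    """Map each code point to its character name in the bundled Unicode data."""
    return {cp: unicodedata.name(chr(cp)) for cp in code_points}


if __name__ == "__main__":
    for cp, name in accent_names().items():
        print(f"U+{cp:04X}  {name}")
```

The lookup reports only the formal names; it says nothing about which
accent each name is meant to denote, which is the open question here.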

* In some cases the original name chosen to represent the character is
* inaccurate in one way or another. Any such inaccuracies are dealt
* with by adding annotations to the character name list (which is
* printed in the Unicode Standard and provided in a parseable format),
* or by adding descriptive text to the standard.

As they are, these "original names" are indeed inaccurate. Hence,
there is a need to add "descriptive text".  The question is whether to
add the following:

  The character name ZARQA stands for the character "Zarqa", in other
  contexts also known as "Tsinor", and the character name ZINOR
  stands for the character "Tsinorit".

or else:

  The character name ZARQA stands for the character "Tsinorit", and the
  character name ZINOR stands for the character "Zarqa", in other
  contexts also known as "Tsinor".

where I prefer the first version.

Neither interpretation follows from the present version of the standard.
Rather, the reader is left totally in the dark: the order in the code and
the names of the characters both favor the first interpretation, but the
glyph in the chart and the remark "zinorit" at ZARQA both favor the second
interpretation.  As you can see from my own tables, I intuitively took the
first interpretation for granted.  Changing to the second would mean
introducing a discontinuity for all those who also found this
interpretation more plausible.

Again: I do not consider the order of the code points in any way normative
in itself.  But it is one of the strongest hints for resolving the
ambiguous situation, in my eyes a stronger one than the glyph charts.
Consider as an example a code table where the letter S has been mixed up
with the §
sign.  When you read the code sequence ...,P,Q,R,S,T,... you will
spontaneously decide that the character on the code position between R and
T is the letter S, even if the glyph in the table shows § and the letter S
has a remark telling that this is also the "Saragraph" sign, whatever that
means. This is exactly the situation here: the distinctive mark (in the
rank of a "duke") Zarqa is between the "dukes" Revia and Pashta, after the
"emperors" and the "kings", ahead of the "officers" and the "servants".  
It is absolutely clear what is meant, despite the glyph in the chart
showing a Tsinorit, one of the "servants".
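
The neighbourhood argument can be reproduced from the character names
alone. The sketch below is only an illustration under the assumption that
the interpreter's bundled Unicode data is used; `neighbours` is a
hypothetical helper, not part of any standard tool. It lists a code point
together with the code points directly before and after it.

```python
import unicodedata


def neighbours(code_point, radius=1):
    """Return (code point, name) pairs for a code point and its neighbours."""
    result = []
    for cp in range(code_point - radius, code_point + radius + 1):
        # Unassigned positions have no name; report them as such.
        result.append((cp, unicodedata.name(chr(cp), "<unassigned>")))
    return result


if __name__ == "__main__":
    for cp, name in neighbours(0x0598):
        print(f"U+{cp:04X}  {name}")
```

For U+0598 this shows the names REVIA, ZARQA and PASHTA in sequence. It
cannot show which glyph or accent was intended at each position; that
still has to be argued from the charts and the annotations.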

In short: if there is a unique current situation at all which, according
to your policies, must not be changed, then it is the first of the two
interpretations.

* 3. Once a character is encoded, its canonical combining class and
* decomposition (either canonical or compatibility) will not be
* changed in a way that would affect normalization.

I'll check that later. I consider the impact of adjusting two combining
classes much smaller than the impact of having two characters swapped.
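
As a starting point for that check, the canonical combining classes can
be read from the same bundled data. This is a minimal sketch, assuming
Python's `unicodedata` module; `combining_classes` is a hypothetical
helper name, and the values printed are whatever the interpreter's
Unicode version records.

```python
import unicodedata


def combining_classes(code_points=(0x0598, 0x05AE)):
    """Map each code point to its canonical combining class (0 = not reordered)."""
    return {cp: unicodedata.combining(chr(cp)) for cp in code_points}


if __name__ == "__main__":
    for cp, ccc in combining_classes().items():
        print(f"U+{cp:04X}  combining class {ccc}")
```

Marks with different non-zero classes are reordered relative to each
other by canonical normalization, which is why policy item 3 restricts
changes to these values.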

* 4. Once a character is encoded, its properties may still be changed,
* but not in such a way as to change the fundamental identity of the
* character.

It is precisely the "fundamental identity" that the two characters now
lack altogether: each of them has some properties of one character
and some properties of the other.


To sum up: I am aware that it would have been better had my comments
arrived before the first attempt to clarify the situation was undertaken.
From the direction of that modification, I conclude that the situation its
inventors had in mind more closely resembles the second interpretation of
the ambiguity. If the situation were consistent now, we would have to live
with it. A consistent situation, however, exists at most in the heads of
the code designers, but certainly not in the written standard. There is
still a need for clarification, and I continue to suggest that the
existing name ZARQA be used to denote the existing character Zarqa and
that the two characters not be swapped, even though this was attempted
with the last modification. After all, this is, despite the modification,
the more plausible interpretation of the current standard, and the
standard should not be interpreted by means of its modification history.


Best regards,

Helmut Richter

==============================================================
Dr. Helmut Richter                       Leibniz-Rechenzentrum
Tel:   +49-89-289-28785                  Barer Str. 21
Fax:   +49-89-2809460                    D-80333 Muenchen
Email: Helmut.Richter@lrz-muenchen.de    Germany
==============================================================

