L2/00-429

From: Helmut Richter [Helmut.Richter@lrz-muenchen.de]
Sent: Thursday, December 07, 2000 9:05 AM
Subject: Re: Errors in tables and data (Hebrew diacritics)

Dear Mr. Whistler,

thank you very much for your swift response.

As far as combining classes are involved, I first have to study in more
detail for what purpose they are needed and, more importantly, for what
purpose they are not. I do not use them in any of my programs or data
anyway.

The main point is this:

> However, I should let you know some of the things that the UTC can and
> cannot do.
>
> Please review the Unicode Consortium Policies:
>
> http://www.unicode.org/unicode/standard/policies.html
>
> These state the general principles the UTC follows to ensure
> character encoding stability.
>
> In particular, with reference to item 1. in your document,
> "Error pertaining to the characters U+0598 and U+05AE", it isn't
> going to be possible to change the character names or to swap
> the interpretation of the characters, as you propose.
>
> However infelicitous the names may be, we are stuck with them.
>
> When this issue came up before, in response to review of
> Unicode 2.0, we added the "zinorit" note to U+0598, to clarify
> the intent of that character. In effect, U+0598 is what you
> identify as "Tsinnorit" (i.e. written on top), even though
> it is named HEBREW ACCENT ZARQA. And U+05AE is what you identify
> as "Zarqa, or Tsinnor" (i.e. postpositive), even though it
> is named HEBREW ACCENT ZINOR.
>
> So the glyphs are correct and the combining classes are correct;
> only the names are misleading -- and we cannot change that.
> What we can do is add longer, discursive notes so that the
> interpretation is clear to everyone.

I think I understand that Unicode cannot change the names, nor the
interpretation if there were one. But there is none.
Without the "zinorit" note, just the two pictures were swapped; with it,
nothing fits together any more (a 2:2 tie in arguments for the two
interpretations, see below).

I have taken the relevant passages of the policy (quoted with * below)
and tried to interpret them for the situation at hand. I still find my
suggestion fully conformant with this policy, with the only exception
that a change in character names is not permitted. Well, I can live with
calling Tsinorit "ZINOR", at least better than with calling Zarqa "ZINOR"
because "ZARQA" is reserved for Tsinorit. But here is the policy and its
application:

* 1. Once a character is encoded, it will not be moved or removed.

The two characters continue to be treated as different characters, and
not as a case where the different usage leads to slightly different
glyph positionings. If it were well-defined which character is on which
code point, this association could not be changed, even if it is not
satisfactory. But it is not well-defined, see below.

* Note: Ordering of characters is handled via collation, not by moving
* characters to different codepoints.

The order argument I wrote (distinctive marks together; conjunctive
marks together) is by itself not sufficient to enforce a move of
well-defined characters. However, this is not what I suggested. Rather,
I used the code order for the disambiguation of the two ambiguously
defined characters, that is, as a means to tell which character is which
despite their contradictory names and properties (see below).

* 2. Once a character is encoded, its character name will not be changed.

The character 0598 will continue to carry the name ZARQA, and the
character 05AE will continue to carry the name ZINOR. My suggestion must
therefore be modified to no longer contain a change of names, but only a
more precise definition of their meaning.

* In some cases the original name chosen to represent the character is
* inaccurate in one way or another.
* Any such inaccuracies are dealt
* with by adding annotations to the character name list (which is
* printed in the Unicode Standard and provided in a parseable format),
* or by adding descriptive text to the standard.

As they are, these "original names" are indeed inaccurate. Hence, there
is a need to add "descriptive text". The question is whether to add the
following:

    The character name ZARQA stands for the character "Zarqa", in other
    contexts also known as "Tsinor", and the character name ZINOR stands
    for the character "Tsinorit".

or else:

    The character name ZARQA stands for the character "Tsinorit", and
    the character identifier ZINOR stands for the character "Zarqa", in
    other contexts also known as "Tsinor".

where I prefer the first version.

Neither interpretation follows from the present version of the
standard. Rather, the reader is left totally in the dark: the order in
the code and the names of the characters both favor the first
interpretation, but the glyph in the chart and the remark "zinorit" at
ZARQA both favor the second interpretation. As you can see from my own
tables, I intuitively took the first interpretation for granted.
Changing to the second would mean introducing a discontinuity for all
those who also found this interpretation more plausible.

Again: I do not consider the order of the code points in any way
normative in itself. But it is one of the strongest hints for the
resolution of the ambiguous situation, in my eyes stronger than glyph
charts. Consider as an example a code table where the letter S has been
mixed up with the § sign. When you read the code sequence
...,P,Q,R,S,T,... you will spontaneously decide that the character on
the code position between R and T is the letter S, even if the glyph in
the table shows § and the letter S has a remark telling that this is
also the "Saragraph" sign, whatever that means.
This is exactly the situation here: the distinctive mark (in the rank of
a "duke") Zarqa is between the "dukes" Revia and Pashta, after the
"emperors" and the "kings", ahead of the "officers" and the "servants".
It is absolutely clear what is meant, despite the glyph in the chart
showing a Tsinorit, one of the "servants".

To sum up: if there is a unique current situation at all which,
according to your policies, must not be changed, then it is the first of
the two interpretations.

* 3. Once a character is encoded, its canonical combining class and
* decomposition (either canonical or compatibility) will not be
* changed in a way that would affect normalization.

I'll check that later. I consider the impact of adjusting two combining
classes much smaller than the impact of having two characters swapped.

* 4. Once a character is encoded, its properties may still be changed,
* but not in such a way as to change the fundamental identity of the
* character.

It is precisely the "fundamental identity" that the two characters now
lack altogether: each of them has some properties of one character and
some properties of the other.

To sum up: I am aware that it would have been better had my comments
arrived before the first attempt to clarify the situation was
undertaken. From the direction of that modification, I conclude that the
situation its inventors had in mind resembles more the second
interpretation of the ambiguity. If the situation were consistent now,
we would have to live with it. A consistent situation, however, exists
at most in the heads of the code designers, but certainly not in the
written standard. There is still a need for clarification, and I
continue to suggest that the existing name ZARQA be used to denote the
existing character Zarqa, and that the two characters not be swapped,
even though this was attempted with the last modification.
After all, this is, despite the modification, the more plausible
interpretation of the current standard, and the standard should not be
interpreted by means of its modification history.

Best regards,

Helmut Richter

==============================================================
Dr. Helmut Richter
Leibniz-Rechenzentrum      Tel:   +49-89-289-28785
Barer Str. 21              Fax:   +49-89-2809460
D-80333 Muenchen           Email: Helmut.Richter@lrz-muenchen.de
Germany
==============================================================