Subject: Feedback on Unicode 4.0, chapters 1-4
Date: 30 October 2002
From: Cathy Wissink (on behalf of Windows International; see list of reviewers at end of document)
(Note: with many reviewers involved, you may see more than one comment to a particular section. I have tried to resolve the responses to be as non-conflicting as possible, but you may see slightly different suggestions to the same section as a result—[caw])
General comment to the book as a whole:
We have gotten many questions about breaking out the various UAXes and UTRs separately from the book. While the reason for leaving these out of the book is (somewhat) clear if one is involved in the process, this is not quite so obvious to the average user of the standard. It would be of benefit to call out the reasons for not adding this information to the printed book, to avoid this confusion in the future.
Page 1. "Any implementation that is conformant to Unicode is therefore conformant to ISO/IEC 10646."
Page 6, section 1.4. "The Unicode Standard is fully compatible with the International Standard ISO/IEC 10646-1:2000,...."
Conformance requirements differ between TUS and 10646. As such, the conformance comparison between the two standards is overstated in the sentence on page 1; it oversimplifies the conformance implications. On the other hand, the word choice in the comparable sentence on page 6 works; “compatible” is much more appropriate than “conformant”. Suggest changing the tone of the sentence on page 1 to match that on page 6, to avoid misleading statements.
Page 3, section 1.1. At the top of the page, “rarely exchanged … characters” are mentioned with regard to items not added to Unicode – isn’t it already the case that many of the characters that have been added (and potential future characters) are rarely exchanged? Better to remove this section than to keep such an easily challenged statement. (As contrasting examples: Chapter 2 claims that over 99% of all UTF-16 data is expressed as single code units. No one would argue that 1% does not qualify as a reasonable definition of “rarely exchanged”. Second example: Chapter 2, page 23, also claims that supplementary characters are extremely rare.)
Page 6, section 1.5: “National Committee for Information Technology” and “NCITS” should read “InterNational [note case] Committee for Information Technology” and “INCITS” respectively. (See www.incits.org for reference.)
Page 21, section 2.2. Figure 2-6 is exceptionally confusing. (One reviewer suggested the title of the figure should remain “????????”.) Why do both arrows exist in a decomposition if the figure is trying to distinguish the different types of decomposition? Suggest a new table entirely (perhaps even just a table comparison rather than a graphic), or striking the inappropriate arrows. (The definition of the arrows is also too far away from the figure—over three paragraphs away.) This whole section needs an overhaul; most reviewers were completely confused by it.
Page 24, section 2.4. Figure 2-7 should probably have the full code point value listed (e.g., U+00C5 rather than C5—unless this is UTF-8? What then is 30A?). In addition, there appears to be no distinction between the arrows, even though the text refers to hollow and solid arrows.
Page 26, section 2.4. Regarding reserved code points: it says that they cannot be interchanged yet must be preserved. This is confusing. How is a reserved code point preserved in an email if you cannot send or forward the email? It seems it is up to the implementation whether to handle this or not, right? Suggest rewriting the section to include some type of higher-level protocol override.
General statement to section 2.5. Regarding encoding forms vs. encoding schemes – this has already been heavily debated, but the whole issue of encoding forms vs. schemes seems hideously contrived, somewhat like the Unix locales that added special locales for code page information. Is it truly necessary to separate them out in this way, with two different terms that essentially describe the same thing? Suggest clarifying why this is necessary, if that is the case.
Page 29, section 2.5. (UTF-32)
The statement that UTF-32 “is a truly fixed-width character encoding form” is very confusing, as it’s implied that a one-to-one mapping between character and glyph is ensured whenever UTF-32 is used. In reality, there's no benefit in choosing UTF-32 for that expectation (i.e. assuming 'fixed-width' simplifies boundary detection logic), since developers have to deal with combining character sequences in any Unicode encoding form.
Page 30, section 2.5. (UTF-16)
In relation to the fixed-width discussion in UTF-32, the 3rd paragraph states that “UTF-16 is definitely somewhat more complicated to handle than UTF-32”, but strangely, the 6th paragraph of “Comparison of Advantages of UTF-32, UTF-16 and UTF-8” tones this down to say that “the fixed-width advantage of UTF-32 is somewhat offset by the inherently variable-width nature of processing text elements.”
Suggest removing “fixed-width” advantage argument in UTF-32, UTF-16, UTF-8 and the comparison sections (pages 29 - 32).
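The combining-sequence point can be shown concretely. The following Python sketch (our illustration, not text from the standard) demonstrates that even in UTF-32 a single user-perceived character may occupy more than one code unit:

```python
import unicodedata

# A + COMBINING RING ABOVE: canonically equivalent to U+00C5, but two
# code points, hence two UTF-32 code units for one user-perceived character.
s = "A\u030a"

utf32_units = len(s.encode("utf-32-be")) // 4
print(utf32_units)  # 2: "fixed-width" does not mean one unit per character

# The precomposed form is a single code unit, yet the same text element.
print(unicodedata.normalize("NFC", s) == "\u00c5")  # True
```

So boundary detection for text elements remains variable-width in every encoding form, which is exactly why the "truly fixed-width" claim misleads.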
Page 30, section 2.5. Under UTF-32, there is a statement that “This [the range of UTF-32] precisely matches the range of characters defined for other standards such as XML”. How is the range of characters in UTF-32 different than UTF-8 or UTF-16? What is the point of this statement? It appears to imply that UTF-32 is somehow more conformant to other standards than the other UTFs. Suggest striking statement or rewriting.
Page 32, section 2.5. Regarding the statement: "UTF-8 is reasonably compact in terms of number of bytes used. It is really only at a significant size disadvantage when used for East Asian implementations such as Chinese, Japanese, and Korean, which use Han ideographs or Hangul syllables requiring 3-byte code unit sequences in UTF-8." Do we not think that the four-byte forms are also at a disadvantage? Extension B or Deseret (for example) will have the same disadvantage, but even more so.
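The size comparison is easy to check; a small Python illustration (ours, not from the draft text) of UTF-8 sequence lengths:

```python
samples = [
    ("\u4e00", "U+4E00 CJK ideograph (BMP)"),
    ("\uac00", "U+AC00 Hangul syllable (BMP)"),
    ("\U00010400", "U+10400 Deseret capital letter (supplementary)"),
    ("\U00020000", "U+20000 CJK Extension B ideograph (supplementary)"),
]
for ch, name in samples:
    # BMP East Asian characters take 3 bytes; supplementary characters take 4.
    print(f"{name}: {len(ch.encode('utf-8'))} bytes in UTF-8")
```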
Page 32, section 2.5 (bottom of page). Regarding "This can lead to complications when trying to interoperate with sorted lists between UTF-8 systems and UTF-16 systems”:
Suggest changing the wording, for clarity: "This can lead to complications when trying to interoperate with binary sorted lists between (for example) UTF-8 systems and UTF-16 systems." Or something similar -- the idea of the first change being to limit the scope of the issue to the actual people affected and of the second change being to make it clear that there is another example (such as UTF-16 vs. UTF-32).
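The mismatch is demonstrable: in this Python sketch (our illustration), a supplementary character sorts above U+FF61 in UTF-8 binary order but below it in UTF-16 binary order, because surrogate code units occupy D800..DFFF:

```python
a = "\uff61"       # HALFWIDTH IDEOGRAPHIC FULL STOP (BMP)
b = "\U00010000"   # first supplementary code point

# UTF-8 binary order agrees with code point order:
print(a.encode("utf-8") < b.encode("utf-8"))          # True

# UTF-16 binary order does not: b is encoded with surrogates D800 DC00,
# whose code units compare below FF61.
print(a.encode("utf-16-be") < b.encode("utf-16-be"))  # False
```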
Page 33, section 2.6. The last paragraph of the section discusses ways of dealing with isolated surrogates; however, the implication is that there are only 3 different ways to handle the surrogate, as listed in the second-to-last sentence of the paragraph. Suggest changing the beginning of this sentence to read: “There are a number of different techniques for handling such conversion, including:”
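Python's codec error handlers happen to illustrate several such techniques (this is our example; the section itself does not prescribe any of them):

```python
# A lone high surrogate (D800) with no trailing low surrogate.
data = b"\xd8\x00"

# Technique 1: signal a conversion error.
try:
    data.decode("utf-16-be")
except UnicodeDecodeError as exc:
    print("conversion error:", exc.reason)

# Technique 2: substitute a replacement character (U+FFFD).
print(repr(data.decode("utf-16-be", errors="replace")))  # '\ufffd'

# Technique 3: drop the ill-formed code unit entirely.
print(repr(data.decode("utf-16-be", errors="ignore")))   # ''
```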
Page 46, section 2.12. Concerning the following statement: “Code conversion between other standards and the Unicode Standard will be considered conformant if the conversion is accurate in both directions”: “accurate” needs to be better defined. If this statement is exclusively talking about 'roundtrip' integrity, it doesn't seem to allow for the possibility of fallback mapping between legacy character sets and Unicode, which has been implemented on Windows since NT 3.1. Consider rewriting sentence to better clarify accuracy.
Furthermore, what is the benefit for Unicode to mention conversions to other standards as part of conformance requirements? While Unicode is used as a pivot to map between encodings and it is a very valuable use of the standard, it is not clear a conformance statement is needed in terms of other standards. Suggest focusing on the conformance of conversion between Unicode and encodings, not standards.
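The roundtrip-versus-fallback distinction can be shown with a toy best-fit table (the table and example are hypothetical; actual Windows best-fit mappings are far larger):

```python
# Hypothetical best-fit fallback: U+0141 has no cp1252 code point,
# so map it to plain "L" before encoding.
best_fit = {"\u0141": "L"}

src = "\u0141uk"  # "Łuk"
encoded = "".join(best_fit.get(c, c) for c in src).encode("cp1252")
decoded = encoded.decode("cp1252")

print(decoded)         # "Luk": useful as a fallback, but lossy
print(decoded == src)  # False: not accurate "in both directions"
```

A conformance clause that requires accuracy "in both directions" would appear to rule out this widely deployed technique, which is why the wording needs clarification.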
The heart of conformance should be requirements for data interchange. A document or database encoded using Unicode is as much an implementation of the standard as a C program that reads the document. The existing clauses concentrate on the processes that deal with such data, but say too little about the data itself. Specifically, “Versions of the Unicode Standard” and “Stability” must discuss the validity of existing data as the standard evolves; e.g. is every code unit sequence that conformed to Unicode 2.1.0 still valid? Which properties of a code sequence are immutable, and which may change?
Specifically, add a sentence to the end of the first paragraph on page 50. “Code unit sequences conforming to a previous version of the standard may no longer be valid”.
Conformance versus Good Practice
The conformance clauses should not mandate what is merely good practice. The existing clauses are sometimes overzealous. This weakens, not strengthens, the standard by making true conformance ambiguous as evidenced by the number of exceptions required in the explanatory notes. Who decides what additional exceptions apply? If the implementer decides, then this nullifies the conformance clause. If Unicode decides, then the list of exceptions must be complete. Specifically, it is preferable to have C9, portions of C12a, and C13 moved to a section on best practices.
It was clear when reading chapter 3 that there was some difference of opinion on many points, and obviously it must have been a difficult negotiation process to produce the chapter. However, there seems to be a great deal of conflict between compromises made in various parts of the chapter, conflict which leaves the reader wondering exactly what conformance is. Specific examples are in the chapter feedback below, but when the Editorial Committee works through these issues, it should reach general consensus on principles (e.g., forward compatibility) before making the changes, in order to maximize clarity and minimize confusion.
Page 49, Section 3.1.
Regarding the statement "Each new version of the Unicode Standard replaces the previous one and makes it obsolete": This is really too much to say -- why be so extreme in implying that the old version is now obsolete? When combined with the next paragraph, which states "Implementation should be prepared to be forward-compatible with respect to Unicode versions", we are placing a huge burden on implementers to constantly stay in sync with the latest version -- all others are obsolete. With the implementations of a great number of Unicode members probably obsolete by this definition, we think the language needs to be recast.
Page 49, section 3.1. In the fourth paragraph of 3.1, there is a reference to 5.3 -- supposedly 5.3 is informational, yet it is the source for something that implementations are supposed to do. It seems unfair to have an informative section referenced by normative recommendations.
Page 50, section 3.1. Stability -- This section makes the stability claim, yet ignores the fact that a few paragraphs ago we were told that the old versions were obsolete. This obsolescence issue should be fixed or removed; otherwise lots of other text in the chapter looks strange/conflicting.
Page 50, section 3.1. Regarding: "A version change may also involve changes to the properties of existing characters. <snip> Changes to the data files may alter program behavior that depends on them"—it is strongly desired that the UTC does not change 'normative' properties of existing characters, or otherwise previously conformant implementations will be unable to conform to the future version of the standard. Even in different versions, changes that affect program behavior should be considered unreliable.
Suggest adding the following sentence or something close at the tail of the text above: "However, normative properties of any existing characters in the standard will be kept unchanged across future versions of the Unicode Standard." If this is not the case and the Editorial Committee decides to keep the section unchanged, the UTC needs to understand the implications to implementers of changing properties across versions of the standard.
Section 3.2 (conformance clauses):
Page 51, C6: This seems at odds with W3C and others who will, in some cases, limit items such as identifiers to an earlier version and specifically not be forward compatible, such that they accept code points unassigned in their version but possibly assigned later. The reasons for doing this are just as valid as the goals espoused by Chapter 3. Is there any reason to call this behavior out as wrong, the way the chapter currently does?
Page 52, C8: Yet another section that contradicts the ideals given in 3.1. Perhaps this additional discussion should be removed entirely? It seems to conflict with the whole chapter so far.
Page 53, C11 & C12: These are an inadequate replacement for the old C1 etc. Rather than these general-purpose and rather self-evident clauses, it would be better to have more concrete rules specific to each of the encoding forms. For example, C11a could be:
C11a A UTF-8 code unit sequence conforms if and only if it is a sequence of bytes matching the forms in Table 3-6. A conforming process generates only conforming UTF-8 code unit sequences. A conforming process recognizes only conforming UTF-8 code unit sequences.
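For reference, the well-formedness test such a clause would pin down can be sketched directly from the byte ranges of Table 3-6 (a Python sketch we wrote for illustration; the table itself remains the normative source):

```python
def is_well_formed_utf8(data: bytes) -> bool:
    """Return True iff data matches the well-formed byte sequences of
    Table 3-6 (no overlong forms, no surrogates, nothing above U+10FFFF)."""
    i, n = 0, len(data)
    while i < n:
        b0 = data[i]
        if b0 <= 0x7F:                        # U+0000..U+007F
            i += 1
            continue
        if 0xC2 <= b0 <= 0xDF:                # U+0080..U+07FF
            trail = [(0x80, 0xBF)]
        elif b0 == 0xE0:                      # U+0800..U+0FFF (no overlongs)
            trail = [(0xA0, 0xBF), (0x80, 0xBF)]
        elif 0xE1 <= b0 <= 0xEC or 0xEE <= b0 <= 0xEF:
            trail = [(0x80, 0xBF), (0x80, 0xBF)]
        elif b0 == 0xED:                      # exclude surrogates D800..DFFF
            trail = [(0x80, 0x9F), (0x80, 0xBF)]
        elif b0 == 0xF0:                      # U+10000..U+3FFFF (no overlongs)
            trail = [(0x90, 0xBF), (0x80, 0xBF), (0x80, 0xBF)]
        elif 0xF1 <= b0 <= 0xF3:
            trail = [(0x80, 0xBF)] * 3
        elif b0 == 0xF4:                      # cap the range at U+10FFFF
            trail = [(0x80, 0x8F), (0x80, 0xBF), (0x80, 0xBF)]
        else:                                 # C0, C1, F5..FF never appear
            return False
        if i + len(trail) >= n:               # truncated sequence
            return False
        for k, (lo, hi) in enumerate(trail, start=1):
            if not lo <= data[i + k] <= hi:
                return False
        i += 1 + len(trail)
    return True

print(is_well_formed_utf8("A\u00e9\u4e00\U00010000".encode("utf-8")))  # True
print(is_well_formed_utf8(b"\xc0\xaf"))      # False: overlong form
print(is_well_formed_utf8(b"\xed\xa0\x80"))  # False: encoded surrogate
```

Phrasing the clause around such a concrete test ("matches the forms in Table 3-6") makes conformance checkable, which the general-purpose wording of C11 and C12 does not.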
Page 54, C12b: The first bullet point mentions UTF-16LE without defining it and then goes on to speak about it; the second bullet point then explains what little-endian is. They should be switched. (This is of course a moot point if the previous comment on C11 and C12 is accepted.)
Page 54, C12b: From the first bullet: "For example, when using UTF-16LE...<snip> any initial <FF FE> sequence is interpreted as U+FEFF ZERO WIDTH NO-BREAK SPACE (part of the text) rather than as a byte order mark (not part of the text)."
There's some confusion here. It's true that any 'initial' <FF FE> in UTF-16LE can be interpreted as a ZWNBSP, but our understanding is that, since Unicode introduced WORD JOINER (U+2060, which is part of the text), the UTC's position is that an initial U+FEFF is ensured not to be part of the text and is exclusively a BOM. This description will also confuse people into thinking that U+FFFE is a BOM (an alias of ZWNBSP), when code point U+FFFE is in fact an ensured noncharacter. Indeed, the later section (D42, 2nd bullet) refers to the BOM (U+FEFF) as not part of the text.
Suggest changing the text to: "any initial <FF FE> sequence is interpreted as U+FEFF ZERO WIDTH NO-BREAK SPACE (also known as a byte order mark) rather than as U+FFFE (the code point is ensured to be a noncharacter)." Reference: http://www.unicode.org/unicode/reports/tr28/ (search 'noncharacter' and/or 'word joiner')
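Python's codecs show the two readings side by side (our illustration of the UTF-16LE behavior discussed above):

```python
data = b"\xff\xfeA\x00"

# Decoded as the UTF-16LE encoding scheme, the initial <FF FE> is the
# code point U+FEFF and stays in the text:
print(repr(data.decode("utf-16-le")))  # '\ufeffA'

# Decoded as the "UTF-16" encoding scheme, <FF FE> is consumed as the
# byte order mark and is not part of the text:
print(repr(data.decode("utf-16")))     # 'A'
```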
Page 54, C14, C15 & C16: It is preferable to have these three replaced with the single clause referencing Annex #15 and with wording parallel to C13 e.g. “A process that normalizes text shall yield results identical to the process described in Annex #15”. Even better, replace this with text that defines a conforming normalized code unit sequence.
Specific to C16: There is an implication in C16 that if an implementation passes the test as defined at http://www.unicode.org/unicode/reports/tr15/#Test that the implementation is conformant to the particular normalization form. We have found in our development that this test is not sufficient to completely determine conformance, and suggest rewording as listed above (in reference to C14, 15, 16).
Page 54, C17, C18, & C19: We do not understand the purpose of these clauses at all. They seem to have no bearing on implementations but rather on other standards, which are certainly out of scope. Are we unilaterally declaring that all standards, everywhere, that reference Unicode in some other fashion are non-conforming? Suggest removing them.
Page 54, C20: Again, we do not understand this clause. They seem to apply to higher-level protocols, not to implementations of the standard. A higher-level protocol is, by definition, outside the scope of the standard. So how can it make a normative reference? In particular, if the higher-level protocol chooses to use provisional properties or completely bogus properties, that is its business and not subject to this standard, as long as the protocol does not imply conformance. Suggest removing it.
Page 55, C21: Contradicts D8a. The wording in D8a is preferred. The standard should not require conformance to an algorithm but rather use algorithms as a convenient way to express the desired result. Conformance is to the result, not the algorithm. Suggest removing it.
Page 55, section 3.2. Under Unicode Standard Annexes - Why is UAX#29 listed since we have not actually approved it yet? This seems odd.
Page 56, section 3.4, D2b, last note: As with C20, it is not clear how (or why) we can limit higher-level protocols in this way. Consider the HTML fragment <p>B<img…>C where B is a base character and C is a combining character. According to this note, C must combine with the >. Suggest removing it.
Page 57, section 3.4. Regarding D5, 3rd and 4th bullets (decomposition): "A single abstract character may correspond to more than one code point...." One basic principle of any coded character set standard is to avoid 'duplicate' encoding of characters, so that a single abstract character is unambiguously specified with a unique bit combination.
This sentence sounds as if the Unicode Standard violates that common principle because a single abstract character in TUS 'may not be specified uniquely'. (E.g., LATIN CAPITAL LETTER A U+0041 will never 'correspond to' GREEK CAPITAL LETTER ALPHA U+0391 in spite of their typographical resemblance.) It is understood that this is because of a compatibility requirement with other external standards that Unicode refers to. If so, it seems more appropriate to add a note that refers to canonical and/or compatibility decomposition in 3.7.
Page 57, section 3.4. D7b and D7c should reference section 2.4 and Table 2-2.
Page 58, section 3.4. D8a is a very useful definition, but should be referenced in a number of places (e.g., in the rewriting of C21, since C21 contradicts D8a currently). It is too buried in the text as is. Anytime a Unicode algorithm is mentioned (bidi, UCA, etc.), this should be called out.
Section 3.9 vs. 3.10. Schemes vs. Forms – there still seems to be a lot of confusion, even by seasoned implementers. Why exactly do we need two terms here? Are they not basically the same? The differences listed in D39-D42 seem to make Unicode look just like a code page-based mechanism with different ways to represent bytes.
Page 73, section 3.10. (The last paragraph before section 3.11.) "When converting between different encoding schemes, extreme care must be taken...<snip> This is why the use of initial <EF BB BF> as a signature on UTF-8 byte sequences is not recommended by the Unicode Standard."
This is about the BOM again, but we believe this paragraph is entirely incorrect in the view of the UTC, which introduced WORD JOINER. It states that converting U+FEFF between UTF-16 and UTF-8 introduces ambiguity in whether to interpret the character as a BOM or as ZWNBSP. This is totally untrue. Initial U+FEFF in a Unicode stream is always interpreted as 'a signature of a Unicode character sequence (or Annex H in 10646-1)'; for the purpose of ZWNBSP as part of the text, U+2060 should be used.
Suggest deleting the paragraph or consider re-wording so that it claims there should be no ambiguity in interpretation of the 'initial' U+FEFF.
(By the way, where does TUS put the conformance description for a process like the 'cat' program, which concatenates two Unicode text streams that both start with an 'initial' U+FEFF? Should the 2nd U+FEFF be replaced with U+2060, or should such a process remove the 2nd one from the resulting text? Would either choice conflict with other conformance descriptions?)
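One plausible policy for such a concatenating process (our assumption; the standard does not currently specify this) is to strip the signature from the second stream, as in this Python sketch:

```python
def concat_utf8_streams(a: bytes, b: bytes) -> bytes:
    """Concatenate two UTF-8 streams, dropping an initial <EF BB BF>
    signature from the second stream so it is not misread as content.
    This policy is an assumption, not something the standard mandates."""
    signature = b"\xef\xbb\xbf"
    if b.startswith(signature):
        b = b[len(signature):]
    return a + b

first = "\ufeffHello, ".encode("utf-8")
second = "\ufeffworld".encode("utf-8")
print(repr(concat_utf8_streams(first, second).decode("utf-8")))
# '\ufeffHello, world': only the leading signature survives
```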
Page 73, section 3.11. This looks like material for UAX #15. Why is it here? If normalization is important enough to be part of the book, then UAX #15 should be put in the book. If not, then why have 3-4 pages of information on it?
Metacomment on BOM vs. ZWNBSP vs. U+2060 (spanning chapters 2, 3, 4):
· Chapter 2, table 2-3: states BOM is not allowed and does not really seem to even suggest that the ZWNBSP option is there.
· Chapter 3, D40 and D41 both state that if the BOM is there, then it’s really a ZWNBSP.
· Chapter 4 at the end introduces the Word Joiner (U+2060) as preferred over ZWNBSP, stating that ZWNBSP should not be used and that the code point should always be a BOM.
Could the three different people who wrote these bits of text come to a consensus on exactly what the BOM/ZWNBSP/U+2060/UTF-16LE/UTF-16BE definitions are with regard to this one [troublesome] character? Especially since, if the BOM matches the encoding scheme, one can argue this was an intentional effort to make sure that applications which may not understand the higher-level protocol can still see what the encoding scheme is. That is a much more reasonable assumption, and a perfectly valid thing to do (except that the standard claims it is not conformant).
(Reviewers: Cathy Wissink, Michael Kaplan, John McConnell, Yasuhiro Anan, Shawn Steele, Igor Sinitsyn)