L2/02-384

Subject: Comments on Unicode 4.0 draft chapters
From: Sandra Martin O'Donnell
      Hewlett-Packard Company
Date: Oct 31, 2002

********* CHAPTER 2 *********

#1 Page 12, third paragraph from the bottom.

The paragraph discusses "grapheme clusters," and points readers to UAX #29 (which, by the way, currently is a DUTR, not an approved standard annex). Given the confusion about characters, text elements, glyphs, etc., if this term really is to be used the way "a user thinks of as a `character'", then it needs to be in the book (Chapter 3?), rather than in an annex.

#2 Page 12, third paragraph from the bottom.

Illustrating the terminology confusion, the paragraph states that "Figure 2-1 illustrates the relationship between abstract characters and grapheme clusters", but Figure 2-1 is titled "Text elements and characters." I assume grapheme clusters is supposed to equate to characters, but then what does abstract characters equate to? It's not the text elements in the figure. For example, I don't think anyone believes the word "cat" is an abstract character, but it can be a text element. There is a lot of confusion about grapheme clusters. I recommend restoring the original terminology used in Figure 2-1 (Text elements and characters).

#3 (NIT) page 13, next to last bullet on page, final sentence

Most other references in the book point to a specific section rather than a section *and also* a chapter. To improve consistency, it should be "See Section 8.2, Arabic, and Section 9.1, Devanagari for detailed examples of this situation" rather than "See Section 8.2, Arabic, in Chapter 8, Middle Eastern Scripts, and Section 9.1, Devanagari in Chapter 9, South Asian Scripts, for detailed examples of this situation".

#4 page 15, third paragraph under "Universality".

Remove this paragraph. It discusses implementation issues that are not always relevant to a Unicode-enabled application.
Many technologies need to continue to support non-Unicode encodings, and find that a code-set-independent design is very efficient for handling Unicode and non-Unicode. Also, Unicode itself has multiple encoding forms that mean it is necessary to understand and perhaps do late-binds based on the particular encoding form. These implementation-specific issues are not appropriate in the discussion of Unicode's universal repertoire.

#5 page 15, "Characters, Not Glyphs" section

Just a note that this and all other sections in the rest of this chapter (and Chapter 3) only discuss "characters", not "grapheme clusters". If there is a need for the new term, it should find its way elsewhere into the book. The fact that it has not found its way in is an indication that it is not needed, or is confusing.

#6 page 20, "Decompositions" section

This is not in the list of Unicode design principles, but it is at the same level (in terms of heading size) as the other principles. Is this a principle? If so, it needs to be added to the list. If not, it needs to be demoted, headings-wise, in the text.

#7 page 21, Figure 2-6.

I assume these are known glyph errors for the combining characters.

#8 page 22, first sentence in "Compatibility Characters" section

Simpler wording would be "Compatibility characters are those that would not have been encoded because they are in some sense variants of characters that already have encodings in the Unicode Standard." The parenthetical phrases are unnecessary, and there seems to be no need to introduce and use the term "normal" for non-compatibility characters.

#9 page 23, first full paragraph beginning "In the past..."

While this information is correct, I wonder how many non-experts will glean anything from it? Are we writing the book for gurus, and therefore need maximum precision, or for average developers, who need clear English?

#10 page 23, last paragraph; 2nd sentence

The text is unclear to me.
It states: "Note that some abstract characters may be associated with more than one character (that is, be encoded "twice")." Should that read "...with more than one encoded character..."?

#11 page 24, Figure 2-7

I assume some of the arrows should be solid, rather than all being hollow.

#12 page 24, paragraph beginning "When referring..."; 2nd sentence

Should that be "Encoded characters can be referred to by their code point only, but to prevent ambiguity..." rather than the current "Encoded characters can also be referred to by their code point, but to prevent ambiguity..."?

#13 page 26, 2nd paragraph in section "Encoding Forms"

The text says "...precisely-defined encoding forms specify how each integer (code point) for a Unicode character is to be expressed as a sequence of one or more code units." This is clear, but Chapter 3 still defines a Unicode scalar value. What is the difference between "code point", as described here, and "Unicode scalar value"?

#14 page 28, first bullet

This seems out of place. Should it be removed?

#15 page 28, "Encoding Schemes" subsection

The terminology for encoding forms and encoding schemes is SOOOO close that they are easily confused. Here are some suggestions:

+ Instead of defining 7 encoding schemes, some of which have names that are identical to existing encoding forms, the forms could continue to exist, and information about serializing the forms could simply be added that describes how these forms are used. Thus, the description of UTF-16 could add information about how these are serialized on big- and little-endian architectures, and how the BOM is handled/recognized. The same would be done for UTF-32.

Okay, I hear the howls now...so, Alternative 2 is:

+ Change the name from "encoding scheme" to "serialization scheme". This would alleviate the confusion between "encoding form" and "encoding scheme".
Earlier proposals to call these things CEF and CES (Character Encoding Form, and Character Encoding Scheme, respectively) suffer from the same confusion as the existing terms. The "encoding schemes" have to do with the way bytes are serialized on computer systems; they have little to do with encoding.

#16 page 29, first full paragraph

The text begins "Note that some of the Unicode encoding schemes have the same labels as the three Unicode encoding forms." This is further evidence, IMO, that we either need to remove this extra distinction (my first preference), or find names that are not so easily confused.

#17 page 29, paragraph below Figure 2-11

"In Figure 2-11, the columns labeled "Serialized" shows..." There is no column with that label in the figure.

#18 page 30, 3rd paragraph in UTF-32 section

"The value of each UTF-32 code unit corresponds exactly to the Unicode code point value." Regarding my earlier comment about the difference between "code point" and "Unicode scalar value", here's an example where it would seem logical to use "Unicode scalar value" as it's defined in Ch. 3. Do the two terms differ? Do we need the separate terms?

#19 page 32, "Comparison of the Advantages of UTF-32, UTF-16, and UTF-8"

Gee, Dad, so I guess you've always liked UTF-16 best, right?

IMO, this section is biased toward UTF-16 and against UTF-32. Since all encoding forms are co-equal within Unicode, the text should be more evenly balanced. The text currently says, "UTF-16 is the internal processing code of choice for a majority of implementations supporting Unicode." I know that the majority of *Unix* implementations support Unicode via UTF-32 (e.g., Solaris, Tru64 Unix, Linux, and HP-UX). Is it really true that UTF-16 is in the majority? Even if it is, is that relevant?

The text talks about pros and cons with respect to memory and disk space consumption, and for those considerations, UTF-16 has clear advantages.
But it gives short shrift to the kind of code one has to write to include all the checks for first-of-two, and the costs associated with having to add and maintain such checks. Even if your *data* has no surrogate pairs, the code still needs to be able to process them. This section needs to be more even-handed WRT the pros/cons of UTF-16 and UTF-32 than it currently is.

#20 page 33, Section 2.6 "Unicode Strings"

This section seems to be more about UTF-16 strings than it is about generic Unicode strings, as the heading indicates. Either the heading name should change, or the text should be made more general.

#21 (EDITORIAL) page 34, 3rd paragraph

Instead of the multiple parenthetical phrases, it would read more smoothly as "The Supplementary Multilingual Plane (SMP, or Plane 1) is dedicated to the encoding of lesser-used historical scripts, special-purpose invented scripts, and special notational systems which either could not fit into the BMP or which would be of very infrequent usage. Examples of each type include Gothic, Shavian, and musical symbols, respectively."

Later in the same paragraph, "While few scripts are currently encoded into the SMP in Unicode 4.0, there are many major and minor historical scripts do not yet have..." Remove the words "there are" in this sentence.

#22 page 44, Section 2.8 "Writing Direction"; 3rd paragraph

"East Asian scripts are frequently written in vertical lines that run from top to bottom...Most characters have the same shape and orientation when displayed horizontally or vertically..." The text first says they're written vertically, then it describes what happens when they're displayed either way. How about, "East Asian scripts are frequently written in vertical lines that run from top to bottom, right to left. Such scripts may also be written horizontally, left to right. Most characters have the same shape and orientation when displayed either horizontally or vertically..."
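
To make the "first-of-two" checks mentioned in #19 concrete, here is a minimal sketch, not taken from the draft. It counts code points in a sequence of UTF-16 code units and compares that with the UTF-32 case. The function names are hypothetical, and the sketch assumes well-formed input, so it does not report unpaired surrogates as errors.

```python
def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    raw = text.encode("utf-16-be")
    return [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]


def utf32_code_units(text):
    """Return the UTF-32 code units of text as a list of integers."""
    raw = text.encode("utf-32-be")
    return [int.from_bytes(raw[i:i + 4], "big") for i in range(0, len(raw), 4)]


def count_code_points_utf16(units):
    """Count code points, treating each surrogate pair as one code point."""
    count = 0
    i = 0
    while i < len(units):
        is_high = 0xD800 <= units[i] <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        # This is the first-of-two check: a high surrogate followed by a
        # low surrogate consumes two code units for a single code point.
        i += 2 if (is_high and has_low) else 1
        count += 1
    return count


sample = "a\U0001D11E"  # 'a' followed by U+1D11E MUSICAL SYMBOL G CLEF
u16 = utf16_code_units(sample)   # three code units: 0061 D834 DD1E
u32 = utf32_code_units(sample)   # two code units: 00000061 0001D11E
print(count_code_points_utf16(u16), len(u32))
```

With UTF-32 the count is simply the number of code units; with UTF-16 the loop has to carry the pair check even when the data happens to contain no supplementary characters.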
********* CHAPTER 3 *********

#23 page 49, third paragraph (and affecting other sections in the chapter)

What is the rationale for having the numbering of rules and definitions match that of previous versions of the standard? Does that rationale still make sense given that in V4.0, C1, C2, and C3 all have been superseded, which makes the beginning of the conformance section look odd? Does the rationale still make sense given that some definitions have changed a lot (e.g., consider V3.0's D10 Mirrored property, D10a Case property, and D11 Special character properties vs. V4.0's D10 Property alias, D10a Property value alias, and D11 Default property value)?

#24 page 50, References to the Unicode Standard section

The section seems backward. Instead of giving specific references to properties, shouldn't the section begin with the generic Unicode Standard info, and then add on the info about properties?

#25 page 52, C8; last sentence of final bullet

The sentence "In real life, any system may occassionally receive an unfamiliar character code that it is unable to interpret" seems out of place in this context. Remove?

#26 page 52, C9; first bullet

Provide an example of when implementations may want to distinguish canonical-equivalent sequences.

#27 page 53, C10; second bullet from top

"Changing the bit or byte ordering when transforming between different machine architectures..." Should that be "Changing the byte ordering..."? When would you be changing bits in a transformation between architectures? Bits will change when transforming between encoding forms, of course.

#28 page 53, C10; last bullet

"If a noncharacter which does not have a specific internal use..." Are there any noncharacters that do not have specific internal uses? I thought they had all been reserved for special purposes.
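
As an illustration of the point in #27, the following minimal sketch (not from the draft) converts UTF-16 data from big-endian to little-endian serialization by swapping the two bytes of each code unit. The helper name is hypothetical. The bits inside each byte are untouched; only the order of the bytes changes.

```python
def swap_utf16_byte_order(data):
    """Swap the two bytes of every 16-bit code unit in data."""
    if len(data) % 2 != 0:
        raise ValueError("UTF-16 data must contain an even number of bytes")
    return b"".join(data[i:i + 2][::-1] for i in range(0, len(data), 2))


text = "A\u0416"                    # U+0041 and U+0416
big = text.encode("utf-16-be")      # bytes 00 41 04 16
little = text.encode("utf-16-le")   # bytes 41 00 16 04

# Swapping bytes turns one serialization into the other, and the same
# byte values appear in both, just in a different order.
print(swap_utf16_byte_order(big) == little)
print(sorted(big) == sorted(little))
```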
#29 page 53, C11

"When a process interprets a code unit sequence which purports to be in a Unicode character encoding form, it shall interpret that code unit sequence according to the corresponding code point sequence." Huh? I don't know what this is trying to say.

#30 page 53, C12a; second bullet

"...However, the conformance clauses do not prevent processes from operating on code unit sequences that do not purport to be in a Unicode character encoding form." Is this needed? If it isn't Unicode, and doesn't "purport" to be, why would anyone think there are conformance issues?

#31 (EDITORIAL), page 53, C12a, final bullet

Two consecutive sentences that begin "For example..."

#32 page 54, C12b; first bullet

"...when using UTF-16LE,...any initial sequence is interpreted as U+FEFF ZERO WIDTH NO-BREAK SPACE...rather than as a byte order mark..." What is the rationale for interpreting it this way rather than as an error?

#33 page 54, C12b; second bullet

The explanation of endianness seems out-of-place in this conformance clause. Move elsewhere?

#34 page 56, Section 3.4 "Characters and Encoding"

Previously, conformance clause C2 says there are requirements for "code units" and that they formerly were known as "code values." But this section defines "code point" (aka "code position"), "encoded character" (aka "coded character"), and others. Where is "code unit"? Oh wait, I found it at D28a. But it seems there should be a cross-reference between this section and 3.9 (Unicode Encoding Forms) where "code unit" is defined.

#35 page 56, Section 3.4 "Characters and Encoding"

Where is the definition of "character"? The term is used throughout the book ("grapheme cluster" notwithstanding :-) ).

#36 page 57, D5, second bullet from top of page

This bullet notes that a single abstract character may have been encoded two different ways, but shouldn't it also note that this is very rare and for compatibility with other encodings?
As written, this leaves the impression that double-encodings may be more common than they are.

#37 page 57, D5, third bullet from top of page

"A single abstract character may also be represented by a sequence of code points -- for example, latin capital g with acute may be represented by the sequence U+0047,...U+0301..." Is this one "encoded character" as D5 is defining it, or two? The definition says it is a mapping "between *an* abstract character and *a* code point" (emphasis added), implying that the abstract character represented in the example is two "encoded characters". Is that right? If so, how does "encoded character" differ from "code point"? If not, why does the definition of "encoded character" talk about "*a* code point"?

#38 page 57, D6

The definition is for "coded character representation" and it notes that it is also known as a "coded character sequence". Later on the page, it notes that "Similarly, the term `character sequence' alone designates a `coded character sequence'." Why is the nickname referring to the secondary name for this term? Or, why isn't "coded character sequence" the primary name?

#39 page 59 Table 3-1 "Normative Character Properties"

The surrounding text notes that some normative properties also are immutable. Are the properties in this table also immutable? Should there be a table of normative and immutable?

#40 page 60, D10 and D10a

Examples of each of these aliases would be helpful.

#41 page 62, D17a, bullet

"Defective combining character sequences occur when a sequence of combining characters appears at the start of a string or follows a control or format character." Should these be rejected as ill-formed, or is it implementation-defined how to handle this error? Should such info be added?

#42 page 62, D18

Three names for this one concept -- Decomposable, precomposed, composite -- is extremely confusing.
Are we gaining enough with this new term (decomposable) to justify the confusion we're adding for people who know and understand the previous terms? I don't think so.

#43 page 64, D27; Surrogate pair, second bullet

The information about what is not legal in UTF-8 seems out of place. Also, similar information is within D28a.

#44 page 65, D28a; third bullet

Has SJIS already been spelled out?

#45 page 65, D28b; third bullet

"...it may be necessary to use a code unit sequence (of more than one unit) to represent..." Doesn't the fact that it's a code unit *sequence* mean it is more than one unit? IOW, why is the parenthetical phrase necessary?

#46 page 65, D28b; third bullet

This bullet gives an example of SJIS when describing how encoded characters can span multiple code units. Wouldn't it be more relevant to have an example of UTF-8 or UTF-16, which also have encoded characters that span multiple code units?

#47 page 66, top bullet

"The mapping of the set of Unicode scalar values to the set of code unit sequences for a Unicode encoding form is not `onto'..." I don't understand the text in this bullet. Rephrase?

#48 page 66, D29a, second bullet

"Code units of different Unicode encoding forms must not be mixed in a single Unicode string." Wouldn't it be clearer to say "A single Unicode string must contain code units from a single Unicode encoding form. It is not permissible to mix forms within a string."?

#49 page 66, D29b, D29c, D29d, D30b, D30c, D30d

Are all these sub-definitions necessary? Are they ever used again in the book? If not -- or if they're only used once or twice -- they should be removed.

#50 page 66, D30e

What is this defining? There is no term listed.

#51 page 67, Table 3-3 (and Tables 3-7, 3-8)

The title is "Summary of Unicode Encoding Forms", but this and other tables really give examples of values in different encoding forms. A summary should give broad knowledge; these give specific examples.
I recommend changing the title (and intro text) to "Examples of Unicode Encoding Forms".

#52 page 69, D36; second bullet from top of page

"Before the Unicode Standard, V3.1, the problematic "non-shortest form" byte sequences in UTF-8 were those where BMP characters could be represented in more than one way." This is not quite accurate. The problem was not BMP *characters*; it was the surrogate code points within the BMP.

#53 page 70, D39; third bullet

The sentence beginning "Its usage..." contains a double negative. How about "Its usage at the beginning of a UTF-8 data stream is neither required nor recommended by the Unicode Standard, but its presence does not affect conformance to the UTF-8 encoding scheme."

#54 (EDITORIAL) page 73, Section 3.11, second paragraph

Two consecutive sentences begin "In the Unicode standard..."

#55 page 74, section "Application of Combining Marks"

Here is the first use I've seen since the beginning of Chapter 2 of the term "grapheme cluster." Either it should be incorporated much more into the text, or the term should be removed. I favor the latter.
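
For readers weighing #52, here is a minimal sketch of the two kinds of byte sequence being distinguished there: a non-shortest-form encoding of a BMP character, and the three-byte UTF-8 pattern applied to a surrogate code point. It relies on the strict UTF-8 decoder built into Python 3, which is an assumption about one present-day implementation and not text from the draft. The helper name is hypothetical.

```python
def decodes_as_strict_utf8(data):
    """Return True if Python's strict UTF-8 decoder accepts data."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


SHORTEST_SLASH = b"\x2f"              # U+002F in its one-byte form
OVERLONG_SLASH = b"\xc0\xaf"          # U+002F in a two-byte, non-shortest form
ENCODED_SURROGATE = b"\xed\xa0\x80"   # three-byte pattern applied to U+D800

for name, data in [
    ("shortest form", SHORTEST_SLASH),
    ("non-shortest form", OVERLONG_SLASH),
    ("encoded surrogate", ENCODED_SURROGATE),
]:
    print(name, decodes_as_strict_utf8(data))
```

Only the first sequence is accepted by this decoder; the other two are rejected, each for a different reason.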