L2/02-384
Subject: Comments on Unicode 4.0 draft chapters
From: Sandra Martin O'Donnell
      Hewlett-Packard Company
Date: Oct 31, 2002

*********
CHAPTER 2
*********

#1  Page 12, third paragraph from the bottom.
The paragraph discusses "grapheme clusters," and points readers to
UAX #29 (which, by the way, currently is a DUTR, not an approved standard
annex). Given the confusion about characters, text elements, glyphs, etc.,
if this term really is to be used for what "a user thinks of as a `character'",
then it needs to be in the book (Chapter 3?), rather than in an annex.

#2  Page 12, third paragraph from the bottom.
Illustrating the terminology confusion, the paragraph states that
"Figure 2-1 illustrates the relationship between abstract characters
and grapheme clusters", but Figure 2-1 is titled "Text elements and
characters." I assume "grapheme clusters" is supposed to equate to characters,
but then what do "abstract characters" equate to? They are not the text elements
in the figure. For example, I don't think anyone believes the word "cat" is
an abstract character, but it can be a text element. 

There is a lot of confusion about grapheme clusters. I recommend restoring
the original terminology used in Figure 2-1 (Text elements and characters).

#3  (NIT) page 13, next to last bullet on page, final sentence
Most other references in the book point to a specific section rather than
a section *and also* a chapter. To improve consistency, it should be
"See Section 8.2, Arabic, and Section 9.1, Devanagari, for detailed
examples of this situation" rather than "See Section 8.2, Arabic, in
Chapter 8, Middle Eastern Scripts, and Section 9.1, Devanagari in
Chapter 9, South Asian Scripts, for detailed examples of this situation".

#4  page 15, third paragraph under "Universality".
Remove this paragraph. It discusses implementation issues that are
not always relevant to a Unicode-enabled application. Many technologies
need to continue to support non-Unicode encodings, and find that a
code set independent design is very efficient for handling Unicode and
non-Unicode. Also, Unicode itself has multiple encoding forms, so an
implementation must know which form it is handling, and perhaps
late-bind on that form.

These implementation-specific issues are not appropriate in the
discussion of Unicode's universal repertoire.

#5  page 15, "Characters, Not Glyphs" section
Just a note that this and all other sections in the rest of this chapter
(and Chapter 3) only discuss "characters", not "grapheme clusters". If
there is a need for the new term, it should find its way elsewhere into the
book. The fact that it has not found its way in is an indication that it
is not needed, or is confusing.

#6  page 20, "Decompositions" section
This is not in the list of Unicode design principles, but it is at the
same level (in terms of heading size) as the other principles. Is this
a principle? If so, it needs to be added to the list. If not, it needs
to be demoted, headings-wise, in the text.

#7  page 21, Figure 2-6.
I assume these are known glyph errors for the combining characters.

#8  page 22, first sentence in "Compatibility Characters" section
Simpler wording would be "Compatibility characters are those that would not
have been encoded because they are in some sense variants of characters that
already have encodings in the Unicode Standard." The parenthetical phrases
are unnecessary, and there seems no need to introduce and use the term
"normal" for non-compatibility characters.

#9  page 23, first full paragraph beginning "In the past..."
While this information is correct, I wonder how many non-experts will
glean anything from it? Are we writing the book for gurus, and therefore
need maximum precision, or for average developers, who need clear English?

#10  page 23, last paragraph; 2nd sentence
The text is unclear to me. It states: "Note that some abstract characters
may be associated with more than one character (that is, be encoded
"twice")." Should that read "...with more than one encoded character..."?

#11  page 24, Figure 2-7
I assume some of the arrows should be solid, rather than all being hollow.

#12  page 24, paragraph beginning "When referring..."; 2nd sentence
Should that be "Encoded characters can be referred to by their code
point only, but to prevent ambiguity..." rather than the current "Encoded
characters can also be referred to by their code point, but to prevent
ambiguity..."

#13  page 26, 2nd paragraph in section "Encoding Forms"
The text that says "...precisely-defined encoding forms specify how each
integer (code point) for a Unicode character is to be expressed as a sequence
of one or more code units." This is clear, but Chapter 3 still defines a
Unicode scalar value. What is the difference between "code point", as described
here, and "Unicode scalar value"?

#14  page 28, first bullet
This seems out of place. Should it be removed?

#15  page 28, "Encoding Schemes" subsection
The terminology for encoding forms and encoding schemes is SOOOO close
that they are easily confused. Here are some suggestions:

+  Instead of defining 7 encoding schemes, some of which have names
that are identical to existing encoding forms, the forms could continue
to exist, and information about serializing the forms could simply be
added that describes how these forms are used. Thus, the description
of UTF-16 could add information about how these are serialized on
big- and little-endian architectures, and how the BOM is handled/recognized.
The same would be done for UTF-32.

Okay, I hear the howls now...so, Alternative 2 is:

+  Change the name from "encoding scheme" to "serialization scheme".
This would alleviate the confusion between "encoding form" and "encoding
scheme". Earlier proposals to call these things CEF and CES (Character
Encoding Form, and Character Encoding Scheme, respectively) suffer from
the same confusion as the existing terms. The "encoding schemes" have to
do with the way bytes are serialized on computer systems; they have little
to do with encoding.
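
To make the distinction concrete, here is a rough Python sketch (my own
illustration, not text from the draft; the helper name utf16_code_units is
made up). It shows one UTF-16 code unit sequence and the two byte
serializations of it, which is all the "schemes" add:

```python
import struct

def utf16_code_units(text):
    """Return the UTF-16 code units (16-bit integers) for text."""
    raw = text.encode("utf-16-be")
    return list(struct.unpack(">%dH" % (len(raw) // 2), raw))

sample = "A\U00010330"   # U+0041 plus U+10330 GOTHIC LETTER AHSA

# Encoding form: a sequence of 16-bit code units, no byte order involved.
units = utf16_code_units(sample)

# Serialization: the same code units written out as bytes, either way round.
as_be = sample.encode("utf-16-be")
as_le = sample.encode("utf-16-le")

print([hex(u) for u in units])
print(as_be.hex())
print(as_le.hex())
```

The code units are the same in both cases; only the order of the bytes
within each unit changes.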

#16  page 29, first full paragraph
The text begins "Note that some of the Unicode encoding schemes have
the same labels as the three Unicode encoding forms." This is further
evidence, IMO, that we either need to remove this extra distinction (my
first preference), or find names that are not so easily confused.

#17  page 29, paragraph below Figure 2-11
"In Figure 2-11, the columns labeled "Serialized" shows..." There is no
column with that label in the figure.

#18  page 30, 3rd paragraph in UTF-32 section
"The value of each UTF-32 code unit corresponds exactly to the Unicode
code point value." Regarding my earlier comment about the difference between
"code point" and "Unicode scalar value", here's an example where it would
seem logical to use "Unicode scalar value" as it's defined in Ch. 3. Do
the two terms differ? Do we need the separate terms? 
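
If the intended difference is the one I suspect (a scalar value is any code
point except the surrogate code points D800..DFFF), it fits in a few lines
of Python. This is a sketch under that assumption; the function names are
mine, not the draft's:

```python
def is_code_point(n):
    """Any integer in the Unicode codespace 0..10FFFF."""
    return 0x0000 <= n <= 0x10FFFF

def is_scalar_value(n):
    """Assumed definition: a code point that is not a surrogate (D800..DFFF)."""
    return is_code_point(n) and not (0xD800 <= n <= 0xDFFF)

for n in (0x0041, 0xD800, 0xDFFF, 0x10FFFF, 0x110000):
    print(hex(n), is_code_point(n), is_scalar_value(n))
```

If that is indeed the distinction, the text should say so wherever the two
terms appear close together.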

#19  page 32, "Comparison of the Advantages of UTF-32, UTF-16, and UTF-8"
Gee, Dad, so I guess you've always liked UTF-16 best, right?

IMO, this section is biased toward UTF-16 and against UTF-32. Since all
encoding forms are co-equal within Unicode, the text should be more
evenly balanced. The text currently says, "UTF-16 is the internal
processing code of choice for a majority of implementations supporting
Unicode." I know that the majority of *Unix* implementations support
Unicode via UTF-32 (e.g., Solaris, Tru64 Unix, Linux, and HP-UX). Is it
really true that UTF-16 is in the majority? Even if it is, is that relevant?

The text talks about pros and cons with respect to memory and disk space
consumption, and for those considerations, UTF-16 has clear advantages.
But it gives short shrift to the kind of code one has to write to include
all the checks for first-of-two, and the costs associated with having
to add and maintain such checks. Even if your *data* has no surrogate
pairs, the code still needs to be able to process them.

This section needs to be more even-handed WRT the pros/cons of UTF-16
and UTF-32 than it currently is.
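
As an illustration of the coding cost I mean, here is a minimal Python
sketch that counts code points in each form (my own example; the function
names and sample data are hypothetical, and the code units are modeled as
plain lists of integers):

```python
def count_code_points_utf16(units):
    """Count code points in a list of UTF-16 code units.

    An unpaired surrogate is counted as one code point rather than rejected.
    """
    count = 0
    i = 0
    while i < len(units):
        is_high = 0xD800 <= units[i] <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        # A high surrogate followed by a low surrogate is one code point.
        i += 2 if (is_high and has_low) else 1
        count += 1
    return count

def count_code_points_utf32(units):
    """In UTF-32 every code unit is one code point."""
    return len(units)

utf16 = [0x0041, 0xD800, 0xDF30, 0x0042]   # "A", U+10330, "B"
utf32 = [0x0041, 0x10330, 0x0042]
print(count_code_points_utf16(utf16), count_code_points_utf32(utf32))
```

The UTF-16 version needs the first-of-two check on every unit, whether or
not the data ever contains a surrogate pair.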

#20  page 33, Section 2.6 "Unicode Strings"
This section seems to be more about UTF-16 strings than it is about
generic Unicode strings, as the heading indicates. Either the heading
name should change, or the text should be made more general.

#21  (EDITORIAL) page 34, 3rd paragraph
Instead of the multiple parenthetical phrases, it would read more
smoothly as "The Supplementary Multilingual Plane (SMP, or Plane 1) is
dedicated to the encoding of lesser-used historical scripts, special-purpose
invented scripts, and special notational systems which either could not fit
into the BMP or which would be of very infrequent usage. Examples of each
type include Gothic, Shavian, and musical symbols, respectively." 

Later in the same paragraph, "While few scripts are currently
encoded into the SMP in Unicode 4.0, there are many major and minor
historical scripts do not yet have..." Remove the words "there are"
in this sentence.

#22  page 44, Section 2.8 "Writing Direction"; 3rd paragraph
"East Asian scripts are frequently written in vertical lines that run
from top to bottom...Most characters have the same shape and orientation
when displayed horizontally or vertically..." The text first says they're
written vertically, then it describes what happens when they're displayed
either way. How about, "East Asian scripts are frequently written in 
vertical lines that run from top to bottom, right to left. Such scripts
may also be written horizontally, left to right. Most characters have the
same shape and orientation when displayed either horizontally or vertically..."

*********
CHAPTER 3
*********

#23  page 49, third paragraph (and affecting other sections in the chapter)
What is the rationale for having the numbering of rules and definitions
match that of previous versions of the standard? Does that rationale
still make sense given that in V4.0, C1, C2, and C3 all have been
superseded, which makes the beginning of the conformance section look
odd? Does the rationale still make sense given that some definitions have
changed a lot (e.g., consider V3.0's D10 Mirrored property, D10a Case
property, and D11 Special character properties vs. V4.0's D10 Property
alias, D10a Property value alias, and D11 Default property value)?

#24  page 50, References to the Unicode Standard section
The section seems backward. Instead of giving specific references to 
properties, shouldn't the section begin with the generic Unicode Standard
info, and then add on the info about properties?

#25  page 52, C8; last sentence of final bullet
The sentence "In real life, any system may occasionally receive an
unfamiliar character code that it is unable to interpret" seems out of
place in this context. Remove?

#26  page 52, C9; first bullet
Provide an example of when implementations may want to distinguish
canonical-equivalent sequences.

#27  page 53, C10; second bullet from top
"Changing the bit or byte ordering when transforming between different
machine architectures..." Should that be "Changing the byte ordering..."?
When would you be changing bits in a transformation between architectures?
Bits will change when transforming between encoding forms, of course.

#28  page 53, C10; last bullet
"If a noncharacter which does not have a specific internal use..."
Are there any noncharacters that do not have specific internal uses? I
thought they had all been reserved for special purposes.

#29  page 53, C11
"When a process interprets a code unit sequence which purports to be in a
Unicode character encoding form, it shall interpret that code unit sequence
according to the corresponding code point sequence."
Huh? I don't know what this is trying to say.

#30  page 53, C12a; second bullet
"...However, the conformance clauses do not prevent processes from operating
on code unit sequences that do not purport to be in a Unicode character
encoding form." Is this needed? If it isn't Unicode, and doesn't "purport"
to be, why would anyone think there are conformance issues?

#31  (EDITORIAL), page 53, C12a, final bullet
Two consecutive sentences that begin "For example..."

#32  page 54, C12b; first bullet
"...when using UTF-16LE,...any initial <FF FE> sequence is interpreted as
U+FEFF ZERO WIDTH NO-BREAK SPACE...rather than as a byte order mark..."
What is the rationale for interpreting it this way rather than as an error?
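
One data point, offered as an observation rather than a rationale: Python's
standard codecs behave the way the draft describes. The sketch below is
mine, and the four sample bytes are made up for illustration:

```python
data = b"\xff\xfe\x41\x00"   # bytes FF FE 41 00

as_utf16le = data.decode("utf-16-le")   # byte order fixed by the label
as_utf16 = data.decode("utf-16")        # byte order taken from the BOM

print([hex(ord(c)) for c in as_utf16le])
print([hex(ord(c)) for c in as_utf16])
```

With the "utf-16-le" label the initial FF FE comes back as U+FEFF in the
text; with the "utf-16" label it is consumed as a byte order mark.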

#33  page 54, C12b; second bullet
The explanation of endianness seems out-of-place in this conformance clause.
Move elsewhere?

#34  page 56, Section 3.4 "Characters and Encoding"
Earlier, conformance clause C2 says there are requirements for "code
units" and that they formerly were known as "code values." But this
section defines "code point" (aka "code position"), "encoded character"
(aka "coded character"), and others. Where is "code unit"? Oh wait,
I found it at D28a. But it seems there should be a cross-reference between
this section and 3.9 (Unicode Encoding Forms) where "code unit" is
defined.

#35  page 56, Section 3.4 "Characters and Encoding"
Where is the definition of "character"? The term is used throughout the
book ("grapheme cluster" notwithstanding :-) ).

#36  page 57, D5, second bullet from top of page
This bullet notes that a single abstract character may have been encoded
two different ways, but shouldn't it also note that this is very rare
and for compatibility with other encodings? As written, this leaves
the impression that double-encodings may be more common than they are.

#37  page 57, D5, third bullet from top of page
"A single abstract character may also be represented by a sequence of
code points -- for example, latin capital g with acute may be represented
by the sequence U+0047,...U+0301..."
Is this one "encoded character" as D5 is defining it, or two? The
definition says it is a mapping "between *an* abstract character and *a*
code point" (emphasis added), implying that the abstract character
represented in the example is two "encoded characters". Is that right? If
so, how does "encoded character" differ from "code point"? If not, why
does the definition of "encoded character" talk about "*a* code point"?
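
The case in question can be reproduced with a short Python sketch (my own
illustration, assuming the standard unicodedata module; it does not settle
what D5 intends):

```python
import unicodedata

sequence = "\u0047\u0301"   # LATIN CAPITAL LETTER G + COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", sequence)

# The sequence is two code points; its composed form is one.
print(len(sequence), [unicodedata.name(c) for c in sequence])
print(len(composed), [unicodedata.name(c) for c in composed])
```

So the same abstract character shows up as two code points in one string
and as a single code point in the other, which is exactly why the wording
of D5 matters.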

#38  page 57, D6
The definition is for "coded character representation" and it notes that
it is also known as a "coded character sequence". Later on the page, it
notes that "Similarly, the term `character sequence' alone designates
a `coded character sequence'." Why is the nickname referring to the secondary
name for this term? Or, why isn't "coded character sequence" the primary
name?

#39  page 59 Table 3-1 "Normative Character Properties"
The surrounding text notes that some normative properties also are
immutable. Are the properties in this table also immutable? Should there
be a table of normative and immutable?

#40  page 60, D10 and D10a
Examples of each of these aliases would be helpful. 

#41  page 62, D17a, bullet
"Defective combining character sequences occur when a sequence of combining
characters appears at the start of a string or follows a control or format
character." Should these be rejected as ill-formed, or is it implementation-
defined how to handle this error? Should such info be added?

#42  page 62, D18
Having three names for this one concept -- decomposable, precomposed, composite --
is extremely confusing. Are we gaining enough with this new term
(decomposable) to justify the confusion we're adding for people who
know and understand the previous terms? I don't think so.

#43  page 64, D27; Surrogate pair, second bullet
The information about what is not legal in UTF-8 seems out of place. Also,
similar information is within D28a.

#44  page 65, D28a; third bullet
Has SJIS already been spelled out?

#45  page 65, D28b; third bullet
"...it may be necessary to use a code unit sequence (of more than one unit)
to represent..." Doesn't the fact that it's a code unit *sequence* mean
it is more than one unit? IOW, why is the parenthetical phrase necessary?

#46  page 65, D28b; third bullet
This bullet gives an example of SJIS when describing how encoded characters
can span multiple code units. Wouldn't it be more relevant to have an
example of UTF-8 or UTF-16, which also have encoded characters that span
multiple code units?
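
For instance, a Python sketch along these lines (my own; the helper names
and the choice of sample characters are hypothetical) produces Unicode
examples that could replace the SJIS one:

```python
def utf8_units(ch):
    """Return the UTF-8 code units (bytes) for one character."""
    return list(ch.encode("utf-8"))

def utf16_units(ch):
    """Return the UTF-16 code units (16-bit values) for one character."""
    raw = ch.encode("utf-16-be")
    return [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]

# Latin A, Arabic alef, Devanagari ka, Gothic ahsa
for ch in ("\u0041", "\u0627", "\u0915", "\U00010330"):
    print("U+%04X" % ord(ch),
          [hex(u) for u in utf8_units(ch)],
          [hex(u) for u in utf16_units(ch)])
```

The Devanagari letter takes three code units in UTF-8 but one in UTF-16;
the Gothic letter takes four in UTF-8 and two in UTF-16.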

#47  page 66, top bullet
"The mapping of the set of Unicode scalar values to the set of code unit
sequences for a Unicode encoding form is not `onto'..."
I don't understand the text in this bullet. Rephrase?

#48  page 66, D29a, second bullet
"Code units of different Unicode encoding forms must not be mixed in a
single Unicode string." Wouldn't it be clearer to say "A single Unicode
string must contain code units from a single Unicode encoding form. It is
not permissible to mix forms within a string."

#49  page 66, D29b, D29c, D29d, D30b, D30c, D30d
Are all these sub-definitions necessary? Are they ever used again in the
book? If not -- or if they're only used once or twice -- they should be
removed.

#50  page 66, D30e
What is this defining? There is no term listed.

#51  page 67, Table 3-3 (and Tables 3-7, 3-8)
The title is "Summary of Unicode Encoding Forms", but this and other
tables really give examples of values in different encoding forms.
A summary should give broad knowledge; these give specific examples.
I recommend changing the title (and intro text) to "Examples of
Unicode Encoding Forms".

#52  page 69, D36; second bullet from top of page
"Before the Unicode Standard, V3.1, the problematic "non-shortest form"
byte sequences in UTF-8 were those where BMP characters could be
represented in more than one way." This is not quite accurate. The problem
was not BMP *characters*, it was the surrogate code points within the BMP.

#53  page 70, D39; third bullet
The sentence beginning "Its usage..." contains a double negative. How
about "Its usage at the beginning of a UTF-8 data stream is neither
required nor recommended by the Unicode Standard, but its presence does
not affect conformance to the UTF-8 encoding scheme."
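
A short Python sketch of what "its presence" means in practice (my own
example, using Python's "utf-8" and "utf-8-sig" codec labels; the sample
text is made up):

```python
text = "abc"
with_signature = b"\xef\xbb\xbf" + text.encode("utf-8")

kept = with_signature.decode("utf-8")          # signature surfaces as U+FEFF
stripped = with_signature.decode("utf-8-sig")  # signature is consumed

print(repr(kept), repr(stripped))
```

Either way the byte stream decodes without error; the only question is
whether the receiver treats the leading U+FEFF as part of the text.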

#54  (EDITORIAL) page 73, Section 3.11, second paragraph
Two consecutive sentences begin "In the Unicode standard..."

#55  page 74, section "Application of Combining Marks"
Here is the first use I've seen since the beginning of Chapter 2 of the
term "grapheme cluster." Either it should be incorporated much more
into the text, or the term should be removed. I favor the latter.