UTC/1999-004

From: Paul Hoffman / IMC [phoffman@imc.org]

Sent: Thursday, January 28, 1999 8:34 PM

Subject: UTC 3.0 draft comment

Transformations and Serialization

Page 1, sixth paragraph

UTF-8 isn't really an encoding form: it is a transformation. Thus, I think that either the whole paragraph should be eliminated (transformations aren't really necessary this early in the book), or it needs to be updated to say "Unicode provides for four ways to serialize characters:" followed by the four transformations. I prefer eliminating the paragraph.

Page 16, Section 2.3 Encoding Forms

There is sure to be confusion between the names of the encoding forms here and the names of the IANA charsets, given that they are identical. I suggest adding the following paragraph after the first paragraph of this section: "The names of the Unicode encoding forms are the same as those used in other contexts. Notably, the IANA maintains a well-respected registry of <ital>charset names</ital> which include the same names as the encoding forms used here. Although the encoding forms here and the descriptions of the charsets registered with IANA are very similar, there are important differences between the requirements for each."

Page 16, UTF-16

There is no description of UTF-16BE or UTF-16LE here. They very much should be here, and the last sentence of the preceding paragraph should be updated to say "...uses four encoding forms...".

Page 16, UTF-8

The statement "UTF-8 is fully conformant with the Unicode Standard and ISO/IEC 10646" is technically true but may be misleading to an implementor. The definition of UTF-8 given in Table 2-2 only covers codes that can be represented in Unicode; however, I have been told that 10646's definition of UTF-8 shows how to code for any character in UCS-4. So, Unicode's UTF-8 is conformant, but it is a subset. I propose either (a) changing this sentence to "The UTF-8 defined here is fully conformant with the Unicode Standard and is a subset of the UTF-8 defined in ISO/IEC 10646" or (b) adding a sentence to the end of the third paragraph of the UTF-8 section that says "Note that the definition of UTF-8 in ISO/IEC 10646 can encode characters that are outside the range of Unicode characters."
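
To illustrate the subset relationship, here is a minimal Python sketch of how many bytes the UTF-8 layout needs for a given value. It assumes the byte-length boundaries of the original ISO/IEC 10646 definition (31-bit UCS-4 values, up to six bytes) and takes 0x10FFFF, the highest value reachable through surrogate pairs, as the top of the Unicode range. The names `utf8_length_10646` and `in_unicode_range` are hypothetical and appear in neither standard.

```python
# Sketch only: byte counts follow the original ISO/IEC 10646 UTF-8 layout,
# which covers every 31-bit UCS-4 value. All names here are illustrative.

UNICODE_MAX = 0x10FFFF   # highest value reachable through surrogate pairs
UCS4_MAX = 0x7FFFFFFF    # highest 31-bit UCS-4 value

# Largest value that fits in 1, 2, 3, 4, 5 and 6 bytes respectively.
LENGTH_LIMITS = [0x7F, 0x7FF, 0xFFFF, 0x1FFFFF, 0x3FFFFFF, UCS4_MAX]


def utf8_length_10646(value: int) -> int:
    """Return how many bytes the 10646 UTF-8 layout needs for a UCS-4 value."""
    if value < 0 or value > UCS4_MAX:
        raise ValueError("not a 31-bit UCS-4 value")
    for length, limit in enumerate(LENGTH_LIMITS, start=1):
        if value <= limit:
            return length
    raise AssertionError("unreachable: value was range-checked above")


def in_unicode_range(value: int) -> bool:
    """True if the value can be represented in Unicode (assumed max 0x10FFFF)."""
    return 0 <= value <= UNICODE_MAX


if __name__ == "__main__":
    for value in (0x41, 0x7FF, 0xFFFF, 0x10FFFF, 0x200000, UCS4_MAX):
        print(hex(value), utf8_length_10646(value), in_unicode_range(value))
```

Under these assumptions, every value in the Unicode range needs at most four bytes, while the five-byte and six-byte forms are reachable only from values that Unicode cannot represent.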

Page 38, Section 3.8 Transformations

In the first paragraph, the word "other" is out of place, since the section describes all of the representations in the standard.

Page 38, Section 3.8 Transformations

In the paragraph that begins "For example," it says that a UTF-8 conformant process must do something if it comes across an illegal datatype sequence. However, that requirement is not covered in the full description of UTF-8 on page 16. Either the earlier UTF-8 description has to be updated, or this sentence should be changed to say "a UTF-8 conformant process should...".

Page 39, D35

The wording here does not make it clear that text serialized with UTF-16 is not required to have a BOM. I propose the following sentence be added to the end of the definition: "A serialization of Unicode values into UTF-16 may or may not begin with a BOM. If the serialization does not begin with a BOM, the bytes of the serialization must be in Big Endian format." Also, a third example should be added as "or <00 4D 00 61 00 72 00 6B>".
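
The byte values in the proposed example spell "Mark" (4D 61 72 6B). As a minimal sketch, the following Python produces the three serializations that the proposed wording would permit: big-endian without a BOM, big-endian with a BOM, and little-endian with a BOM. The helper name `hex_bytes` is hypothetical; the codec names and BOM constants are those of Python's standard library, not of the Unicode Standard.

```python
import codecs

text = "Mark"

# No BOM: under the proposed wording the bytes must be big-endian.
big_endian_no_bom = text.encode("utf-16-be")

# With a BOM: the BOM announces the byte order of what follows.
big_endian_with_bom = codecs.BOM_UTF16_BE + text.encode("utf-16-be")
little_endian_with_bom = codecs.BOM_UTF16_LE + text.encode("utf-16-le")


def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated upper-case hex pairs."""
    return " ".join(f"{byte:02X}" for byte in data)


if __name__ == "__main__":
    print(hex_bytes(big_endian_no_bom))       # 00 4D 00 61 00 72 00 6B
    print(hex_bytes(big_endian_with_bom))     # FE FF 00 4D 00 61 00 72 00 6B
    print(hex_bytes(little_endian_with_bom))  # FF FE 4D 00 61 00 72 00 6B 00
```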

Page 296, UTF-8

Because this section correctly points out that a 10646 UTF-8 value might have six bytes, I think it would be good to add a note to the third paragraph that says: "UTF-8 as defined in ISO/IEC 10646 can encode characters outside the range that can be encoded in UTF-8 as defined by Unicode."

Surrogates

Page 2-3, Section 1.1 Coverage, last paragraph:

This is the only place we call surrogates an "extension mechanism". Also, the reference to UTF-16 is out of place. I would change this to: "There are ... code values through the use of surrogate pairs. Surrogate pairs make another 131,072..."

Page 37, D27, second bullet

The word "rare" is out of place. They are not rare: they don't exist. But, when they start to exist in the future, there might be thousands of them. I would simply remove "rare" from the first sentence. Also, because most of us are very sure that there will be non-BMP characters in the future, I would add to the end of the second sentence ", but it is widely expected that such characters will be assigned in the not-distant future."

Page 89, Section 5.4 Handling Surrogate Characters

In the first sentence, there is the same "rare" problem as before. In the second sentence, the assumption that "only infrequently used characters will be assigned..." is, from the private opinions I have heard from many people, not a widely-held opinion. That clause should be struck and the sentence left at "Vendors may choose to support or not support surrogate pairs based on market conditions."

Page 89, Section 5.4 Handling Surrogate Characters

The last sentence of the first paragraph is almost begging implementors not to support surrogate pairs. Regardless of what we think about out-of-BMP characters, I believe that this kind of editorial urging (particularly the italics) is out of place in this standard. I haven't come across anyone who thinks that there won't be non-BMP characters assigned within five years. The entire sentence should be deleted. (My preference would be to add a sentence that says "Because it is widely believed that characters will be assigned outside the 16-bit range currently used for all Unicode characters, implementors should strongly consider being prepared for such characters by implementing full support for surrogate pairs." Maybe in 3.1...)
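
For a sense of what supporting surrogate pairs involves at the lowest level, here is a minimal Python sketch of the arithmetic that maps a value above the 16-bit range to a high/low surrogate pair and back. The function names are hypothetical; the ranges D800..DBFF (high) and DC00..DFFF (low) are those of the surrogates area. Full support also involves text processing concerns that this sketch does not attempt.

```python
# Sketch of surrogate-pair arithmetic; function names are illustrative only.

HIGH_START, HIGH_END = 0xD800, 0xDBFF
LOW_START, LOW_END = 0xDC00, 0xDFFF
FIRST_NON_BMP, LAST_NON_BMP = 0x10000, 0x10FFFF


def to_surrogate_pair(value: int) -> tuple[int, int]:
    """Split a value in 0x10000..0x10FFFF into (high, low) 16-bit surrogates."""
    if not FIRST_NON_BMP <= value <= LAST_NON_BMP:
        raise ValueError("value is not in the range covered by surrogate pairs")
    offset = value - FIRST_NON_BMP            # 20-bit offset
    high = HIGH_START + (offset >> 10)        # top 10 bits
    low = LOW_START + (offset & 0x3FF)        # bottom 10 bits
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Combine a (high, low) surrogate pair back into a single value."""
    if not (HIGH_START <= high <= HIGH_END and LOW_START <= low <= LOW_END):
        raise ValueError("not a well-formed surrogate pair")
    return FIRST_NON_BMP + ((high - HIGH_START) << 10) + (low - LOW_START)


if __name__ == "__main__":
    high, low = to_surrogate_pair(0x10000)
    print(f"{high:04X} {low:04X}")  # D800 DC00
```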

Page 275, Section 13.4 Surrogates Area

In the first paragraph, there is the same set of problems (the use of "rare", the italics, the lack of acknowledgment that they are coming) as earlier.

BOM

Page 277, Byte Order

The third paragraph ("Systems that employ...") has very dangerous advice. Filtering out BOMs without knowing whether or not another process needs to know the exact characters in the stream (such as for digital signatures) is sure to cause errors. I suggest adding the following after the first sentence: "Note that some systems must retain all of the characters in the original stream; in these cases, byte order marks should never be removed." Also, in the last sentence of this paragraph, I would strengthen it by changing "should not" to "must not".
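
The risk to signed data can be shown with a minimal Python sketch. It uses a SHA-256 digest as a stand-in for a digital signature, which is an assumption made for illustration: any scheme that covers the exact bytes of the stream behaves the same way. The stream contents and the name `digest` are hypothetical.

```python
import codecs
import hashlib


def digest(data: bytes) -> str:
    """Hex digest over the exact bytes; stands in for a signature check."""
    return hashlib.sha256(data).hexdigest()


# The stream as the sender produced (and signed) it, BOM included.
original = codecs.BOM_UTF16_BE + "Mark".encode("utf-16-be")

# What a receiver sees if an intermediate system silently filters the BOM.
filtered = original[len(codecs.BOM_UTF16_BE):]

if __name__ == "__main__":
    print(digest(original) == digest(filtered))  # False
```

The text content is unchanged, but the byte stream is not, so verification against the sender's digest fails.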

Page 277, Byte Order

The fourth paragraph ("To represent...") has a technically incorrect statement. The third sentence says "...a byte order mark is never used or needed...". We have no idea whether or not a byte order mark will be used; they are used in the XML standard as a heuristic to find the charset. I think "used or" should be removed.

Paul Hoffman, Director
Internet Mail Consortium