L2/02-391
From: Markus Scherer
Date: 2002-11-01
Subject: Markus comments on TUS4 ch 1, 2, and 3

-------------------------------------------------------------------------
02342-a-ch1utc.pdf

General problem:
Acrobat Reader 5.0 allows me to select but not copy text. Right-clicking shows "copy" as an option, but disabled. This makes it harder to quote for review. Please change the document generation to fix this.

Page 1:
"TUS 4 is code-for-code identical with ISO/IEC 10646..."
-> Should either have version/edition numbers for both standards (TUS 4.0 - 10646 ed 2), or for neither (TUS - 10646 both generic).
-> Note that the related statement on page 6 (fully compatible with...) may need to be updated by the time TUS 4.0 is out.

-------------------------------------------------------------------------
02342-b-ch2utc.pdf - PDF page numbers, not book page numbers.
I read PDF pages 1..25=book pages 11..35.

Reading the beginning of chapter 2 I just got an idea that may fit here or in chapter 1: The standard text should point very early to the web site for errata, updated UAXes/UTRs and generally fresh information.

Page 3/fig 2-1:
It seems a little strange or disconnected to show "ch" as a Spanish grapheme:
1. The preceding text discusses "ll" for Spanish and "ck" for German.
2. "ch" is more of a Czech example, isn't it?
Also, figure 2-1 still talks about "grapheme" while the text has been updated to "grapheme cluster". (Not an improvement in my personal opinion, but I am sure there was a good reason. I skipped most of that discussion.)

Page 5 Universality:
"... the Unicode approach assumes that characters and their [codes and] properties are inherently and inextricably associated."
-> suggest to insert what I put in brackets above. Otherwise there is a disconnect with the discussion, which explained that the old approach had a late binding between bytes (=codes) and properties. The contrasting statement should therefore also relate codes (code points) to properties.

Page 7/fig 2-3:
The sixth character is not numbered.

Page 12..13 Compatibility Decomposable Characters and Mapping Compatibility Characters:
uses the term "composite", should be "decomposable", right?

Page 13 2.4 Code Points and Characters:
- Why is "codespace" one word but "code point" two?
- "Note that some abstract characters may be associated with more than one character ..."
  -> suggest to change as follows to distinguish better from the following sentence:
  "Note that some abstract characters may be associated with multiple, separately encoded characters ..."
- "The hollow arrow shows a case where an encoded character sequence represents an abstract character, but does [not ]directly encode it."
  -> 1. [not ] is missing
     2. Figure 2-7 does not use solid vs. hollow arrows but an additional box around the two encoded characters.

Page 14 Control Codes:
Not just TAB has meaning (distinct properties), but also at least LF, FF, NEL - and page 16 Code Point Semantics even says so.

Page 15 Noncharacters:
"U+FFFF is reserved for internal use (as a sentinel) and should not be transmitted or stored as part of plain text."
-> All 66 noncharacters should not be transmitted or stored outside of a processing context. Why single out FFFF? This paragraph should be rewritten to bring them all onto an equal footing. Alternatively, since there is a discussion of restricted interchange following almost immediately, this separate paragraph could be merged into that (basically just adding references to 15.8).

Page 15 Restricted Interchange:
"Reserved code points ... cannot be interchanged."
-> This is wrong, and immediately contradicted by the following explanation.

Page 16 Non-overlap:
"That means that when someone searchs for the character "D", for example, ..."
-> misspelling, replace with "searches" (missing 'e')

Page 17, just below figure 2-9:
"... because none of the ... code units ... overlap."
-> Do units overlap?
Suggest to replace "code units" with "code unit ranges" or similar.
Note that this explanation is only true for well-formed text. Malformed UTF-8/16 text may yield ambiguous search results, but at least the non-overlap design means that even then one has to inspect only a small number of adjacent code units to disambiguate.

"For example, when randomly accessing a string ..." uses the term "low-surrogate". I find it confusing and suggest to generally replace the terms high/low-surrogate with lead/trail surrogate.

Page 18 Conformance:
"(bullet) The dotted arrow illustrates a ..."
-> This bulleted paragraph does not refer to anything around here. Remove.

Page 19 footnote:
"Use of a BOM is neither required nor recommended ..."
-> Should at least point to some later(!) discussion about signature byte sequences, where the "BOM" is used to distinguish Unicode-related charsets, not just the endianness for one or two of them.

Page 19/fig 2-11:
Suggest to show the bytes separately, not in 4/8-digit clusters. Such clusters confuse the distinction from encoding forms with their 16/32-bit code units.

"In Figure 2-11, the columns labeled "Serialized" show how..."
-> There are no columns in fig. 2-11! Rephrase/remove.

There is an important point missing in this discussion, and it caused an inefficient encoding of Unicode text in CORBA:
It needs to be said explicitly that a "serialization" into a format that allows "native" 16/32-bit integer values needs no BOM and should not use one.
For example, in CORBA, data gets "serialized" into a message packet format that can contain 8/16/32-bit integers, floating point numbers, structs, vectors, etc. Only at a lower level does CORBA then byte-serialize such packets, and there it has one byte order mark for the entire message packet.
This was not clear to the CORBA standards group members.
They changed the representation of 16/32-bit Unicode from vectors of 16/32-bit units to byte vectors with UTF-16/32 encoding *scheme* semantics including BOM. This means that a byte-serialized CORBA message packet may now contain UTF-16/32 text in the *opposite* endianness of its surrounding message!
I can provide details if necessary.

Page 20 UTF-32:
Suggest to mention that there are really two reasons for defining and using UTF-32: [reading on, page 22 covers this]
1. It is such an obvious encoding form.
2. It allows to shoehorn Unicode into wchar_t/C stdlib API functions that require that string base units and single-character types are the same.

Page 21 UTF-8 says "variable-length" while UTF-16 said "variable-width"

Page 23 Unicode Allocation:
"Grouping encoded characters by script ... as for conversion tables, character property tables, or fonts."
-> "conversion tables" seems odd. Suggest "conversion schemes" or similar.

Page 24:
"... there are many major and minor historic scripts [that ]do not yet have ..."
-> missing word, add above-bracketed [that ]

"The Supplementary Special-purpose Plane (SPP, or Plane 14) ..."
-> Titlecase "Purpose" because it is the middle 'P' in "SPP"

Areas and Blocks:
"... to divide up the code charts and do not necessary imply anything else about ..."
-> "necessarily" ("-ily")

-------------------------------------------------------------------------
02342-c-ch3utc.pdf - PDF page numbers, not book page numbers.
I read PDF pages 1..32=book pages 49..80.

Page 4, C6:
C6 is slightly contradicted - usefully - by the assignment of non-trivial properties to some unassigned code points. For example, some unassigned code points have Bidi properties R or AL, while almost all others have L. Such default properties can help processing, in an implementation of some Unicode version, of characters that may be assigned to these code points in future versions of the standard. This is well worth mentioning.
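
To make the C6 point concrete, here is a minimal sketch of how an implementation might fall back to default Bidi classes for unassigned code points. It assumes Python's standard unicodedata module, which returns an empty string when no Bidi class is recorded. The helper name bidi_class_with_default and the two-entry range table are hypothetical and only an illustrative subset; the authoritative defaults are in the UCD.

```python
import unicodedata

# Illustrative subset of default Bidi ranges for unassigned code points.
# Assumption: only two ranges are listed here; the full list is in the UCD.
DEFAULT_BIDI_RANGES = [
    (0x0590, 0x05FF, "R"),   # Hebrew block
    (0x0600, 0x07BF, "AL"),  # Arabic, Syriac, Thaana area
]


def bidi_class_with_default(cp: int) -> str:
    """Return the Bidi class of cp, using a default if none is recorded."""
    value = unicodedata.bidirectional(chr(cp))
    if value:
        # The code point has a Bidi class in this library's Unicode version.
        return value
    for first, last, default in DEFAULT_BIDI_RANGES:
        if first <= cp <= last:
            return default
    # Simplification: treat everything else as L ("almost all others have L").
    return "L"


if __name__ == "__main__":
    # U+05D0 and U+0627 are assigned letters; U+05FF and U+07BF are
    # assumed to be unassigned in the Unicode version shipped with Python.
    for cp in (0x0041, 0x05D0, 0x0627, 0x05FF, 0x07BF):
        print(f"U+{cp:04X} {bidi_class_with_default(cp)}")
```

With such a fallback, text containing a character from a future version of the standard would still be laid out in the direction of its surrounding script.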

Page 4, C10:
Suggest to put the "if that process purports..." clause first to make it easier to understand.
Also: Should allow to remove malformed sequences (e.g., unpaired surrogates [which are code points], non-shortest forms).

Page 5, C12a:
The first bullet prescribes at which code unit to restart decoding. This may or may not be a good idea depending on whether a byte is missing or whether the trail byte was modified in transmission (could be single-bit error). It also does not seem productive to prescribe exact behavior after a point of error. Existing implementations differ in these details and should not be (even suggested to be) non-conformant because of different error and resynchronization behaviors.

Page 7 Unicode Standard Annexes:
Why are 14 & 29 (line break vs. text boundaries) not rolled into one? They seem co-dependent.

Page 8 D3:
third bullet uses "grapheme" - should be "grapheme cluster"?

Page 10 D8a:
second bullet "This allows rooms for ..."
-> make "room" singular

Page 11 D9:
"...overridable by higher-level protocols, because their intent is [to ]provide a common basis..."
-> insert missing [to ]

Page 16, 3.8 Surrogates:
As I said in my comments on chapter 2, I find the terms "high-surrogate" and "low-surrogate" confusing myself, and find that it confuses others, too. I suggest to replace these terms with "lead surrogate" and "trail surrogate", respectively.
I now see that D27 shows similar terms as alternatives. I suggest to replace the confusing ones as above.

Page 16, 3.9 Unicode Encoding Forms:
"Each encoding form maps a defined range of Unicode code points to code unit sequences."
-> "a defined range" immediately will lead a reader to ask "why not all?". I suggest to rephrase as
"Each encoding form maps the Unicode code points U+0000..U+D7FF and U+E000..U+10FFFF to unique code unit sequences."

Page 18 etc.:
Bravo! Nice definition of terms.
It might be worth mentioning that all 16-bit values are allowed in Unicode 16-bit strings, but not all 8/32-bit values are allowed in Unicode 8/32-bit strings.

Page 20, especially D31 2nd bullet:
Talking about code unit values greater than some value leaves a door open for signed integers below 0, especially for UTF-32. Suggest to say "code unit values outside of the range 0..10FFFF are ill-formed."
(For UTF-8/16 the interpretation must implicitly be of unsigned integers. May be worth saying.)

Page 20..21, D36:
The code unit sequence is wrong. The lowest well-formed 4-byte sequence is F0 90 80 80, and the first trail byte here is 80. Probably only this one byte is wrong (in all instances in the entire chapter!), but better check the whole sequence.

I just noticed that the UTF-8 byte sequence for U+0430 is wrong, too. It is shown as D1 90 but must be D0 B0. This is wrong in all instances - many in chapter 3!

Page 28 under table 3-9:
The first example says "a+underdot+diaeresis==a+underdot+diaeresis" which is trivial (same sequences on both sides). I assume one of these a+underdot should be an a-underdot (dash instead of plus).
Actually, since the discussion starts with decomposed sequences, there should be no decomposables in the example?

Page 30 Standard Korean Syllables:
I find the notation of "^X" for "something else than X" unintuitive. Suggest to use strike-through or slash-through or maybe X-macron.

Page 32 Hangul Syllable Names:
Compared with the preceding algorithms, this is strangely short. Suggest to at least give the name of the UCD file with the short names.

-------------------------------------------------------------------------
02342-d-ch4utc.pdf - PDF page numbers, not book page numbers.

A nice, short, sweet chapter...

Page 2 4.1 UCD:
Bullet "Age (which version the code point was first assigned)"
-> I think this should be "designated" instead of "assigned" because code points like surrogates are "designated".

Page 3 4.4 Directionality:
Mentions two strong types, L and R. AL is the third strong type, right? Needs mentioning and distinction from R (separate script lists).

Page 4 4.5 General Category:
"See Table 2-2 for the relationship [of ]General Category values..."
-> insert missing [of ]

Page 5 4.6 Numeric Value:
Example in parentheses says "(1 + 5 = 15 fifteen, but I + V = IV four)"
-> Bad choice of "+" to indicate concatenation of _number_ characters. 1+5=6...
Suggest to use other concatenation symbols like "(<1, 5> = 15 fifteen, but <I, V> = IV four)" or similar.

Page 6 4.8 Unicode 1.0 Names:
As far as I remember recent discussion on one of the Unicode mailing lists, the Unicode 1.0 Name field is sometimes used for strings that were not actually names of those characters in Unicode 1.0, but are more of an annotation. Needs to be mentioned.
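
As a cross-check of the concrete values cited in the chapter 3 and chapter 4 comments above (the UTF-8 bytes for U+0430 and U+10000, and the three strong Bidi types), here is a small sketch. It assumes Python's standard unicodedata module and built-in UTF-8 codec; the helper name utf8_hex is hypothetical and not taken from the chapters under review.

```python
import unicodedata


def utf8_hex(cp: int) -> str:
    """Return the UTF-8 encoding of code point cp as spaced hex bytes."""
    return " ".join(f"{b:02X}" for b in chr(cp).encode("utf-8"))


if __name__ == "__main__":
    # Chapter 3, D36 comments: byte sequences for the cited code points.
    print(utf8_hex(0x0430))    # D0 B0
    print(utf8_hex(0x10000))   # F0 90 80 80

    # The sequence D1 90 is well-formed, but it decodes to U+0450.
    print(f"U+{ord(bytes([0xD1, 0x90]).decode('utf-8')):04X}")  # U+0450

    # Chapter 4, 4.4 comment: three strong types, L, R and AL.
    for ch in ("A", "\u05D0", "\u0627"):
        print(f"U+{ord(ch):04X} {unicodedata.bidirectional(ch)}")
```

The check on D1 90 suggests how the error may have arisen: D1 90 is the encoding of U+0450, not U+0430.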