L2/06-302
Date/Time: Wed Aug 23 20:15:47 CDT 2006
Contact: nobody@xyzzy.claranet.de
Name: Frank Ellermann
Report Type: Error Report
Opt Subject: UTS #22 (part 2 of 2)

Hi, below you find my further observations (mainly typos or omissions) on UTS #22 (CharMapML) revision 5:

1 - In chapter 3.1 "Header" the attribute 'combiningOrder' isn't explained or used anywhere. Probably it's about a default combiningOrder="after" for 'as in Unicode', or a combiningOrder="before" for legacy non-spacing characters. I've no idea what a parser could do with this info.

2 - In chapter 1.2 "Completeness" it's specified that only u+FFFD can be explicitly mapped to ASCII SUB 0x1A or other legacy SUBSTITUTE characters. The "only" is misleading: u+001A would also be mapped to a legacy SUBSTITUTE like 0x1A (or 0x7F in an IBM-rotated SBCS).

3 - In chapter 1.2 it's specified that all control values C0, DEL, or C1 must be explicitly mapped. It's not clear whether that affects only leading type="FIRST" bytes, or also trailing bytes. Does it mean that a mapping for, say, UTF-8 must explicitly contain for C1, or is it good enough if these C1 bytes have no type="FIRST" state? Or, for UTF-8 trailing bytes, should C0 and DEL be explicitly noted as INVALID? Example 5.2 doesn't do this, and in 3.3 the default (= implicit) result is INVALID unless it's explicitly different. Maybe there's a discrepancy between what's considered "explicit" in 1.2 and later. The whole concept of explicit INVALID states is unclear. In, say, UTF-8 "F0 BF BF C2 A0" the "C2" is invalid, and this might result in u+FFFD u+00A0 with one u+FFFD for "F0 BF BF"; an explicit INVALID state can't change this.

4 - This "explicit" issue also affects another point in chapter 1.2: "incomplete and illegal sequences must be explicitly indicated". Later that apparently means that the parser either terminates in a VALID state, or it found some incomplete trailing garbage, or it found and reported implicitly or explicitly invalid input before.
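
To make the recovery behaviour mentioned for "F0 BF BF C2 A0" in item 3 concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: it uses Python's built-in UTF-8 decoder with errors="replace", not a CharMapML parser, so it shows one common recovery policy (one u+FFFD for the ill-formed prefix, then a fresh start at the next byte) and not what UTS #22 requires. The variable names are made up for this example.

```python
# Illustration only: Python's built-in UTF-8 decoder, not a CharMapML parser.
# "F0 BF BF" is an incomplete 4-byte sequence; "C2" cannot continue it,
# but "C2 A0" is a well-formed 2-byte sequence on its own.
data = bytes([0xF0, 0xBF, 0xBF, 0xC2, 0xA0])

# errors="replace" substitutes U+FFFD for ill-formed input and resumes decoding.
decoded = data.decode("utf-8", errors="replace")

# Show the result as code points.
code_points = [f"U+{ord(ch):04X}" for ch in decoded]
print(code_points)
```

With CPython 3 this prints ['U+FFFD', 'U+00A0']: the truncated "F0 BF BF" is replaced as a whole, and decoding restarts at "C2".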

5 - The difference between UNASSIGNED and INVALID is somewhat obscure. 3.4.2 offers: "If a byte sequence is UNASSIGNED in the validity specification, it is invalid". That sounds as if "UNASSIGNED" were a synonym for "INVALID". 3.3 states: "neither max nor UNASSIGNED are necessary. They could both be determined by analyzing the assignment statements in the table." The assignment statements cover valid byte sequences; anything else is implicitly invalid, and it's impossible to determine whether it's additionally unassigned. So that statement is apparently wrong: the UNASSIGNED value could be necessary, only its purpose is unclear.

My vague impression is that an older version of this standard had only one of INVALID and UNASSIGNED. Getting rid of the explicit INVALID everywhere could help. With that I'd arrive at a clean concept of UNASSIGNED: in the UTF-8 "F0 BF BF C2 A0" example, "F0 BF BF C2" is implicitly invalid, i.e. a parser could substitute u+FFFD for the beginning "F0 BF BF", and then try "C2 A0". For an explicitly UNASSIGNED sequence, on the other hand, the parser would substitute u+FFFD for the complete UNASSIGNED sequence, not only for its beginning. In other words, the (implicitly) invalid vs. (explicitly) UNASSIGNED business is about error recovery, and any (explicitly) INVALID is unnecessary / confusing / pointless... or I simply miss the point.

6 - In 3.3 a 'max' attribute is specified for the <state> element with next="VALID". For cases like UTF-8 it would be more convenient to associate 'max' with the various leading type="FIRST" states, e.g. state type="FIRST" s="ED" max="D7FF" (without next="VALID"), if multiple 'max' values are supposed to be used for plausibility checks (?). If actually only one 'max' value is used, maybe as an accelerator for mappings from Unicode (?), then this unique 'max' should be an attribute of the element, not bound to a <state>. If it's too late to change 'max', please add an example where more than one 'max' could make sense.
A decent CharMapML validator can hopefully handle elements, but getting it right in conjunction with multiple 'max' values could be difficult.

7 - The EBCDIC example in chapter 3.3.2 contains the following lines: Maybe s/FEFE/FFFF/ or s/ff/fe/ twice, I can't tell what's correct.

8 - In 3.4 I couldn't figure out the 'version' details, the example is: I've omitted the c="?" for u+FF64. What's old and what's new in this case? Later you say "attribute v (optional) specifies the version. This is a year, followed optionally by a letter". That matches a similar explanation of the 'id' attribute in chapter 3.1, and then the v="source-someName-1995" would be incorrect.

9 - "The default value is zero" (with a bold 'zero'). What you want is likely an ASCII order of 0000 < 1993b < 1995, not 1995 < zero. If that's the case, please replace zero by 0 or, clearer, 0000.

A - Matter of taste: for a set "undetermined", "neither", "NFC", and "NFD" I'd pick "both" as the last value, not "NFC_NFD". For curious readers an "NFD" example could be interesting, or is that only a theoretical value?

B - In chapter 3.3 (validity specification) I read "VALID indicates valid completion and is the default value for the state element." What does this mean in conjunction with 'next CDATA #REQUIRED' in the DTD? For a default I'd expect 'next CDATA "VALID"' in the DTD. On the other hand, all examples use explicit next="VALID" values, so maybe delete "and is the default value for the state element."

C - In chapter 3.4 (assignments) it took me rather long to understand what 'bMin' and 'bMax' actually do. This is a cute notation, far from my first impression "the defaults should be bFirst and bLast". In a horror scenario like UTF-1, 'bMin' and 'bMax' make it possible to reduce the size of the charMap by almost 50%, e.g.
Maybe add the only five ranges required for UTF-8 to chapter 5.2.2, here are the two ranges with bMin < bFirst < bLast < bMax:

Regards, Frank

-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --

In case anyone was wondering what part 1 was...

Date/Time: Mon Aug 21 01:57:44 CDT 2006
Contact: bugzilla@xyzzy.claranet.de
Name: Frank Ellermann
Report Type: Error Report
Opt Subject: TR22 legal state

Hi, a minor typo in UTS #22 revision 5, http://www.unicode.org/reports/tr22/tr22-5.html: example 5.2.1 uses six <legal> elements instead of <state>.

Regards, F.Ellermann