L2/10-157
Author: C. E. Whitehead
Subject: Comments on UTS #10
Date: April 20, 2010

-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --

Date/Time: Tue Apr 20 11:53:20 CDT 2010
Contact: cewcathar@hotmail.com
Name: CE Whitehead
Report Type: Public Review Issue
Opt Subject: Proposed Update Unicode Technical Standard #10

Hi, below are my comments on the draft at:
http://www.unicode.org/reports/tr10/tr10-21.html
(Proposed Update Unicode Technical Standard #10, Unicode Collation Algorithm; last updated April 12. My apologies: I read through all but sections 1 and 2 with no concentration skills functioning, and hope at least I got all the proofreading. I have more questions on the Arabic character tanween-al-fatah in terms of collation -- but will send any comments separately.)

* * *

1.6 Interleaved Levels

"The problem with this approach is that high-level differences in the second field are swamped by minute differences in the first field, which results in unexpected ordering for the first names."

{COMMENT: I think paying attention to accents and such in the last name in a sort of names is more the norm -- in English at least -- than your text suggests, though I have not really researched this yet; however:

D'aure', Mary (that is, D’auré, Mary) in my system sorts after, not before, D'aure', Nicole (that is, D’auré, Nicole -- but I am avoiding special characters in case they don't go through)

likewise:

diSi'lva Fred (or diSílva Fred) sorts after diSilva John

NOTE: The way I learned to alphabetize, the last name is so much more important than the first name, even for accents and such. }

* * *

1.8 Item 4

"4. Collation order is not preserved under concatenation or substring operations, in general. For example, the fact that x is less than y does not mean that x + z is less than y + z.

"This is because characters may form contractions across the substring or concatenation boundaries.

"x < y ↛ xz < yz
x < y ↛ zx < zy
xz < yz ↛ x < y
zx < zy ↛ x < y"

{ COMMENT/QUESTION: The above does not make complete sense to me. If x, y and z are different words that are merged for sorting purposes, I am not sure that letters from separate words should be combined into contractions; normally letters would not form contractions across word boundaries -- so if that is what you are saying, I am in disagreement. }

* * *

5.1 Table; "numeric"

"If set to on, any sequence of Decimal Digits (General_Category = Nd in the Unicode Character Database [UAX44]) is sorted at a primary level with its numeric value. For example, "A-21" < "A-123 "

{COMMENT/QUESTION: I note that some languages do not have place value anyway -- but 'numeric' is only for decimal digits, so the comments on the 'No' and 'Nd' classifications at:
http://www.unicode.org/mail-arch/unicode-ml/y2010-m04/0051.html
remain irrelevant? }

* * *

PROOFREADING
(main issues are 3.3.1 and 4.4; my other comments are relatively trivial here)

1.9 par 2, last sentence

"Briefly stated, the Unicode Collation Algorithm takes an input Unicode string and a Collation Element Table, containing mapping data for characters. It produces a sort key, which is an array of unsigned 16-bit integers. Two or more sort keys so produced can then be binary-compared to give the correct comparison between the strings for which they were generated."

{ COMMENT: this is trivial, but change "comparison between the strings" to "comparison of the strings"? }

=> "Briefly stated, the Unicode Collation Algorithm takes an input Unicode string and a Collation Element Table, containing mapping data for characters. It produces a sort key, which is an array of unsigned 16-bit integers. Two or more sort keys so produced can then be binary-compared to give the correct comparison of the strings for which they were generated."

* * *

3.3.1 Expansions, 2nd sentence (though I've quoted both)

"The Latin letter æ is treated as an independent letter by default.
Collations such as English, which may require treating it as equivalent to an <a e> sequence, can tailor the letter to map to a sequence of more than one collation elements, such as in the following example:"

{**COMMENT: it's not "more than one elements", because here "element" must agree with "one": you have "one element", and then you append "more than" to the beginning to make "more than one element". }

=> "The Latin letter æ is treated as an independent letter by default. Collations such as English, which may require treating it as equivalent to an <a e> sequence, can tailor the letter to map to a sequence of more than one collation element, such as in the following example:"

* * *

3.3.2 Contractions, 3rd par (1st sentence, though I've quoted both)

"Any character (such as soft hyphen) that is not completely ignorable between two characters of a contraction will cause them to sort as separate characters. Thus a soft hyphen can be used to separate and cause distinct weighting of sequences such as Slovak ch or Danish aa that would normally weight as units."

{ COMMENT: "such as soft hyphen" => "such as the soft hyphen" -- I think the definite article is essential in English here. }

=> "Any character (such as the soft hyphen) that is not completely ignorable between two characters of a contraction will cause them to sort as separate characters. Thus a soft hyphen can be used to separate and cause distinct weighting of sequences such as Slovak ch or Danish aa that would normally weight as units."

* * *

3.5, last sentence

"This is done by providing these sequences as many to many mappings in the Collation Element Table."

{ COMMENT: nit-picky issue: "providing these sequences as many-to-many mappings" is o.k. but sounds un-English; "specifying these sequences as many to many mappings" sounds more English?? }

=> "This is done by specifying these sequences as many to many mappings in the Collation Element Table."
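
For reference, the multiple mappings quoted in 3.3.1, 3.3.2 and 3.5 above can be pictured with a toy Collation Element Table. The sketch below is only an illustration under stated assumptions, not the table format or the lookup procedure of UTS #10: the table TOY_TABLE, the function names, and every weight are invented for the example, and the lookup is a plain longest match over contiguous characters. Under those assumptions, æ expands to the same primary weights as a followed by e, ch contracts to a single element, and a soft hyphen between c and h keeps the contraction from forming.

```python
# Toy collation element table (hypothetical; all weights are invented).
# Keys are character sequences; values are lists of
# (primary, secondary, tertiary) collation elements.
TOY_TABLE = {
    "a": [(10, 1, 1)],
    "c": [(12, 1, 1)],
    "e": [(14, 1, 1)],
    "h": [(17, 1, 1)],
    "ch": [(18, 1, 1)],                    # contraction: two characters, one element
    "i": [(19, 1, 1)],
    "\u00e6": [(10, 1, 2), (14, 1, 2)],    # expansion: one character, two elements
    "\u00ad": [],                          # soft hyphen: no weights in this toy table
}


def collation_elements(text, table=TOY_TABLE):
    """Map text to collation elements using the longest contiguous match."""
    max_len = max(len(key) for key in table)
    elements = []
    i = 0
    while i < len(text):
        # Try the longest candidate sequence first, then shorter ones.
        for n in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + n]
            if chunk in table:
                elements.extend(table[chunk])
                i += n
                break
        else:
            raise KeyError(f"no mapping for {text[i]!r}")
    return elements


def primary_key(text):
    """Primary-level sort key: the non-zero primary weights, in order."""
    return [p for (p, _s, _t) in collation_elements(text) if p != 0]


words = ["i", "ch", "h", "c\u00adh"]
ordered = sorted(words, key=primary_key)
```

With this toy table, "ch" sorts between "h" and "i" as a unit, while "c", soft hyphen, "h" sorts with the words beginning in "c", which is the separating effect described in the 3.3.2 quotation.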
* * *

4.3 par 2

"An implementation may allow the maximum level to be set to a smaller level than the available levels in the collation element array. For example, if the maximum level is set to 2, then level 3 and higher weights are not appended to the sort key. Thus any differences at levels 3 and higher will be ignored, leveling any such differences in string comparison."

{ COMMENT: nit-picking again, and purely about stylistics: it might be better to start by mentioning the goal of setting the maximum level to a level lower than the available levels -- that is, to mention that there is a way to level differences in strings that are not considered major. }

=> "Optionally, an implementation may ignore differences at higher levels (for example, differences in diacritics, case, etc.). The way to do this is to allow the maximum level to be set to a smaller level than the available levels in the collation element array."

* * *

4.4 Note

"Note: At this point we can explain the reason for only allowing well-formed weights. If ill-formed weights were allowed, the ordering of elements can be incorrectly reflected in the sort key. For example, suppose the secondary weights of the Latin characters were zero (ignorable) and that (as normal) the primary weights of case-variants are equal: that is, a1 = A1. Then the following incorrect keys would be generated:"

{ **COMMENT: verb tense is off twice; you switch haphazardly between verb forms. The past form is used for unreal conditions in the present, yes, but then the verb in the second clause must always match it too: 'were' does not go with 'can', and 'were' does not go with 'are'. (Yes, you can say "can" at the beginning -- that is not part of the clauses about conditions!) See http://www.i-claudius.com/esl/condition.html if you get confused; people always get confused about these when they get short on sleep; I've taught this so never get confused. }

=> "Note: At this point we can explain the reason for only allowing well-formed weights. If ill-formed weights were allowed, the ordering of elements might be incorrectly reflected in the sort key. For example, suppose the secondary weights of the Latin characters were zero (ignorable) and that (as normal) the primary weights of case-variants were equal: that is, a1 = A1. Then the following incorrect keys would be generated:"

* * *

Appendix A.3.3, 3rd par, 1st sentence

"If such a modified sort comparison is used, for example, then it forces Quick sort to get the same results as a Merge sort"

{ COMMENT: trivial nit-picking again: for parallelism, you either need to make both nouns indefinite -- with the article 'a' -- or leave off the article in both cases, since syntactically these should be parallel. }

=> "If such a modified sort comparison is used, for example, then it forces a Quick sort to get the same results as a Merge sort."

* * *

OTHER COMMENTS ON LANGUAGE

3.7 Par 1 Item 2

"2. All Level N weights in Level N-1 ignorables must be strictly less than all weights in Level N-2 ignorables. For example, secondaries in non-ignorables must be strictly less than those in primary ignorables: "

{ COMMENT: "secondaries in non-ignorables"?? this is fine; however, just a comment: no outsider/novice can skip to one section of this document and make any sense of it at all. }

* * *

Best,

-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --

(End of Report)