L2/10-156

Author: C. E. Whitehead
Subject: Comments on UTR #36
Date: March 27, 2010

// Note:
// These comments have already been reviewed by the author of UTR #36.

-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --

Date/Time: Sat Mar 27 18:37:28 CST 2010
Contact: cewcathar@hotmail.com
Name: C. E. Whitehead
Report Type: Other Question, Problem, or Feedback
Opt Subject: tr36 proposed (hopefully the most recent draft)

Here are comments on:
http://www.unicode.org/reports/tr36/proposed.html
http://www.unicode.org/draft/reports/tr36/tr36.html

{ Actually this was not on the list of public review issues, but I found it among the reports, it was of interest to me, and it seemed to be currently undergoing revision (+ it read faster than tr10); sorry if I misjudged; also I hope I got the most up-to-date version of this and read it properly }

2.5 Bidirectional Text Spoofing
par 4

"In addition, the IRI specification extends those requirements to other components of an IRI, not just the host name labels. Not respecting them would result in insurmountable visual confusion. A large part of the confusability in reading an IRI containing bidi characters is created by the weak or neutral directionality property of many IRI/URI delimiters such as '/', '.', '?' which makes them change directionality depending on their surrounding characters. For example, in example #1 in the table below, the dots following each label are colored the same as that label. Notice that the placement of that following punctuation may vary."
{ COMMENT: I do not believe that the '/' will change the direction it faces as a result of bidi; thus it won't be confused with '\', but yes, its placement can change, and of course the direction that other characters face can be changed -- the latter would allow more spoofing }

* * *

2.5 Bidi Examples

{ COMMENT: hmm Do you want to add the following -- right after the table: }

=> "which thus can be confused with: http://دائم سلام .com "

{ The above makes the example clearer for me; hope I understood. }

* * *

{ COMMENTS on Bidi Examples CONTINUED: See also:
http://196.200.140.8/Tests/Bidi-Fev-2010/bidiLinkText.html
and
http://lists.w3.org/Archives/Public/public-i18n-bidi/2010JanMar/0026.html
(if you have not looked at these) }

* * *

2.5.1

"In such cases, two characters may be visually distinct in a stand-alone form, but might not be distinct in a particular context."

{ COMMENT: I do not see a real problem with Arabic this way; the characters have particular shapes and those with similar shapes change their shapes in similar ways; there is only one character, the yaa, which flattens quite a bit in between other consonants in some representations; however, actually, my IE8 browser carefully shows each of the yaa's in a sequence as an individual character! (It might be best to put this out to other people on the list to see if this is an issue in other browsers, because I have IE8 and occasionally see stuff in IE7 or Mozilla but not often; in certain contexts the hah looks a bit like the tah marbutah; that's it I think -- for Arabic itself; there are other languages that use this script) }

* * *

PROOFREADING

2.6.1 Missing Glyphs
1st par

"It is very important not to show a missing glyph or character with a simple "?", since that makes every such character be visually confusable with a real question mark. "

{ COMMENT: 'makes . . . be' sounds colloquial; you can simply use 'makes' followed by an NP in the oblique case, followed by an adjective, without 'be;' I would omit 'be.'
} => "It is very important not to show a missing glyph or character with a simple "?", since that makes every such character visually confusable with a real question mark."

* * *

* * *

3.1 - 3.7 MORE PROOFREADING; A FEW COMMENTS ON CONTENT

* * *

3.1.2 last par

"UTF-16 converters that don't handle isolated surrogates correctly are subject to the same type of attack, although historically UTF-16 converters have had generally handled these well."

{ COMMENT: "have had" not needed; just "have" WHAT? }

=> "UTF-16 converters that don't handle isolated surrogates correctly are subject to the same type of attack, although historically UTF-16 converters have generally handled these well."

* * *

3.2, 4th par

"For example, a fundamental standard, LDAP, is subject to this problem; thus steps --- taken to remedy this [LDAP]."

{ COMMENT: your latest draft shows the verb "were" crossed out; you need to reinsert "were" or another verb -- such as "have been" -- that is, you need a verb here }

* * *

3.3 1st par Item 2 Bullet 2

"In Unicode 5.0, a new Stream-Safe Text Format is has been added to UAX#15: Unicode Normalization Forms [UAX15]. This format allows protocols to limit the number of characters that they need to buffer in handling normalization."

{ COMMENT: delete "is" here }

=> "◦In Unicode 5.0, a new Stream-Safe Text Format has been added to UAX#15: Unicode Normalization Forms [UAX15]. This format allows protocols to limit the number of characters that they need to buffer in handling normalization."

* * *

3.4 1st par

"The Unicode Consortium Stability Policies [Stability] limits . . . "

{ COMMENT: since you've inserted the plural form "Policies" you now need to change "limits" to "limit" }

=> "The Unicode Consortium Stability Policies [Stability] limit . . . "

* * *

3.4 last par

"An implementation may need to make certain assumptions for performance — ones that are not guaranteed by the policies.
In such a case, it is recommended to at least have unit tests that detect whether those assumptions have become invalid when the implementation is upgraded to a new version of Unicode. That allows the code to be revised if that were to happen."

{ COMMENT: repetitive, redundant with "if that were to happen" at the end -- "If" is understood -- that is, we understand that we are talking about something hypothetical whenever you have "In such a case" -- and so the second "If" with the subjunctive actually sounds 'wishy washy' here; you don't need to keep saying 'if;' we know you mean 'if.' Also I tend to prefer the plural form, "In such cases" or "For such cases," to describe a possibility that something might happen -- that is, I tend to like to focus on multiple possibilities (this is a personal choice however); also, finally, "That" -- in the last sentence -- refers to something less immediate and more remote; I prefer "This" }

=> "An implementation may need to make certain assumptions for performance — ones that are not guaranteed by the policies. For such cases, it is recommended to at least have unit tests that detect whether those assumptions have become invalid when the implementation is upgraded to a new version of Unicode. This allows the code to be revised when this happens."

{ COMMENT2: Alternately, I might repeat "in these cases" ( well I've varied it slightly; it was "For such cases" ) here instead of saying "when this happens" }

* * *

3.5 Deletion of Code Points
1st par

"C7. When a process purports not to modify the interpretation of a valid coded character sequence, it shall make no change to that coded character sequence other than the possible replacement of character sequences by their canonical-equivalent sequences or the deletion of noncharacter code points"

{ COMMENT: o.k.
I guess; I prefer an -ing form after "no change to" -- and -ing forms can act like nouns, you know, and in this sentence using -ing forms saves words -- so I'd say "other than possibly replacing character sequences . . . or deleting noncharacter code points" }

=> "C7. When a process purports not to modify the interpretation of a valid coded character sequence, it shall make no change to that coded character sequence other than possibly replacing character sequences by their canonical-equivalent sequences or deleting noncharacter code points"

{ * COMMENT ON CONTENT -- I personally might like to know when noncharacter code points were deleted myself -- but y'all deleted the warning here, which is fine; the typical problem example is formatting characters -- which can change directionality, converting a string that appears to be one thing into an entirely different domain name; you've already given examples of this in 2.5 above; you could add a reference to these here. }

* * *

3.5 Deletion of Code Points continued
par 3

"Whenever a character is invisibly deleted (instead of replaced), it may cause a security problem. "

{ COMMENT: now that you've cut out the intervening text, you need a transition -- such as "Nevertheless" -- since you've changed your focus! }

=> "Nevertheless, whenever a character is invisibly deleted (instead of replaced), it may cause a security problem. "

* * *

3.6.2 par 1

"Similar to the considerations in 3.5 Deletion of Noncharacters, character encoding conversion must also not simply skip an illegal input byte sequence but rather stop with an error or substitute a Replacement Character or an escape sequence etc. in the output. It is important to do this not only for byte sequences that encode characters, but also for unrecognized or "empty" state-change sequences. "

{ COMMENT: 'etc'? I use etc. all the time, because it's quick and concise, but what does etc. mean?
If you can say briefly what it refers to, in three or four noun phrases, that would be better. }

* * *

3.7 Par 1 3rd bullet

"These problems come up in other situations besides file systems as well. A common source is when a byte string that is valid in one charset is converted by a different charset's converter. For example, the byte string that is invalid in SJIS is perfectly meaningful in Latin-1, representing "à0". "

{ COMMENT: this should not be a bullet. It belongs below the two bulleted items. It contains a comment on them; and besides, you've only prepared your reader for two items in this list. }

* * *

3.7 last par

"A possible solution to this is to enable all charset converters to losslessly (reversibly) convert to Unicode. That is, any sequence of bytes can be converted by each charset converter to a Unicode string, and that Unicode string will be converted back to exactly that original sequence of bytes by that converter. This precludes, for example, the charset converter's mapping two different unmappable byte sequences to U+FFFD ( � ) REPLACEMENT CHARACTER, since the original bytes could not be recovered. It also precludes having "fallbacks" (see http://unicode.org/reports/tr22/): cases where two different byte sequences map to the same Unicode sequence. "

{ COMMENT: verb tense issue -- "could" in "could not be recovered" is the past tense of the modal can; you usually use "can" with a sentence in the present;
COMMENT ON CONTENT: what if the byte sequences that are different are visually confusable? Maybe I don't understand . . . It seems that maybe something like PEP 383 can handle security issues that arise . . . I hope I understand this }

* * *

3.7.2 3rd bullet

"an 40"

{ COMMENT: ? 40 begins with a consonant sound }

=>? "a 40"

* * *

Best,

C. E. Whitehead
cewcathar@hotmail.com

-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --