L2/10-156

Author:  C. E. Whitehead
Subject: Comments on UTR #36
Date:    March 27, 2010

// Note:
// These comments have already been reviewed by the author of UTR #36.

-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --

Date/Time:    Sat Mar 27 18:37:28 CST 2010
Contact:      cewcathar@hotmail.com
Name:         C. E. Whitehead
Report Type:  Other Question, Problem, or Feedback
Opt Subject:  tr36 proposed (hopefully the most recent draft)

Here are comments on:

http://www.unicode.org/reports/tr36/proposed.html
http://www.unicode.org/draft/reports/tr36/tr36.html

{ Actually this was not on the list of public review issues but I found it among the reports and it was of interest to me and it seemed it was currently undergoing revision (+ it read faster than tr10);
sorry if I misjudged;
also I hope I got the most up-to-date version of this read properly }


2.5 Bidirectional Text Spoofing par 4

"In addition, the IRI specification extends those requirements to other components of an IRI, not just the host name labels. Not respecting them would result in insurmountable visual confusion. A large part of the confusability in reading an IRI containing bidi characters is created by the weak or neutral directionality property of many IRI/URI delimiters such as '/', '.', '?' which makes them change directionality depending on their surrounding characters. For example, in example #1 in the table below, the dots following each label are colored the same as that label. Notice that the placement of that following punctuation may vary."

{ COMMENT I do not believe that the '/' will change the direction it faces as a result of bidi;
thus it won't be confused with '\'
but yes its placement can change
and of course the direction that other characters face can be changed --
the latter would allow more spoofing }
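
{ ILLUSTRATION: a minimal sketch, assuming Python's standard unicodedata module, of how one might check the properties behind this comment. The helper name bidi_profile is made up for this note. It reports the Bidi_Class of each delimiter and whether the character is flagged as mirrored, which is the property that would make a glyph face the other way in right-to-left text. }

```python
import unicodedata

def bidi_profile(ch):
    """Return (Bidi_Class, is_mirrored) for a single character."""
    return unicodedata.bidirectional(ch), bool(unicodedata.mirrored(ch))

# Delimiters named in the quoted paragraph, plus '\' and '(' for comparison.
for ch in "/.?\\(":
    bidi_class, mirrored = bidi_profile(ch)
    print(repr(ch), bidi_class, "mirrored" if mirrored else "not mirrored")
```

{ A weak or neutral class with no mirroring would be consistent with the point above: the position of the delimiter can move, but its glyph is not flipped. }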

* * *
2.5 Bidi Examples
{ COMMENT:  hmm Do you want to add the following --
right after the table: }
=>
"which thus can be confused with:
http://دائم سلام .com  "

{ The above makes the example clearer for me;
hope I understood. }

* * *
{ COMMENTS on Bidi Examples CONTINUED:

See also:

http://196.200.140.8/Tests/Bidi-Fev-2010/bidiLinkText.html
and
http://lists.w3.org/Archives/Public/public-i18n-bidi/2010JanMar/0026.html

(if you have not looked at these) }
* * *
2.5.1

"In such cases, two characters may be visually distinct in a stand-alone form, but might not be distinct in a particular context."

{ COMMENT:  I do not see a real problem with Arabic this way; the characters have particular shapes and those with similar shapes change their shapes in similar ways;
there is only one character the yaa which flattens quite a bit in between other consonants
in some representations;
however, actually, my IE8 browser carefully shows each of the yaa's in a sequence as an individual character! 
(It might be best to put this out to other people on the list to see if this is an issue in other browsers
because I have IE8 and occasionally see stuff in IE7 or Mozilla but not often;
in certain contexts the hah looks a bit like the tah marbutah; that's it I think -- for Arabic itself;
there are other languages that use this script) }
* * *

PROOFREADING

2.6.1 Missing Glyphs 1st par

"It is very important not to show a missing glyph or character with a simple "?", 
since that makes every such character be visually confusable with a real question mark. "

{ COMMENT:  'makes . . . be' sounds colloquial;
you can simply use 'makes' followed by an NP in the oblique case, followed by an adjective, without 'be;' 
I would omit 'be.' }
=>
"It is very important not to show a missing glyph or character with a simple "?", since that makes every such character visually confusable with a real question mark."

* * * * * *

3.1 - 3.7 MORE PROOFREADING; A FEW COMMENTS ON CONTENT

* * *
3.1.2 last par
"UTF-16 converters that don't handle isolated surrogates correctly are subject to the same type of attack, although historically UTF-16 converters have had generally handled these well."
{COMMENT:  "have had" is not needed; just "have" }

=>
"UTF-16 converters that don't handle isolated surrogates correctly are subject to the same type of attack, although historically UTF-16 converters have generally handled these well."

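{ ILLUSTRATION: a minimal sketch of how a UTF-16 converter can treat an isolated surrogate, assuming Python's built-in "utf-16-le" codec. The helper name decode_utf16le and the sample bytes are made up for this note. The input is a lone high surrogate (D800) followed by the letter A. }

```python
def decode_utf16le(data, errors="strict"):
    """Decode little-endian UTF-16 bytes with the given error policy."""
    return data.decode("utf-16-le", errors=errors)

# D800 (a high surrogate with no low surrogate after it), then 'A'.
LONE_HIGH = b"\x00\xd8" + "A".encode("utf-16-le")

try:
    decode_utf16le(LONE_HIGH)
except UnicodeDecodeError as exc:
    print("strict decoding rejected the input:", exc.reason)

# With errors="replace" the bad code unit is substituted rather than dropped.
print(ascii(decode_utf16le(LONE_HIGH, errors="replace")))
```
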
3.2, 4th par

"For example, a fundamental standard, LDAP, is subject to this problem; thus steps --- taken to remedy this [LDAP]."

{ COMMENT:  your latest draft shows the verb "were" crossed out; 
you need to reinsert "were"
or another verb -- such as "have been" -- that is you need a verb here }

* * *
3.3 1st par Item 2 Bullet 2
"In Unicode 5.0, a new Stream-Safe Text Format is has been added to UAX#15: Unicode Normalization Forms [UAX15]. This format allows protocols to limit the number of characters that they need to buffer in handling normalization."
{COMMENT: delete "is" here}

=>
"◦In Unicode 5.0, a new Stream-Safe Text Format has been added to UAX#15: Unicode Normalization Forms [UAX15]. This format allows protocols to limit the number of characters that they need to buffer in handling normalization."
* * *

3.4 1st par

"The Unicode Consortium Stability Policies [Stability] limits . . . "
{ COMMENT:  since you've inserted the plural form "Policies" you now need to change "limits" to "limit" }

=> 
"The Unicode Consortium Stability Policies [Stability] limit . . .  "


* * *
3.4 last par

"An implementation may need to make certain assumptions for performance — 
ones that are not guaranteed by the policies. 
In such a case, 
it is recommended to at least have unit tests that detect 
whether those assumptions have become invalid when 
the implementation is upgraded to a new version of Unicode. 
That allows the code to be revised 
if that were to happen."

{ COMMENT : repetitive, redundant with "if that were to happen" on the end -- 
"If" is understood -- that is we understand that we are talking about something hypothetical
whenever you have "In such a case"
-- and so the second "If" with the subjunctive actually sounds 'wishy washy' here;
you don't need to keep saying 'if;' we know you mean 'if.' 
Also I tend to prefer the plural form, "In such cases" or "For such cases," 
to describe a possibility
that something might happen -- that is I tend to like to focus on multiple possibilities 
(this is a personal choice however);
also finally
"That" -- in the last sentence -- refers to something less immediate and more remote;
I prefer "This" }
=>"An implementation may need to make certain assumptions for performance — 
ones that are not guaranteed by the policies. 
For such cases, 
it is recommended to at least have unit tests that detect 
whether those assumptions have become invalid when 
the implementation is upgraded to a new version of Unicode. 
This allows the code to be revised 
when this happens."

{ COMMENT2:  Alternately, I might repeat "in these cases" ( well I've varied it slightly; it was
"For such cases" ) 
here instead of saying 
"when this happens" }

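{ ILLUSTRATION: a rough sketch of the kind of unit test the quoted paragraph recommends, assuming Python's unittest and unicodedata modules. The assumption being guarded (that no single code point expands to more than 18 code points under NFKD) and the names ASSUMED_MAX_NFKD_LENGTH and longest_nfkd_expansion are made up for this note; a real implementation would test whatever assumptions it actually relies on. }

```python
import sys
import unicodedata
import unittest

# Hypothetical buffer size that some implementation is assumed to rely on.
ASSUMED_MAX_NFKD_LENGTH = 18

def longest_nfkd_expansion(code_points=range(sys.maxunicode + 1)):
    """Return the longest NFKD result, in code points, over the given inputs."""
    return max(len(unicodedata.normalize("NFKD", chr(cp))) for cp in code_points)

class UnicodeAssumptionTest(unittest.TestCase):
    def test_nfkd_expansion_fits_buffer(self):
        # Fails loudly if an upgraded Unicode database breaks the assumption.
        self.assertLessEqual(
            longest_nfkd_expansion(),
            ASSUMED_MAX_NFKD_LENGTH,
            "assumption broken under Unicode " + unicodedata.unidata_version,
        )
```

{ Run under "python -m unittest" after each upgrade, a failing test of this kind names the Unicode version at which the assumption stopped holding. }
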
* * *
3.5 Deletion of Code Points 1st par

"C7. When a process purports not to modify 
the interpretation of a valid coded character sequence, 
it shall make no change to that coded character sequence other than 
the possible replacement of character sequences 
by their canonical-equivalent sequences 
or the deletion of noncharacter code points"

{ COMMENT:  o.k. I guess; I prefer an -ing form after "no change to" -- and -ing forms can act like nouns you know 
and in this sentence using -ing forms saves words --
so I'd say "other than possibly replacing character sequences . . . 
or deleting noncharacter code points" }

=>
"C7. When a process purports not to modify 
the interpretation of a valid coded character sequence, 
it shall make no change to that coded character sequence other than 
possibly replacing character sequences 
by their canonical-equivalent sequences 
or deleting noncharacter code points"

{
* COMMENT ON CONTENT  -- 

I personally might like to know when noncharacter code points were deleted myself
-- but y'all deleted the warning here

which is fine;

the typical problem example is formatting characters -- which can change directionality,
converting a string that appears to be one thing into an entirely different domain name;
you've already given examples of this in 2.5 above;
you could add a reference to these here. }

* * *

3.5 Deletion of Code Points continued par 3

"Whenever a character is invisibly deleted (instead of replaced), it may cause a security problem. "

{ COMMENT:  now that you've cut out the intervening text, you need a transition -- 
such as "Nevertheless"  -- since you've changed your focus! }

=>
"Nevertheless, whenever a character is invisibly deleted (instead of replaced), 
it may cause a security problem. "
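
{ ILLUSTRATION: a minimal sketch of why invisible deletion can be a problem, in Python. The function names and the sample string are made up for this note. A filter that looks for "<script>" does not match the input, but a later stage that silently deletes the noncharacter U+FFFF recreates the very string the filter was looking for; replacing with U+FFFD does not. }

```python
def is_noncharacter(ch):
    """True for U+FDD0..U+FDEF and the last two code points of each plane."""
    cp = ord(ch)
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

def strip_noncharacters(text):
    """Risky: deletes noncharacters without leaving any trace."""
    return "".join(ch for ch in text if not is_noncharacter(ch))

def replace_noncharacters(text):
    """Safer: substitutes U+FFFD so the string stays visibly altered."""
    return "".join("\ufffd" if is_noncharacter(ch) else ch for ch in text)

payload = "<scr\uffffipt>"
print("<script>" in payload)                         # what the filter sees
print("<script>" in strip_noncharacters(payload))    # after deletion
print("<script>" in replace_noncharacters(payload))  # after replacement
```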

* * *
3.6.2 par 1

"Similar to the considerations in 3.5 Deletion of Noncharacters, 
character encoding conversion must also not simply skip an illegal input byte sequence 
but rather stop with an error or substitute a Replacement Character 
or an escape sequence etc. in the output. 
It is important to do this not only for byte sequences that encode characters, but also for unrecognized or "empty" state-change sequences. "

{ COMMENT:  'etc'?  I use etc. all the time, because it's quick and concise, 
but what does etc. mean?
If you can say briefly what it refers to, in three or four noun phrases,
that would be better.  }
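
{ ILLUSTRATION: a small sketch of the options named in the quoted paragraph, assuming Python's built-in UTF-8 codec and its standard error handlers. The helper name decode_utf8 and the sample bytes are made up for this note. "ignore" skips the illegal byte, which is the behavior the paragraph warns against; "strict" stops with an error; "replace" substitutes U+FFFD; "backslashreplace" writes an escape sequence. }

```python
RAW = b"<scr\xffipt>"  # 0xFF is never legal in UTF-8

def decode_utf8(data, errors):
    """Decode UTF-8 bytes using one of Python's standard error handlers."""
    return data.decode("utf-8", errors=errors)

for policy in ("ignore", "replace", "backslashreplace"):
    print(policy, ascii(decode_utf8(RAW, policy)))

try:
    decode_utf8(RAW, "strict")
except UnicodeDecodeError as exc:
    print("strict stopped with an error at byte offset", exc.start)
```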

* * *

3.7 Par 1 3rd bullet

"These problems come up in other situations besides file systems as well. A common source is when a byte string that is valid in one charset is converted by a different charset's converter. For example, the byte string <E0 30> that is invalid in SJIS is perfectly meaningful in Latin-1, representing "à0". "

{ COMMENT:  this should not be a bullet.  It belongs below the two bulleted items.  
It contains a comment on them; and besides you've only prepared your reader
for two items in this list. }
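
{ ILLUSTRATION: the byte string from the quoted example, tried against two converters. This is a minimal sketch assuming Python's "latin-1" and "shift_jis" codecs stand in for the Latin-1 and SJIS converters the text has in mind; the names SAMPLE and try_decode are made up for this note. }

```python
SAMPLE = bytes([0xE0, 0x30])  # the <E0 30> sequence from the quoted text

def try_decode(data, charset):
    """Return the decoded text, or None if the charset rejects the bytes."""
    try:
        return data.decode(charset)
    except UnicodeDecodeError:
        return None

print(ascii(try_decode(SAMPLE, "latin-1")))
print(try_decode(SAMPLE, "shift_jis"))
```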

* * *
3.7 last par

"A possible solution to this is to enable all charset converters to losslessly (reversibly) 
convert to Unicode. That is, any sequence of bytes can be converted 
by each charset converter to a Unicode string, 
and that Unicode string will be converted back to 
exactly that original sequence of bytes by that converter. 
This precludes, for example, the charset converter's mapping 
two different unmappable byte sequences to U+FFFD ( � ) 
REPLACEMENT CHARACTER, 
since the original bytes could not be recovered. 
It also precludes having "fallbacks" (see http://unicode.org/reports/tr22/): 
cases where two different byte sequences map to the same Unicode sequence. "

{ COMMENT: verb tense issue -- 
"could" in "could not be recovered" is the past tense
of the modal can;
you usually use "can" with a sentence in the present; 

COMMENT ON CONTENT:  
what if the byte sequences that are different are visually confusable?
Maybe I don't understand . . .
It seems that maybe something like PEP 383 can handle security issues that arise . . . 
I hope I understand this }
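
{ ILLUSTRATION: a minimal sketch of the PEP 383 approach mentioned above, using the "surrogateescape" error handler in Python. The helper names to_text and to_bytes are made up for this note. With "replace", two different unmappable byte sequences both become U+FFFD and the original bytes cannot be recovered; with "surrogateescape", each stays distinct and converts back to exactly the original bytes. Whether this addresses the visual-confusability question raised above is a separate matter that the sketch does not settle. }

```python
def to_text(data):
    """Decode bytes as UTF-8, escaping undecodable bytes per PEP 383."""
    return data.decode("utf-8", errors="surrogateescape")

def to_bytes(text):
    """Reverse to_text, restoring any escaped bytes."""
    return text.encode("utf-8", errors="surrogateescape")

first, second = b"name\xff", b"name\xfe"  # two different invalid sequences

# Lossy: both collapse to the same string ending in U+FFFD.
print(first.decode("utf-8", "replace") == second.decode("utf-8", "replace"))

# Reversible: the strings differ, and each converts back to its own bytes.
print(to_text(first) != to_text(second))
print(to_bytes(to_text(first)) == first, to_bytes(to_text(second)) == second)
```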

* * *

3.7.2 3rd bullet

"an 40"

{ COMMENT: ? 40 begins with a consonant sound }

=>?  "a 40"

* * *

Best,

C. E. Whitehead
cewcathar@hotmail.com

-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --