L2/06-362
Source: Martin Duerst
Date: 2006-10-23
To: unicore@unicode.org
Subject: IDNA

Dear Unicorers,

Below please find a copy of my comments on
draft-alvestrand-idna-bidi-00.txt that I sent directly to the authors.

I totally agree with Mati that the solution goes too far; indeed I
found some examples that produce positively bad results. I think
Mati's proposed text looks good; it may be possible to be a bit more
liberal, but any such step would need extremely careful scrutiny
and lots of examples.

Regards,    Martin.

>Date: Tue, 24 Oct 2006 14:31:01 +0900
>To: Harald Tveit Alvestrand , ck@nrm.museum
>From: Martin Duerst
>Subject: draft-alvestrand-idna-bidi-00.txt
>Cc: suignard
>
>Hello Harald, Cary,
>
>Here are some comments on draft-alvestrand-idna-bidi-00.txt.
>Is there a mailing list for this? If yes, please tell me and
>forward these comments there. Comments are labeled by >>>.
>
>
>            An IDNA problem in right-to-left scripts
>                 draft-alvestrand-idna-bidi-00
>
>Abstract
>
>   The use of right-to-left scripts in internationalized domain names
>   has presented several challenges. This memo discusses one problem
>   resulting from a constraint on the use of combining characters at the
>   end of an RTL domain label, resulting in some words being declared
>   invalid as IDN labels, and proposes a means for ameliorating this
>   problem.
>
>
>1. Introduction and problem description
>
>   The IDNA specification "Stringprep", [RFC3454] makes the following
>   statement in its section 6 on the bidi algorithm, :
>
>      3) If a string contains any RandALCat character, a RandALCat
>         character MUST be the first character of the string, and a
>         RandALCat character MUST be the last character of the string.
>
>   (A RandAlCat character is a character with unambiguously right-to-
>   left directionality.)
>
>   The reasoning behind this prohibition is to ensure that every
>   component of a visually presented domain name has an unambiguously
>   preferred direction.
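
The quoted requirement 3 can be tried out with a minimal sketch. It
assumes Python and its standard unicodedata module, whose ucd_3_2_0
object exposes the Unicode 3.2 database that RFC 3454 is based on. The
function names are hypothetical, and only requirement 3 is checked,
not the rest of Stringprep. The second sample string ends in U+05B8
HEBREW POINT QAMATS, the character discussed in section 2.2 of the
draft.

```python
import unicodedata

# Stringprep (RFC 3454) is tied to Unicode 3.2, so use that database.
UCD = unicodedata.ucd_3_2_0


def is_randalcat(ch):
    """True if ch has bidirectional category R or AL."""
    return UCD.bidirectional(ch) in ("R", "AL")


def passes_rule_3(label):
    """Hypothetical helper: check only requirement 3 of RFC 3454, section 6."""
    if not any(is_randalcat(ch) for ch in label):
        return True  # the rule only applies if a RandALCat character is present
    return is_randalcat(label[0]) and is_randalcat(label[-1])


print(passes_rule_3("\u05d0\u05d1"))  # ALEF, BET: both are category R
print(passes_rule_3("\u05d0\u05b8"))  # ALEF, QAMATS: last character is NSM
```

The first call accepts the label and the second rejects it, because
the trailing combining mark has category NSM rather than R or AL.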
>
>>>> Not exactly, although that's a way to explain it. One even more
>>>> important (because more confusing if not attained) purpose is
>>>> to make sure that the characters of each label stay together,
>>>> and the dots clearly visually separate the labels.
>
>   However, this makes certain words in languages
>   written with right-to-left scripts invalid as IDN labels, and in at
>   least one case means that all the words of an entire language are
>   forbidden as IDN labels.
>
>   This will be illustrated below with examples taken from the Dhivehi
>   and Yiddish languages, as written with the Thaana and Hebrew scripts,
>   respectively.
>
>>>> I think this is a very valuable observation, and shows that there
>>>> was an oversight in the original design. BTW, exactly the same
>>>> restrictions also apply to IRI components (see RFC 3987
>>>> (http://www.ietf.org/rfc/rfc3987.txt), Section
>>>> 4.2), although they are SHOULDs there, rather than MUSTs (mainly
>>>> to take into account the fact that URIs/IRIs also serve to carry
>>>> data (especially in the query part), where a MUST restriction
>>>> might be fatal to functionality).
>>>> [I have cc'ed Michel Suignard, co-author of RFC 3987, because
>>>> I think that eventually, we should coordinate our changes and
>>>> integrate them into an update of RFC 3987.]
>
>   The problem may be addressed by more carefully considering the bidi
>   algorithm in Unicode Standard Annex #9 [UAX9] which states in section
>   3.3.3 W1: "Examine each non-spacing mark (NSM) in the level run, and
>   change the type of the NSM to the type of the previous character."
>   ("Previous" as used here refers to the sequence of Unicode characters
>   in a data stream, and is not related to the positions of the
>   characters when displayed.)
>
>>>> This is a very good observation. Whose idea was it?
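
Rule W1 as quoted above can be sketched in a few lines. This is a
simplified illustration, not the full UAX#9 algorithm: it assumes
Python's unicodedata.ucd_3_2_0 database, treats the whole string as a
single level run, and takes the start-of-run type as a parameter (the
name sor and its default "L" are assumptions made for this sketch).
The function names are hypothetical, and only NSM is resolved; numbers
and neutrals are deliberately left alone.

```python
import unicodedata

UCD = unicodedata.ucd_3_2_0  # Unicode 3.2, as referenced by RFC 3454


def resolve_nsm(text, sor="L"):
    """Apply only rule W1: each NSM takes the type of the previous character.

    Assumption: the whole string is one level run, and an NSM in first
    position takes the type given by sor.
    """
    resolved = []
    previous = sor
    for ch in text:
        category = UCD.bidirectional(ch)
        if category == "NSM":
            category = previous
        resolved.append(category)
        previous = category
    return resolved


def passes_rule_3_after_w1(label):
    """Hypothetical relaxed check: test first and last types after W1."""
    types = resolve_nsm(label)
    if not any(t in ("R", "AL") for t in types):
        return True
    return types[0] in ("R", "AL") and types[-1] in ("R", "AL")


print(resolve_nsm("\u05d0\u05b8"))             # ALEF followed by QAMATS
print(passes_rule_3_after_w1("\u05d0\u05b8"))  # label ending in a combining mark
```

With W1 applied first, the QAMATS inherits category R from the
preceding ALEF, so a label ending in that mark would no longer be
rejected by the first/last check.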
>
>   A note on terminology:
>
>   In this memo, we use "network order" to describe the sequence of
>   characters as transmitted on the wire or stored in a file; the terms
>   "first", "next" and "previous" are used to refer to the relationship
>   of characters in network order.
>
>   We use "display order" to talk about the sequence of characters as
>   imaged on a display medium; the terms "left" and "right" are used to
>   refer to the relationship of characters in display order.
>
>>>> This explanation should come earlier; then you wouldn't need
>>>> the sentence explaining "previous" two paragraphs earlier.
>
>
>2. Detailed examples
>
>
>2.2. Yiddish
>
>   The directionality class of U+05B8 HEBREW POINT QAMATS in the Unicode
>   database is NSM, which again causes the IDNA algorithm to reject the
>   string. (It may also be noted that the requisite combined characters
>   also exist in precomposed form at separate positions in the Unicode
>   chart. However, Stringprep also rejects those codepoints, for
>   reasons not discussed here.)
>
>>>> This was mostly done because of feedback from Israel, which said
>>>> that allowing Hebrew vowels would cause problems because:
>>>> 1) Not supported in many applications/encodings
>>>> 2) Creates multiple variants of what's essentially the same thing
>>>>    (one without vowel marks, some with vowel marks), which may
>>>>    increase registration costs or lead to spoofing.
>>>> It's probably difficult to strike a balance between Hebrew and
>>>> Yiddish.
>
>>>> If possible, I suggest also finding some Arabic examples, or
>>>> explaining why you think there is no problem for Arabic.
>
>
>3. Modification to RFC 3454
>
>   If the following modification is made to RFC 3454, we believe that
>   the usefulness of the specification for languages written with right-
>   to-left scripts will be significantly improved:
>
>   Old text:
>
>      [Unicode3.2] defines several bidirectional categories; each
>      character has one bidirectional category assigned to it.
>      For the purposes of the requirements below, an "RandALCat
>      character" is a character that has Unicode bidirectional
>      categories "R" or "AL"; an "LCat character" is a character that
>      has Unicode bidirectional category "L".
>
>   New text:
>
>      [Unicode3.2] defines several bidirectional categories; each
>      character has one bidirectional category assigned to it.
>
>      For characters that have category "R", "AL" or "L", the category
>      is fixed (UAX#9 defines them as having "strong" category); for
>      characters in category EN, ES, ET, AN, CS, NSM, BN, B, S, WS and
>      ON, the category is determined by applying the algorithm described
>      in UAX#9 section 3.3 to the string.
>
>      For the purposes of the requirements below, an "RandALCat
>      character" is a character that, after this determination, has
>      Unicode bidirectional categories "R" or "AL"; an "LCat character"
>      is a character that has Unicode bidirectional category "L".
>
>>>> I think the details here have to be examined very carefully.
>>>> Saying that NSM following R or AL is included in RandALCat makes
>>>> a lot of sense. But allowing numbers (EN, AN) near the boundary
>>>> will mean that weird things will start to happen. In particular,
>>>> it will mean that when displaying a full domain name (several
>>>> labels and their separating dots), parts of labels can get
>>>> separated. Consider the following example (upper case is RTL):
>>>>
>>>>   logical/network order:       display (LTR context):
>>>>   abc.def3.4GHI.JKL.mno.pqr    abc.def3.4LKJ.IHG.mno.pqr
>>>>
>>>> The '4' above suddenly ends up on the wrong label. Similar
>>>> problems will happen with RTL display contexts.
>>>> So while I agree with the general direction, the details have to
>>>> be considered much more carefully, with a lot of examples, and
>>>> with people who have a lot of experience.
>
>   Note that Unicode 5.0 is the current version of Unicode.
>   This fix refers to Unicode 3.2 only, to maintain consistency with
>   the rest of RFC 3454; nothing here should affect the relationship
>   between Unicode versions and IDNA.
>
>>>> It is unclear whether this means "we need to do this on Unicode 3.2
>>>> because IDNA is Unicode 3.2" or "this is independent of Unicode
>>>> version, and IDNA may at some point be upgraded, but because it is
>>>> currently Unicode 3.2, our wording is for Unicode 3.2 also".
>>>> I hope it's the latter, and I hope it can be made clearer.
>
>   Also, as noted in the introduction, the Unicode UAX#9 algorithm is
>   quite complex. For the purposes of IDNA, a simpler algorithm may be
>   defind that yields the same result within the constraints of this
>   context, but may be easier for people to implement consistently.
>   Such an algorithm may be included in later versions of this memo.
>
>>>> defind -> defined
>
>4. Other issues in need of resolution
>
>   Another set of issues concerns the proper display of IDNs with a
>   mixture of LTR and RTL labels, or only RTL labels; it is not clear to
>   these authors what the proper display order of the components of a
>   domain name are if the directiion of the components (in network
>   order) is, for instance, FirstRTL.SecondRTL.LTR - is it
>   LTRtsriF.LTRdnoceS.LTR or LTRdnoceS.LTRtsrif.LTR? Again, this memo
>   does not attempt to suggest a solution to this problem.
>
>>>> RFC 3987 clearly specifies in Section 4.1 that the display should
>>>> occur in an LTR context, which means that "LTRdnoceS.LTRtsrif.LTR"
>>>> is correct. In an RTL context, one would get "LTR.LTRtsriF.LTRdnoceS".
>>>> "LTRtsriF.LTRdnoceS.LTR" could only be obtained with the Unicode
>>>> algorithm if additional control characters were inserted.
>>>> In practice, I'm expecting that RTL context (i.e.
>>>> "LTRtsriF.LTRdnoceS.LTR") will also be used to some extent, but
>>>> that the distinction usually will be possible from context, mainly
>>>> because the sets of customary labels for TLDs and for
>>>> "Bottom-Level Domain" (e.g. www, ftp,...) are disjoint.
>>>> [The use of LTR/RTL in the example itself is confusing, as you
>>>> may have observed yourself, because on display, the distinction is lost.]
>
>5. Backwards compatibility considerations
>
>   As with any change to an existing standard, it is important to
>   consider what happens with existing implementations when the change
>   is introduced. The following troublesome cases have been noted:
>
>   o  Old program used to input the newly allowed string. If the old
>      program checks the input against RFC 3454, the string will not be
>      allowed, and that domain name will remain inaccessible.
>
>   o  Old program is asked to display the newly allowed string, and
>      checks it against RFC 3454 before displaying. The program will
>      perform some kind of fallback, most likely displaying the Punycode
>      form of the string.
>
>>>> To my knowledge, nameprep and similar things are still rather
>>>> spottily or not at all implemented in browsers that support IDN.
>>>> But that may have changed, or may change soon.
>
>   o  Old program tries to display the newly allowed string. If the old
>      program has code for displaying the last character of a string
>      that is different from the code used to display the characters in
>      the middle of the string, display may be inconsistent and cause
>      confusion.
>
>>>> This is a rather contorted case. It could show up as the result
>>>> of a weird bug, but I wouldn't expect it otherwise.
>
>   One particular example of the last case is if a program chooses to
>   examine the last character (in network order) of a string in order to
>   determine its directionality, rather than its first; if it finds an
>   NSM character and tries to display the string as if it was a left-to-
>   right string, the resulting display may be interesting, but not
>   useful.
>
>>>> Have you actually seen such a thing in the wild? It would work under
>>>> the assumption that in order to determine how to display a label,
>>>> only the overall directionality of the label is relevant. This is
>>>> utterly wrong. A label ABC123DEF (upper case is RTL) has to be
>>>> displayed as FED123CBA. I suggest scrapping the third bullet
>>>> above, or clearly explaining that this would be a wrong shortcut anyway.
>
>   The authors believe that these cases will have less harmful impact in
>   practice than continuing to deny the use of words from the languages
>   for which these strings are necessary as IDN labels.
>
>>>> I believe so too, if the details can be worked out.
>
>7. Security Considerations
>
>   This modification will allow some strings to be used in Stringprep
>   contexts that are not allowed today. It is possible that differences
>   in the interpretation of the specification between old and new
>   implementations could pose a security risk, but it is difficult to
>   envision any specific instantiation of this.
>
>   Any rational attempt to compute, for instance, a hash over an
>   identifier processed by stringprep would use network order for its
>   computation, and thus be unaffected by the changes proposed here.
>
>   While it is not believed to pose a problem, if display routines had
>   been written with specific knowledge of the current Stringprep
>   prohibitions, it is possible that the possible problems noted under
>   "backwards compatibility" could cause new kinds of confusion.
>
>>>> I agree that this should not cause a problem.
>
>8. Acknowledgements
>
>   While the listed editors held the pen, this document represents the
>   joint work and conclusions of an ad hoc design team. In addition to
>   the editors this consisted of, in alphabetic order, Tina Dam, Patrik
>   Faltstrom, and John Klensin. Many further specific contributions and
>   helpful comments were received from the people listed below, and
>   others who have contributed to the development and use of the IDNA
>   protocols.
>
>   The team wishes in particular to thank Roozbeh Pournader for calling
>   its attention to the issue with the Thaana script, and Paul Hoffmann
>   for pointing out the need to be explicit about backwards
>   compatibility considerations.
>
>9. References
>
>   [RFC3454]  Hoffman, P. and M. Blanchet, "Preparation of
>              Internationalized Strings ("stringprep")", RFC 3454,
>              December 2002.
>
>   [UAX9]     0, "Unicode Standard Annex #9: The Bidirectional
>              Algorithm, revision 15", 03 2005.
>
>>>> The "0" looks really weird. Same for the date. I suggest:
>>>>[UAX9]     The Unicode Consortium, "Unicode Standard Annex #9:
>>>>           The Bidirectional Algorithm, revision 15", March 2005.
>
>
>
>Authors' Addresses
>
>   Harald Tveit Alvestrand (editor)
>   Google
>   Beddingen 10
>   Trondheim, 7014
>   Norway
>
>It would really be good to have an email address, so that even
>people who don't know your address by chance can comment.
>
>
>Regards,    Martin.
>
>#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
>#-#-#  http://www.sw.it.aoyama.ac.jp       mailto:duerst@it.aoyama.ac.jp