L2/06-362

Source: Martin Duerst
Date: 2006-10-23
To: unicore@unicode.org
Subject: IDNA

Dear Unicorers,

Below please find a copy of my comments on draft-alvestrand-idna-bidi-00.txt
that I sent directly to the authors.

I totally agree with Mati that the solution goes too far; indeed I found
some examples that produce positively bad results. I think Mati's
proposed text looks good; it may be possible to be a bit more liberal,
but any such step would need extremely careful scrutiny and lots
of examples.

Regards,    Martin.

>Date: Tue, 24 Oct 2006 14:31:01 +0900
>To: Harald Tveit Alvestrand <harald@alvestrand.no>, ck@nrm.museum
>From: Martin Duerst <duerst@it.aoyama.ac.jp>
>Subject: draft-alvestrand-idna-bidi-00.txt
>Cc: suignard
>
>Hello Harald, Cary,
>
>Here are some comments on draft-alvestrand-idna-bidi-00.txt.
>Is there a mailing list for this? If yes, please tell me and
>forward these comments there. Comments are labeled by >>>.
>
>
>                An IDNA problem in right-to-left scripts
>                     draft-alvestrand-idna-bidi-00
>
>Abstract
>
>   The use of right-to-left scripts in internationalized domain names
>   has presented several challenges.  This memo discusses one problem
>   resulting from a constraint on the use of combining characters at the
>   end of an RTL domain label, resulting in some words being declared
>   invalid as IDN labels, and proposes a means for ameliorating this
>   problem.
>
>
>1.  Introduction and problem description
>
>   The IDNA specification "Stringprep", [RFC3454] makes the following
>   statement in its section 6 on the bidi algorithm, :
>
>      3) If a string contains any RandALCat character, a RandALCat
>      character MUST be the first character of the string, and a
>      RandALCat character MUST be the last character of the string.
>
>   (A RandAlCat character is a character with unambiguously right-to-
>   left directionality.)
>
>   The reasoning behind this prohibition is to ensure that every
>   component of a visually presented domain name has an unambiguously
>   preferred direction.
>
>>>> Not exactly, although that's a way to explain it. One even more
>>>> important (because more confusing if not attained) purpose is
>>>> to make sure that the characters of each label stay together,
>>>> and the dots clearly visually separate the labels.
>
>                         However, this makes certain words in languages
>   written with right-to-left scripts invalid as IDN labels, and in at
>   least one case means that all the words of an entire language are
>   forbidden as IDN labels.
>
>   This will be illustrated below with examples taken from the Dhivehi
>   and Yiddish languages, as written with the Thaana and Hebrew scripts,
>   respectively.
>
>>>> I think this is a very valuable observation, and shows that there
>>>> was an oversight in the original design. BTW, exactly the same
>>>> restrictions also apply to IRI components (see RFC 3987
>>>> (http://www.ietf.org/rfc/rfc3987.txt), Section 4.2), although
>>>> they are SHOULDs there, rather than MUSTs (mainly to take into
>>>> account the fact that URIs/IRIs also serve to carry data
>>>> (especially in the query part), where a MUST restriction might
>>>> be fatal to functionality).
>>>> [I have cc'ed Michel Suignard, co-author of RFC 3987, because
>>>>  I think that eventually, we should coordinate our changes and
>>>>  integrate them into an update of RFC 3987.]
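
The requirement quoted above from RFC 3454, section 6, can be sketched
as follows. This is only a minimal illustration, not the normative
check: it assumes that the Unicode 3.2 bidi classes "R" and "AL" stand
in for the RandALCat tables of RFC 3454, and the function names are
hypothetical.

```python
import unicodedata

# Unicode 3.2 character data, the version RFC 3454 is tied to.
ucd = unicodedata.ucd_3_2_0


def is_randalcat(ch):
    """Assumption: bidi class R or AL approximates the RandALCat tables."""
    return ucd.bidirectional(ch) in ("R", "AL")


def passes_rule3(label):
    """Requirement 3: if any RandALCat character is present, the first
    and the last character of the string must both be RandALCat."""
    if not any(is_randalcat(ch) for ch in label):
        return True
    return is_randalcat(label[0]) and is_randalcat(label[-1])


ALEF, BET, QAMATS = "\u05d0", "\u05d1", "\u05b8"
print(passes_rule3(BET + ALEF))           # True: starts and ends with R
print(passes_rule3(BET + ALEF + QAMATS))  # False: last character is NSM
```

A label that ends in a combining vowel mark fails the check even
though every base letter in it is right-to-left.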
>
>   The problem may be addressed by more carefully considering the bidi
>   algorithm in Unicode Standard Annex #9 [UAX9] which states in section
>   3.3.3 W1: "Examine each non-spacing mark (NSM) in the level run, and
>   change the type of the NSM to the type of the previous character."
>   ("Previous" as used here refers to the sequence of Unicode characters
>   in a data stream, and is not related to the positions of the
>   characters when displayed.)
>
>>>> This is a very good observation. Whose idea was it?
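
Rule W1 as quoted above can be sketched in a few lines. This is a
rough illustration only: the function name is hypothetical, and the
"sos" parameter (the type assumed before the first character) is an
assumption of this sketch, set to "R" for a right-to-left label.

```python
import unicodedata

# Unicode 3.2 character data, the version RFC 3454 is tied to.
ucd = unicodedata.ucd_3_2_0


def resolve_nsm(label, sos="R"):
    """Apply W1 only: each NSM takes the type of the previous character
    in network order (or the assumed start-of-run type at the start)."""
    resolved = []
    prev = sos
    for ch in label:
        bidi_type = ucd.bidirectional(ch)
        if bidi_type == "NSM":
            bidi_type = prev
        resolved.append(bidi_type)
        prev = bidi_type
    return resolved


# BET, ALEF, QAMATS: the trailing NSM resolves to R.
print(resolve_nsm("\u05d1\u05d0\u05b8"))  # ['R', 'R', 'R']
```

After this step the last character of the label counts as "R", so a
first/last-character test applied to the resolved types would accept it.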
>
>   A note on terminology:
>
>   In this memo, we use "network order" to describe the sequence of
>   characters as transmitted on the wire or stored in a file; the terms
>   "first", "next" and "previous" are used to refer to the relationship
>   of characters in network order.
>
>   We use "display order" to talk about the sequence of characters as
>   imaged on a display medium; the terms "left" and "right" are used to
>   refer to the relationship of characters in display order.
>
>>>> This explanation should come earlier, then you wouldn't need
>>>> the sentence explaining "previous" two paragraphs earlier.
>
>
>2.  Detailed examples
>
>
>2.2.  Yiddish
>
>   The directionality class of U+05B8 HEBREW POINT QAMATS in the Unicode
>   database is NSM, which again causes the IDNA algorithm to reject the
>   string.  (It may also be noted that the requisite combined characters
>   also exist in precomposed form at separate positions in the Unicode
>   chart.  However, Stringprep also rejects those codepoints, for
>   reasons not discussed here.)
>
>>>> This was mostly done because of feedback from Israel, which said
>>>> that allowing Hebrew vowels would cause problems because:
>>>> 1) Not supported in many applications/encodings
>>>> 2) Creates multiple variants of what's essentially the same thing
>>>>    (one without vowel marks, some with vowel marks), which may
>>>>    increase registration costs or lead to spoofing.
>>>> It's probably difficult to strike a balance between Hebrew and
>>>> Yiddish.
>
>>>> If possible, I suggest also finding some Arabic examples, or
>>>> explaining why you think there is no problem for Arabic.
>
>
>3.  Modification to RFC 3454
>
>   If the following modification is made to RFC 3454, we believe that
>   the usefulness of the specification for languages written with right-
>   to-left scripts will be significantly improved:
>
>   Old text:
>
>      [Unicode3.2] defines several bidirectional categories; each
>      character has one bidirectional category assigned to it.  For the
>      purposes of the requirements below, an "RandALCat character" is a
>      character that has Unicode bidirectional categories "R" or "AL";
>      an "LCat character" is a character that has Unicode bidirectional
>      category "L".
>
>   New text:
>
>      [Unicode3.2] defines several bidirectional categories; each
>      character has one bidirectional category assigned to it.
>
>      For characters that have category "R", "AL" or "L", the category
>      is fixed (UAX#9 defines them as having "strong" category); for
>      characters in category EN, ES, ET, AN, CS, NSM, BN, B, S, WS and
>      ON, the category is determined by applying the algorithm described
>      in UAX#9 section 3.3 to the string.
>
>      For the purposes of the requirements below, an "RandALCat
>      character" is a character that, after this determination, has
>      Unicode bidirectional categories "R" or "AL"; an "LCat character"
>      is a character that has Unicode bidirectional category "L".
>
>>>> I think the details here have to be examined very carefully.
>>>> Saying that NSM following R or AL is included in RandALCat makes
>>>> a lot of sense. But allowing numbers (EN, AN) near the boundary
>>>> will mean that weird things will start to happen. In particular,
>>>> it will mean that when displaying a full domain name (several
>>>> labels and their separating dots), parts of labels can get
>>>> separated. Consider the following example (upper case is RTL):
>>>>
>>>> logical/network order:            display (LTR context):
>>>> abc.def3.4GHI.JKL.mno.pqr         abc.def3.4LKJ.IHG.mno.pqr
>>>>
>>>> The '4' above suddenly ends up on the wrong label. Similar
>>>> problems will happen with RTL display contexts.
>>>> So while I agree with the general direction, the details have to
>>>> be considered much more carefully, with a lot of examples, and
>>>> with people who have a lot of experience.
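
The example above can be reproduced with a toy subset of UAX#9. The
sketch below is not a conforming implementation: it assumes an LTR
paragraph, uses the example's own convention (lower case is L, upper
case is R, digits are EN, "." is CS), and applies only rules W4, W7,
N1/N2, I1 and L2. All names in it are hypothetical.

```python
def toy_class(ch):
    """Toy bidi classes following the upper-case-is-RTL convention."""
    if ch.isdigit():
        return "EN"
    if ch == ".":
        return "CS"
    return "R" if ch.isupper() else "L"


def display_ltr(text):
    """Return the display order of text in an LTR context (toy subset)."""
    types = [toy_class(ch) for ch in text]
    n = len(types)
    # W4: a single CS between two ENs becomes EN.
    for i in range(1, n - 1):
        if types[i] == "CS" and types[i - 1] == "EN" and types[i + 1] == "EN":
            types[i] = "EN"
    # W7: EN becomes L if the last strong type (or start of text) is L.
    strong = "L"
    for i, t in enumerate(types):
        if t in ("L", "R"):
            strong = t
        elif t == "EN" and strong == "L":
            types[i] = "L"
    # N1/N2: remaining separators take the direction of their neighbours
    # if both sides agree (EN counts as R), else the embedding direction L.
    i = 0
    while i < n:
        if types[i] != "CS":
            i += 1
            continue
        j = i
        while j < n and types[j] == "CS":
            j += 1
        before = "L" if i == 0 or types[i - 1] == "L" else "R"
        after = "L" if j == n or types[j] == "L" else "R"
        types[i:j] = [before if before == after else "L"] * (j - i)
        i = j
    # I1: at embedding level 0, R gets level 1 and EN gets level 2.
    level_of = {"L": 0, "R": 1, "EN": 2}
    cells = [(level_of[t], ch) for t, ch in zip(types, text)]
    # L2: reverse runs at each level, from the highest level down.
    for level in (2, 1):
        i = 0
        while i < n:
            if cells[i][0] < level:
                i += 1
                continue
            j = i
            while j < n and cells[j][0] >= level:
                j += 1
            cells[i:j] = cells[i:j][::-1]
            i = j
    return "".join(ch for _, ch in cells)


print(display_ltr("abc.def3.4GHI.JKL.mno.pqr"))  # abc.def3.4LKJ.IHG.mno.pqr
```

In this sketch the dot between "3" and "4" is absorbed into the number
by W4, and W7 then attaches the whole "3.4" to the preceding LTR text,
which is why the "4" is displayed away from the rest of its label.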
>
>   Note that Unicode 5.0 is the current version of Unicode.  This fix
>   refers to Unicode 3.2 only, to maintain consistency with the rest of
>   RFC 3454; nothing here should affect the relationship between Unicode
>   versions and IDNA.
>
>>>> It is unclear whether this means "we need to do this on Unicode 3.2
>>>> because IDNA is Unicode 3.2" or "this is independent of Unicode
>>>> version, and IDNA may at some point be upgraded, but because it is
>>>> currently Unicode 3.2, our wording is for Unicode 3.2 also".
>>>> I hope it's the latter, and I hope it can be made clearer.
>
>   Also, as noted in the introduction, the Unicode UAX#9 algorithm is
>   quite complex.  For the purposes of IDNA, a simpler algorithm may be
>   defind that yields the same result within the constraints of this
>   context, but may be easier for people to implement consistently.
>   Such an algorithm may be included in later versions of this memo.
>
>>>> defind -> defined
>
>4.  Other issues in need of resolution
>
>   Another set of issues concerns the proper display of IDNs with a
>   mixture of LTR and RTL labels, or only RTL labels; it is not clear to
>   these authors what the proper display order of the components of a
>   domain name are if the directiion of the components (in network
>   order) is, for instance, FirstRTL.SecondRTL.LTR - is it
>   LTRtsriF.LTRdnoceS.LTR or LTRdnoceS.LTRtsrif.LTR?  Again, this memo
>   does not attempt to suggest a solution to this problem.
>
>>>> RFC 3987 clearly specifies in Section 4.1 that the display should
>>>> occur in an LTR context, which means that "LTRdnoceS.LTRtsrif.LTR"
>>>> is correct. In an RTL context, one would get "LTR.LTRtsriF.LTRdnoceS".
>>>> "LTRtsriF.LTRdnoceS.LTR" could only be obtained with the Unicode
>>>> algorithm if additional control characters would be inserted.
>>>> In practice, I'm expecting that RTL context (i.e. "LTRtsriF.LTRdnoceS.LTR")
>>>> will also be used to some extent, but that the distinction usually
>>>> will be possible from context, mainly because the set of customary
>>>> labels for TLDs and for "Bottom-Level Domain" (e.g. www, ftp,...)
>>>> is disjoint.
>>>> [The use of LTR/RTL in the example itself is confusing, as you
>>>> may have observed yourself, because on display, the distinction is lost.]
>
>5.  Backwards compatibility considerations
>
>   As with any change to an existing standard, it is important to
>   consider what happens with existing implementations when the change
>   is introduced.  The following troublesome cases have been noted:
>
>   o  Old program used to input the newly allowed string.  If the old
>      program checks the input against RFC 3454, the string will not be
>      allowed, and that domain name will remain inaccessible.
>
>   o  Old program is asked to display the newly allowed string, and
>      checks it against RFC 3454 before displaying.  The program will
>      perform some kind of fallback, most likely displaying the Punycode
>      form of the string.
>
>>>> To my knowledge, nameprep and similar things are still rather
>>>> spottily or not at all implemented in browsers that support IDN.
>>>> But that may have changed, or may change soon.
>
>   o  Old program tries to display the newly allowed string.  If the old
>      program has code for displaying the last character of a string
>      that is different from the code used to display the characters in
>      the middle of the string, display may be inconsistent and cause
>      confusion.
>
>>>> This is a rather contorted case. It could show up as the result
>>>> of a weird bug, but I wouldn't expect it otherwise.
>
>   One particular example of the last case is if a program chooses to
>   examine the last character (in network order) of a string in order to
>   determine its directionality, rather than its first; if it finds an
>   NSM character and tries to display the string as if it was a left-to-
>   right string, the resulting display may be interesting, but not
>   useful.
>
>>>> Have you actually seen such a thing in the wild? It would work under
>>>> the assumption that in order to determine how to display a label,
>>>> only the overall directionality of the label is relevant. This is
>>>> utterly wrong. A label ABC123DEF (upper case is RTL) has to be
>>>> displayed as FED123CBA. I suggest scrapping the third bullet
>>>> above, or clearly explaining that this would be a wrong shortcut anyway.
>
>   The authors believe that these cases will have less harmful impact in
>   practice than continuing to deny the use of words from the languages
>   for which these strings are necessary as IDN labels.
>
>>>> I believe so too, if the details can be worked out.
>
>7.  Security Considerations
>
>   This modification will allow some strings to be used in Stringprep
>   contexts that are not allowed today.  It is possible that differences
>   in the interpretation of the specification between old and new
>   implementations could pose a security risk, but it is difficult to
>   envision any specific instantiation of this.
>
>   Any rational attempt to compute, for instance, a hash over an
>   identifier processed by stringprep would use network order for its
>   computation, and thus be unaffected by the changes proposed here.
>
>   While it is not believed to pose a problem, if display routines had
>   been written with specific knowledge of the current Stringprep
>   prohibitions, it is possible that the possible problems noted under
>   "backwards compatibility" could cause new kinds of confusion.
>
>>>> I agree that this should not cause a problem.
>
>8.  Acknowledgements
>
>   While the listed editors held the pen, this document represents the
>   joint work and conclusions of an ad hoc design team.  In addition to
>   the editors this consisted of, in alphabetic order, Tina Dam, Patrik
>   Faltstrom, and John Klensin.  Many further specific contributions and
>   helpful comments were received from the people listed below, and
>   others who have contributed to the development and use of the IDNA
>   protocols.
>
>   The team wishes in particular to thank Roozbeh Pournader for calling
>   its attention to the issue with the Thaana script, and Paul Hoffmann
>   for pointing out the need to be explicit about backwards
>   compatibility considerations.
>
>9.  References
>
>   [RFC3454]  Hoffman, P. and M. Blanchet, "Preparation of
>              Internationalized Strings ("stringprep")", RFC 3454,
>              December 2002.
>
>   [UAX9]     0, "Unicode Standard Annex #9: The Bidirectional
>              Algorithm, revision 15", 03 2005.
>
>>>> The "0" looks really weird. Same for the date. I suggest:
>>>>[UAX9]     The Unicode Consortium, "Unicode Standard Annex #9:
>>>>           The Bidirectional Algorithm, revision 15", March 2005.
>
>
>
>Authors' Addresses
>
>   Harald Tveit Alvestrand (editor)
>   Google
>   Beddingen 10
>   Trondheim,   7014
>   Norway
>
>>>> It would really be good to have an email address, so that even
>>>> people who don't happen to know your address can comment.
>
>
>Regards,     Martin.
>
>#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
>#-#-#  http://www.sw.it.aoyama.ac.jp       mailto:duerst@it.aoyama.ac.jp