L2/06-056
Source: Mark Davis
Date: Feb 7, 2006
Title: Comments on nextsteps
====

The following are comments on nextsteps.

1.4.2. script

A script is a set of graphic characters used for the written form of one or more languages. This definition is the one used in [ISO10646]. Examples of scripts are Arabic, Cyrillic, Greek, Han (the so-called ideographs used in writing Chinese, Japanese, and Korean), and Latin (more properly "Roman", see below). Arabic, Greek, and Latin are, of course, also names of languages. Some issues with script identification and relationships with other standards are discussed in [ltru-registry].

>> Minor. Based on this note, one would expect to see below why "Roman" is "more proper". And it isn't really more proper; the two in the past have been essentially treated as equivalent when referring to the script. (I remember it well, since at Apple I was the one who chose to use the name Roman on the Mac when we started using identifiers for scripts.) However, at this point in time, by far the most common term used to refer to the script in a technical sense is "Latin" -- based on ISO usage. Best would be if you replaced "Roman" by "Latin" throughout when referring to the script. However, if you insist on using "Roman", you should change this line:
>>
>> > (more properly "Roman", see below),
>> => (also known as "Roman"),

> You were right the first time :-). It is my understanding that the linguistic community, and, with some very recent exceptions, the historians of writing systems have always used "Roman" for the script unless the specific alphabet used to write (classical) Latin was intended. When I first learned about typography in the 50s, it was always "Roman"; indeed there were jokes about Latin requiring only a 21 or 24 character font, leaving most of the tray empty (20 alphabetic letters, plus the forms that now appear as U+2180, plus, depending on the time at which a snapshot is taken, U+2181 - U+2183). As far as I have been able to tell, the use of "Latin" was introduced, in error, by X3L2. There were complaints about that in the 60s but the response was approximately "none of our terminology is normative in these coding standards, so don't pay any attention to it". And, from there, it moved into ISO (or at least TC97 and then JTC1) terminology. It may be a hopeless attempt to turn back the tide, but, because of the extreme sensitivity of IDN discussions to the language/script distinction (note that even "domain name" is a historical misnomer), it seems useful to return to the historical terminology whenever possible, leaving "Latin" to terminology about CCSs as absolutely necessary. As evidence of that sensitivity, some of those who will read this will recall that we have already had at least one flap about how, if the script is called "Arabic", then Arabs should have exclusive control of how that script is used in the DNS.
>
> Of course, if the IAB wants either your suggested change or the long explanation above (or both), I'll be happy to patch them in.

The advantages or disadvantages of "Roman" vs "Latin" are debatable. "Roman" is easily confused with a typeface term, for example, one that contrasts with italic. More to the point, when a statement is not needed for a particular document and is contentious, it simply takes more time to justify it than it is worth.
So my recommendation remains: if you're going to use "Roman" everywhere (despite common practice in the industry), fine; just point out that usage with a phrase like (also known as "Roman").

2.1. User conceptions, local character sets, and input issues

People use "words" when they think of things and wish others to think of them too. For example "orange", "tree", "restaurant" or "Acme Inc". Words are normally in a specific language, such as English or Swedish. The DNS, however, supports character-string labels, not "words". While it is useful, especially for mnemonic value or to identify objects, for actual words to be used as DNS labels, other constraints on the DNS make it impossible to guarantee that it will be possible to represent every word in every language as a DNS label, internationalized or not.

>> Phrasing is a bit odd. And few people think of "Acme Inc." as a word.
>>
>> It is also unclear what the point is of the sentence: "Words are normally in a specific language, such as English or". Is it saying that a given word is typically only a word in one language? E.g., "eavesdrop" is, as far as I know, only a word in English, while "chat" is a word in both French and English (for different things), and "Finger" is a word in both English and German (for the same thing). Unless the point is made clearer (and is necessary), it should be dropped to avoid confusion.
>>
>> Suggested replacement:
>> > People use ... Swedish.
>> => Written and spoken language is usually analysed into a sequence of words. For example "orange", "tree", or "restaurant". The DNS, however, supports character-string labels, not

> As Patrik has explained before, this particular section is needed to address some real issues that have been repeatedly raised with IDNs. We've added some material to put it in better context, as you and others suggested, but, unless others are concerned enough to convince the IAB, I don't think it is helpful to try to tune (and possibly weaken) the text further.

The phrasing is still awkward and misleading.

When writing or typing the label (or word), a script must be selected and a charset must be picked for use with that script. That choice of charset is typically not under the control of the user on a per word or per document basis, but may depend on local input devices, keyboard or terminal drivers, or other decisions made by operating system or even hardware designers and implementers.

>> The phrasing here is still simply wrong. There is no such process of selecting a script going on in, for example, Windows, and yet it says "must be selected". Suggested replacement:
>>
>> > When writing or typing the label... That choice
>> => When text is typed on a computer, it is represented in memory by means of a charset. That choice

> You and Patrik have discussed this before. The earlier text has already been adjusted to bring it closer to your view. At the risk of putting words in Patrik's mouth, the Windows choices of keyboards, etc. and, at least through XP, the choice of which localized version of Windows to install, are precisely about choices of scripts and, consequently, sets of available

Actually not; that's my point. Saying that "When writing or typing the label a script must be selected" is confusing three concepts: charset and keyboard and script. Saying "a script must be selected" is *neither* true from the user's perspective, nor does it at all match what I know of as the pipeline from keypress to storage of a label. While there might be some obscure operating system for which this is true, the statement is just flat-out false for most OSs. The proper wording, if more substantial changes are not made, would be "a keyboard must be selected". (Even that is quite odd, since it implies that that is done each time a user types a label.)
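To make the distinction concrete, here is a minimal Python sketch (the label and the charset list are purely illustrative): the charset determines only how the bytes are stored; no script-selection step appears anywhere, and conversion to Unicode is lossless whenever the charset can represent the text at all.

    label = "blåbär"  # a Swedish word containing U+00E5 and U+00E4

    # The same Unicode label, stored under different charsets, round-trips
    # losslessly; the charset is a storage detail, not a script choice.
    for charset in ("utf-8", "iso-8859-1", "utf-16"):
        stored = label.encode(charset)
        assert stored.decode(charset) == label

    # A charset that cannot represent the label fails loudly rather than
    # silently corrupting it:
    try:
        label.encode("ascii")
    except UnicodeEncodeError:
        pass  # expected: U+00E5 and U+00E4 are outside ASCII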
Saying "a script must be selected" is *neither* true from the user's perspective, nor does it at all match what I know of as the pipeline from keypress to storage of a label. While there might be some obscure operating system for which this is true, the statement is just flat-out false for most OSs. The proper word, if more substantial changes are not made to the wording, would be ""a keyboard must be selected". (Even that is a quite odd, since it implies that that is done each time a user types a label.) > characters. So we may need to agree that there is disagreement > on this subject. I have to rely on the IAB for guidance about > how to move forward. It does not appear to me, personally, that > further hair-splitting is useful. > > > If that charset, or the local charset being used by the relevant operating system or application software, is not Unicode, a further conversion must be performed to produce Unicode. How often this is an issue depends on estimates of how widely Unicode is deployed as the native character set for hardware, operating systems, and applications. Those estimates differ widely, with some Unicode advocates claiming that it is used in the vast majority of systems and applications today. Others are more skeptical, pointing out that: o ISO 8859 versions [ISO.8859.1992] and even national variations of ISO 646 [ISO.646.1991] are still widely used in parts of Europe; o code-table switching methods, typically based on the techniques of ISO 2022 [ISO.2022.1986] are still in general use in many parts of the world, especially in Japan with Shift-JIS and its variations; o that computing, systems, and communications in China tend to use one or more of the national "GB" standards rather than native Unicode; >> This really misses and exaggerates the point I was trying to >> make. It is not that all text is encoded in Unicode; that is >> clearly false. It is that >> (1) essentially all modern systems have the ability to handle >> Unicode, and >> (2) essentially all charsets -- especially when limited to the >> repertoire of characters in IDN -- are losslessly convertable >> to Unicode. >> > > > We understood your point. We have also received considerable > input that disagrees about the statements above. Now, I will > stipulate that the statements are almost certainly true --almost > tautologically so-- if words like "essentially" and "modern" > are used to discount any system that does not support your > predicate conditions. But that isn't interesting. > > I still disagree with John here. This passage implies that there are significant problems in mapping to Unicode in doing IDN, and there simply aren't. > > > >> ... >> >> >>> There are similar words that can be expressed in multiple >>> languages. For example the name Torbjorn in Norwegian and >>> Swedish. In Norwegian it is spelled with the character >>> U+00F8 (LATIN SMALL LETTER O WITH STROKE) in the second >>> syllable, while in Swedish it is spelled with U+00F6 >>> (LATIN SMALL LETTER O WITH DIAERESIS). Those characters >>> are not treated as equivalent according to the Unicode >>> consortium while >>> >> >> according to the Unicode Standard or such standards as ISO >> 8859-1, while >> > > > That really is not the issue. Any of us could identify a whole > series of standards and models that code them differently. > Clearly, for typographic purposes and by inspection, they are > different. 
>> Add the parallel cases:
>>
>> => As another example, "theatre" and "theater" are considered to be the same English word, spelled slightly differently. Yet here again it is neither possible nor desirable to make "re" and "er" equivalent on a global basis.

> It is actually a slightly different issue. It could go in, as could "colour" and "color" (the same problem if viewed as "spelling" but a slightly different one if viewed orthographically), but not without significantly more explanation. In particular, the Swedish-Norwegian example is a single-character match, changing between languages, while theatre/color are part of systematic differences in standard orthography within what is allegedly the same language.

Still disagree with John.

>>> ...

>> You appear not to have seen Ken's email on the situation with Latin vs Han unification. There were *not* different models used for the "unification" of Han and the "disunification" of Latin and Greek. There is also a continued fuzziness over character vs glyph in this paragraph.

> I've seen it. I would also encourage you to reread the subsection titled "Unification" starting on p19 of _Unicode 4.0_. It is, IMO, a masterpiece of rationalization:
>
> Paragraph 1: "The Unicode Standard avoids duplicate encoding..."
> Paragraph 2: Again, "The Unicode Standard avoids duplication..."
> Paragraph 3: "There are a few notable instances ... where visual ambiguity between different characters is tolerated... not unified because they are characters from different scripts... legacy character encodings that distinguish them... Unifying these uppercase characters... unnecessary complications"
> Paragraph 6: "There are many characters ... that could have been unified with existing visually similar... or that could have been omitted in favor of some other Unicode mechanism..."
>
> And so on. You can quibble with the word "model" or exactly how the situation is described. It remains, however, that there are communities who claim that Kanji and Han-as-used-in-Chinese are different scripts since each one, in practice, contains characters not present in the other. The Unicode Consortium has rejected those claims (and I have no basis to disagree with your conclusion) but the decision is the result of evaluating a series of tradeoffs with usage and legacy standards (as the Standard itself seems to make quite clear). However, if one applies the same logic that is used to reject the Kanji-Han distinction to Greek-Roman-Cyrillic, ignoring legacy standards and perhaps case compatibility and other issues, one would end up with a unified GRC script.
> But, for better or worse, there was more legacy usage of Western European characters and standards when the ISO 646 BV -> 8859.n -> 10646 evolution began than there was of unified CJK[V]. What that section of the Standard essentially says to me is that there were many tradeoffs and choice points and that reasonable and rational decisions were made on a case by case basis. I think that is fine -- it is certainly consistent with my understanding. And, as you have presumably heard me say many times before, I think that, on balance, the right decisions were mostly made... and that applying any different set of tradeoffs would have just led to a different mix of difficulties.
>
> But that section does not say "single set of clear rules, consistently applied". It says "different situations considered in context, and tradeoffs among several competing principles were applied to get to the decisions". To say, at least in my vocabulary of what a "model" means, that you had one model and applied it consistently would permit you to rewrite that section so it didn't contain a series of "tolerated differences", things that could have been done one way or the other with no clear decision rule, and dependency on how a script is defined.

>> Suggested change:
>>
>> Fundamental to any character encoding standard is the process of determining the abstract characters to be encoded. The original goal in Unicode was to associate a single numeric value, called a code point, with each abstract character. For various reasons, primarily compatibility with other charsets to allow for lossless conversion, this goal was not always reached. In particular, there are a number of alternative ways to encode the same content, such as:
>> - some text can either be represented as a single code point (such as an A-umlaut or a Hangul Syllable character), or by a sequence of multiple code points (such as an 'a' followed by a non-spacing umlaut, or a sequence of Hangul Jamo).
>> - the half-width and full-width variants of characters from East Asian sets, such as katakana.
>> - the positional variants (initial, medial, final, and isolated) and ligatures of Arabic characters, originally for use by printers.
>> - a large number of characters for "mathematical" use, even though their representative glyphs are within the accepted range of glyph variation for the customary characters.

> I would encourage you to incorporate this text into the Unification section of the 5.0 book.

We actually have a great deal of text explaining this already -- so it is unfortunate that nextsteps is so misleading.

>>> character-based normalization tables can provide.
>>>
>>> If we leave Roman-based scripts and examine those based on Chinese characters, we see there is also an absence of specific, lexigraphic, rules for transformations between Traditional and Simplified Chinese. Even if there were such rules, unification of Japanese and Korean characters with Chinese ones would make it impossible to normalize Traditional Chinese into Simplified Chinese ones without causing problems in Japanese and Korean use of the same characters.

>> Delete the above paragraph. As per other email, the transformations between Traditional and Simplified Chinese are more akin in many ways to transforming o-stroke and o-umlaut, or even "theater" to "theatre", since there are also vocabulary changes.
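(The alternative encodings itemized in the suggested text above are exactly what the Unicode normalization forms fold together: NFC composes canonical duplicates, and NFKC additionally folds width, positional, and mathematical variants. A brief illustration with Python's standard library; the specific characters are just examples:)

    import unicodedata

    # Canonical equivalence: 'a' + non-spacing umlaut composes to a-umlaut.
    assert unicodedata.normalize("NFC", "a\u0308") == "\u00e4"

    # Compatibility equivalence (NFKC): width, positional, and mathematical
    # variants fold to the customary characters.
    assert unicodedata.normalize("NFKC", "\uff76") == "\u30ab"    # halfwidth katakana KA -> KA
    assert unicodedata.normalize("NFKC", "\ufe8d") == "\u0627"    # Arabic ALEF, isolated form -> ALEF
    assert unicodedata.normalize("NFKC", "\U0001d400") == "A"     # mathematical bold capital A -> A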
> The difference, at least so far, is that _in an IDN context_ the CJK community has chosen to deal with the SC<->TC differences by the use of variant structures, while, e.g., the British and American communities have chosen to deal with "theatre" and "theater" by either ignoring the issue or using post-registration dispute resolution procedures. If this document were being written about general use of character strings, its content might be quite different. But we need to recognize both the constraints of the DNS and what perceived solutions have been (or are being) used to cope with the perceived problems.

We need some text here to respond to John.

>>> Any canonicalization operation that depends on more than short sequences of text is not possible to do without context. DNS lookups and many other operations do not have a way to capture and utilize the language or other information that would be needed to provide that context.

>> This paragraph assumes that language-based canonicalization is a good thing; the document has not made the case for that being so. *Especially* after the document above says: "It is neither possible nor desirable to make these characters equivalent on a global basis." Remove it.

> I'll leave this one to the IAB. I read it as a simple statement of fact that should be present because --again, in the IDN and DNS context-- there have been proposals to do just this. You see it as an endorsement for such canonicalization.

>>> The difficulty, in short, is that two Unicode strings that are actually different might look exactly the same, especially when there is no time to study them. This is because, for example, some glyphs in Cyrillic, Greek and Latin do look the same, but have been assigned different codepoints in Unicode. Worse, one needs to be reasonably familiar with a script and how it is used to understand how much characters can reasonably vary as the result of artistic fonts and typography. For example, there are a few fonts for Latin characters that are sufficiently highly ornamented that an observer might easily confuse some of the characters with characters in Thai script.

>> As said before, this last is completely bogus. If you are going to state it, then at least give the reader the proper context, by adding:
>>
>> => Of course, with the right font, any Latin characters can be made to look like any other Latin characters as well.

> A font designer who made, e.g., "X" look like "O" --an equivalence your proposed addition claims would be reasonable-- would, IMO, rapidly be in search of a new occupation. That is not the case, e.g., for the type of confusion between, e.g., ASCII/Roman "U" and Thai characters that look like it with or without various loops and descenders, both of which are used to decorate Roman characters in ornamented fonts.

Needs response to John.

>> ...

>>> As Unicode and conventions for handling so-called bidirectional ("BIDI") strings evolve, the prohibition in IDNA should be reviewed and reevaluated.

>> As Michel has written, this last sentence is misleading. I'd suggest deleting, but he may have some replacement language.

> We have already explained why the existing text is relatively neutral and have asked for proposed alternate text that preserves the general concept. None has been forthcoming so far.

Michel, perhaps you can make a suggestion.
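(On the look-alike passage above: the confusion between Cyrillic, Greek, and Latin letters is entirely visual. At the codepoint level the strings are distinct, and normalization does not relate them, which is why the problem cannot be caught by simple string comparison. A small Python illustration, using a hypothetical mixed-script label:)

    import unicodedata

    latin = "paypal"
    mixed = "p\u0430yp\u0430l"  # U+0430 CYRILLIC SMALL LETTER A in place of 'a'

    print(unicodedata.name("\u0430"))  # -> CYRILLIC SMALL LETTER A
    assert mixed != latin              # the codepoints differ...
    assert unicodedata.normalize("NFKC", mixed) != latin  # ...and normalization does not unify them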
>>> 2.2.8. Versions of Unicode
>>>
>>> While opinions differ about how important the issues are in practice, the use of Unicode and its supporting tables to support IDNs appears to be far more sensitive to subtle changes than typical Unicode applications. This may be, at least in part, because many other applications are internally sensitive only to the appearance of characters ...

>> It is really unclear what is meant by these "subtle changes". Please provide an example.

> We have toned this down about as far as the IETF seems willing to go and "subtle" was a concession. The bottom line is that there is a difference in philosophy and interpretation. In your view, as I understand it, a few small changes (such as the notorious seven character corrections) are really irrelevant if those changes are unlikely to impact normal text or normal use of the characters in context. In our view, the ways in which the DNS (and, incidentally, X.509 identifiers) are used in practice make these changes like any other changes, with comments about normal-text context and usage being irrelevant. So, to at least some of us, when you say "no changes have been made" (whether you put "significant", "in practice", or their equivalent into that sentence or not), the immediate question is, roughly, "and when did 7 start being == 0?"

>> This implies that there have been changes in code point assignment or definition, which is false, and against the Unicode policies.

> See above.

>> This whole section is much more nuanced now. But I think it still doesn't make it clear to users that the changes in normalization, in practice, will have no effect on the usage of IDNs. So for accuracy, you should make the following replacement, or something like it.

> See my previous notes (and I think Patrik's) on the use of "in practice" vis-a-vis the DNS.

>>> The IAB has concluded that there is a consensus within the broader community that lists of codepoints should be specified by the use of an inclusion based mechanism (i.e., identifying the characters that are permitted), rather than by excluding a small number of characters from the total Unicode set as Stringprep and Nameprep do today.

>> Note: the consortium is on record as supporting this, so it is probably worth adding. Add:
>> > For example, the Unicode consortium recommends a significant restriction of characters, to be inclusion-based: a profile of those recommended for identifiers. See [UAX31] and [UTS36].

> IAB instructions needed on this.

>>> In this and other issues associated with IDNs, precise use of terminology is important lest even more confusion result. The definition of the term 'homograph' that normally appears in dictionaries and linguistic texts states that homographs are different words which are spelled identically (for example, the adjective 'brief' meaning short, the noun 'brief' meaning a document, and the verb 'brief' meaning to inform). By definition, letters in

>> A better example would be words that are etymologically unconnected.

> Perhaps. But it would make the text even more complicated and we haven't even started talking about etymological distinctions. Moreover, we are writing this in English, which is an unfortunate choice. Having had to witness a few arguments among experts about words that appear to be unconnected because one derived from Latin-French roots and the other from Germanic ones but that arguably trace to a common proto-IndoEuropean origin, I don't see much advantage to opening this can of worms.
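(As a concrete instance of the inclusion-based approach endorsed above: the identifier profile of [UAX31] admits characters explicitly, by property, rather than excluding a short list. Python adopts that profile for its own source-code identifiers, so the behavior can be illustrated directly; the sample strings are arbitrary:)

    # Letters are admitted by inclusion (the XID_Start/XID_Continue
    # properties); punctuation and symbols were simply never admitted.
    print("b\u00fccher".isidentifier())  # True:  ü is an included letter
    print("b-cher".isidentifier())       # False: '-' is not included
    print("\u2603b".isidentifier())      # False: U+2603 SNOWMAN is not included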
>>> o Finally, if IDN labels are to be placed in the root zone, there are issues associated with how they are to be encoded and deployed. This area may have implications for work that has been done, or should be done, in the IETF.

>> The document should clarify what the "encoding issues" are; why one would consider an alternative to the stringprep/punycode approach used for IDN elsewhere.

> That isn't the issue and one should not consider such things. But this is a different can of worms. It could be opened, but only at the expense of another page or two of text. I'd prefer to get this out rather than go there, but it is up to the IAB.

>>> 4.1.2. Elimination of word-separation punctuation
>>>
>>> The inclusion of the hyphen in the original hostname rules is a historical artifact from an older, flat, name space. The community should consider whether it is appropriate to treat it as a simple legacy property of ASCII names and not attempt to ...

>> While this is really not a Unicode issue, it is unclear why this is called out. There are very few such characters, and a case should be made for why they are a problem. At the very least, the authors should make clear which ones they consider problematic, rather than making this blanket statement.

> The problem is actually a little different. Accepting your statement that there are very few such characters: as soon as one tries to institutionalize --across scripts-- the notion of a full or partial word-separator, everyone (or some fairly large group) seems to want one. That, in turn, leads to arguments about non-breaking white spaces and other sorts of tricky things.
>
> Better text would be welcomed, although it won't make -02.

>>> 4.3. Combining Characters and Character Components
>>>
>>> One thing that increases IDNA complexity and the need for normalization is that combining characters are permitted. Without them, complexity might be reduced enough to permit more easy transitions to new versions. The community should consider whether combining characters should be prohibited entirely from IDNs. A consequence of this, of course, is that each new language or script would require that all of its characters have Unicode assignments to ...

>> This paragraph, as noted before, simply needs to be removed. Removing all combining marks from many languages would be like eliminating all vowels from English. While in theory it would be possible to encode all Latin consonant-plus-vowel combinations, it simply is not going to happen in Unicode.

> This has been discussed before. There is no disagreement that removing all of the symbols now formed using combining marks would be severely problematic (without making a judgment as to whether or not your analogy is correct, although I think it is). However, I don't know how to understand "it simply is not going to happen in Unicode".
> * If it means "I've been around the Unicode Consortium for a long time and am currently its President, and it is my opinion that you would never get consensus for such a change", then that is a reasonable statement, but the observations above are proper. Someone who believed that the best solution for some problem was precisely to encode all of those combinations with "Latin" characters would be free to try to propose it and make a persuasive case, either to UTC or to SC2. Perhaps you might be wrong: while it would be very upsetting to the Standard to do something like that, it ought to be possible to ask the question and, if the argument is strong enough to overcome all sorts of other factors, to make the change.
>
> * If it means "No matter how much consensus is gathered and what arguments are made, it won't happen because I and some group of others will block it and have enough power to succeed", then that would immediately redefine UTC from a consensus standards-producing consortium into some other sort of creature. The implications of that are best discussed with lawyers and I don't even want to speculate. But, if that were the case, then the paragraph in the text should not be removed but should be modified to say that the proposal should not be considered within the Unicode context because the Unicode Consortium has de facto rules that would make it impossible for the proposal to even be considered. Of course, from an IETF standpoint, that sort of situation might call for revisiting the decision to use Unicode at all.
>
> Fortunately, I don't believe the second is either what you intended or the case, so we don't need to deal with the implications or contingencies that would apply if it were. But I can't see the justification for removing the comments just because you predict that a proposal derived from them would not be approved.

See my response already sent to this list.
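(On the combining-character question above: many combining-mark sequences have no precomposed equivalents at all, so prohibiting combining characters would not merely force a different spelling, it would exclude such text outright. A minimal Python check; the character choice is illustrative:)

    import unicodedata

    # "q" + U+0308 COMBINING DIAERESIS has no precomposed codepoint, so even
    # NFC must leave it as two codepoints; there is nothing to substitute
    # for the combining mark.
    s = "q\u0308"
    assert unicodedata.normalize("NFC", s) == s
    assert len(unicodedata.normalize("NFC", s)) == 2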