L2/06-162

Date: Fri, 5 May 2006 18:28:11 -0700 (PDT)
From: Kenneth Whistler
Subject: Issue about Named Character Sequences

An issue came up in WG2 regarding named character sequences that the UTC needs to consider and decide on.

The issue is this: It is possible to have multiple different combining character sequences that are canonically equivalent, which should render the same, and which to an end user represent the "same" character.

As it stands now, UAX #34, Unicode Named Character Sequences, spells out the definition and syntax for specifying named character sequences, but is silent on the issue of how canonically equivalent sequences should be handled in that context. The namespace uniqueness requirements for character names, formal name aliases, and named character sequences would prevent the option of having one particular name being reused for canonically equivalent sequences, unless that is defined in some special way to prevent name clashes.

This issue came up because the Finnish NB requested a list of named character sequences (named USI's) in 10646 for multiply-accented Lithuanian characters. This is a *good* thing, by the way, because it heads off the periodic requests for encoding of precomposed characters in the standard, which is the natural alternative that some NB's might otherwise continue to pursue.

Asmus has posted up the exact list in question in a separate document, as an update to NamedSequencesProv.txt for the Unicode Character Database.

To make the issue clear, I'll just pick a single, doubly-accented character from that list. One of the Lithuanian characters requested is:

   LATIN SMALL LETTER A WITH OGONEK AND ACUTE

That isn't encoded as a precomposed character in Unicode. The requested character sequence to be named is:

   <0105, 0301> LATIN SMALL LETTER A WITH OGONEK AND ACUTE

But, of course, there are *4* possible sequences to represent this, all canonically equivalent:

   1. <0105, 0301> = NFC
   2. <00E1, 0328>
   3. <0061, 0328, 0301> = NFD
   4. <0061, 0301, 0328>

The four sequences are canonically equivalent, but only one of them (#1) is normalized in NFC and only one of them (#3) is normalized in NFD.

Now it happens that the Finnish NB request is for the first of these as the named sequence, and that is not only in NFC, but also corresponds to the Lithuanian user's perception of what the units are. Lithuanian stress accents are represented with grave, tilde and acute diacritics -- the base vowels include the letters with ogoneks. So sequences #2 and #4 don't make much sense in Lithuanian.

But nothing in the rules for named character sequences currently would prevent someone from requesting #2, #3, or #4 (or all of them) to be given names, and the rules for naming now would require that any such requests come up with distinct names. So in principle you could get (making this up as I go along):

   1. <0105, 0301> LATIN SMALL LETTER A WITH OGONEK AND ACUTE
   2. <00E1, 0328> LATIN SMALL LETTER A WITH ACUTE AND OGONEK
   3. <0061, 0328, 0301> LATIN SMALL LETTER A WITH XXX OGONEK AND ACUTE
   4. <0061, 0301, 0328> LATIN SMALL LETTER A WITH XXX ACUTE AND OGONEK

(where I've tossed in the "XXX" just as an arbitrary distinction, because we don't have a naming convention established for this kind of distinction yet).

This would comply with the letter of the UAX #34 rules and the requirements for namespace uniqueness currently in effect. However, it is hardly a desirable outcome, because it would (in principle) result in four *different* names for what are canonically equivalent sequences indistinguishable to an end user.

Here are the alternatives as I see them.

A. Status Quo

No restrictions on standardizing distinct named character sequences for canonically equivalent sequences. Require that each such named sequence have a distinct name.

B. Restrict to NFC

Require that *only* sequences in NFC be standardized as named character sequences.
This would automatically prevent the need to clone names for canonically equivalent sequences.

C. Restrict to any one equivalent sequence

Require that only one of any set of canonically equivalent sequences be standardized, but not require that that one necessarily be NFC. This option would also automatically prevent the need to clone names for canonically equivalent sequences.

D. Standardize *all* canonically equivalent sequences as a set

Modify the syntax of NamedSequences.txt (if necessary), and verify that when a named sequence is added, *all* canonically equivalent sequences are associated with the same name. This option also automatically prevents name cloning.

E. Allow identical names for canonical equivalents

This option would relax the namespace uniqueness requirement in that it would stipulate that any canonically equivalent sequence could be made a named sequence, but it would be required to have the same name as a prior standardized named character sequence. This option explicitly allows *identical* names for canonically equivalent sequences, and would prohibit the kind of distinct naming listed above for #1 - #4.

=======================================================

Recommendation:

My opinion here is that Option B is probably the best choice and the easiest to maintain. It would require no change in the current mechanism or implementations.

However, if chosen, I would also recommend that in all known cases where a named character sequence has one or more canonically equivalent sequences, the *entire* set be added to NamedSequences.txt in some form of a comment. This would make it clear to readers what the intended range of representation and designation for the named character sequence is, and would help prevent requests coming in to name alternate sequences that are equivalent.
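As a rough illustration of what an Option B check could look like, here is a minimal Python sketch using the standard unicodedata module. It is not part of UAX #34 or of any existing tooling; the helper names (to_str, is_nfc, canonically_equivalent) are hypothetical, and the base letter a is taken to be U+0061.

```python
import unicodedata

def to_str(code_points):
    """Build a string from hex code points such as ["0105", "0301"]."""
    return "".join(chr(int(cp, 16)) for cp in code_points)

def is_nfc(code_points):
    """True if the sequence is unchanged by NFC normalization."""
    s = to_str(code_points)
    return unicodedata.normalize("NFC", s) == s

def canonically_equivalent(a, b):
    """True if both sequences have the same NFD form."""
    nfd_a = unicodedata.normalize("NFD", to_str(a))
    nfd_b = unicodedata.normalize("NFD", to_str(b))
    return nfd_a == nfd_b

# The four representations of a-ogonek-acute discussed in this note
# (hypothetical test data, with U+0061 as the base letter).
candidates = [
    ["0105", "0301"],
    ["00E1", "0328"],
    ["0061", "0328", "0301"],
    ["0061", "0301", "0328"],
]

for seq in candidates:
    status = "NFC" if is_nfc(seq) else "not NFC"
    print("<" + ", ".join(seq) + ">", status)
```

Under Option B, only a candidate for which is_nfc returns true would be eligible for a name; the remaining equivalents found with canonically_equivalent are the ones that could be listed in a comment.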
A potential drawback to Option B is that there may be particular cases where NFC turns out not to be the ideal representation for a particular sequence that some petitioner wishes to standardize a name for.

An advantage to Option B is that to date (unless I have missed something), all requested named character sequences are in fact in NFC, including the new list of Lithuanian character sequences.

Option D is similar in end effect, but would be a little harder to implement in UAX #34 and in NamedSequences.txt. We would have to invent some specification and syntax to indicate that multiple, canonically equivalent sequences were normatively listed explicitly in the data file for any such named character sequence. This might satisfy the political aspect of the problem in any instance where NFC turned out not to be the "best" sequence, but the drawback would be that it would complicate the parsing of NamedSequences.txt unnecessarily.

Option C would work just like Option B, but without a priori advantaging any normalization form (or even requiring that a standard named character sequence be in any normalization form). This might be politically a little more palatable, but would end up making the list and implementations a bit messier.

Option E is effectively an alternate way to express canonical equivalence. Its main disadvantage would be that it complicates testing for name uniqueness. It would require some tweaking both in the text of UAX #34 and in other places where we discuss name uniqueness.
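To give a sense of what the Option E uniqueness test might involve, here is a minimal Python sketch. It assumes, purely for illustration, a "NAME;XXXX XXXX" line format with "#" comments; the function names (parse_entries, check_option_e) and the sample data are hypothetical and do not describe any existing tool.

```python
import unicodedata

def parse_entries(lines):
    """Parse 'NAME;XXXX XXXX' lines (assumed format); skip comments and blanks."""
    entries = []
    for line in lines:
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        name, cps = line.split(";", 1)
        seq = "".join(chr(int(cp, 16)) for cp in cps.split())
        entries.append((name.strip(), seq))
    return entries

def check_option_e(entries):
    """Return a list of violations of the relaxed uniqueness rule.

    A name may repeat only for canonically equivalent sequences, and
    canonically equivalent sequences must all carry the same name.
    """
    errors = []
    name_to_key = {}  # name -> NFD form of its first sequence
    key_to_name = {}  # NFD form -> first name seen for it
    for name, seq in entries:
        key = unicodedata.normalize("NFD", seq)
        if name_to_key.setdefault(name, key) != key:
            errors.append("name reused for non-equivalent sequence: " + name)
        if key_to_name.setdefault(key, name) != name:
            errors.append("distinct names for equivalent sequences: "
                          + name + " / " + key_to_name[key])
    return errors

# Hypothetical sample data: two equivalent sequences sharing one name.
sample = [
    "# illustrative entries only",
    "LATIN SMALL LETTER A WITH OGONEK AND ACUTE;0105 0301",
    "LATIN SMALL LETTER A WITH OGONEK AND ACUTE;00E1 0328",
]
print(check_option_e(parse_entries(sample)))
```

The point of the sketch is that the test can no longer be a plain duplicate-name lookup: each name has to be compared through a normalized key, which is the extra complication noted above.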