L2/99-044

From: Martin J. Duerst [duerst@w3.org]
Sent: Monday, February 01, 1999 4:45 AM
To: Multiple Recipients of Unicore
Cc: unicore@unicode.org
Subject: NORMALIZATION

Hello Mark,

I guess you must be very busy, and I'm late with these comments and very
sorry about it. I was a bit out of the loop over Christmas/New Year, but
now I'm back.

Here is my feedback on the normalization draft technical report
(http://www.unicode.org/unicode/reports/tr15/tr15-10.html). It's in three
parts: first, things that I think might affect Unicode 3.0; second,
details on your TR; third, some general considerations. I will also send
a copy of this mail to the W3C I18N WG.

Things affecting Unicode 3.0
============================

Sorry, I only got the code charts and not the actual text, so I can only
comment on previous discussions and on what's in 2.0/2.1. This is mainly
about Korean canonicalization.

What we want is for decomposition to result in simple three-code
L(eading)V(owel)T(railing) or two-code LV sequences. We therefore have to
move the Jamo subcomposition from being a compatibility decomposition to
being some separate kind of decomposition, as Ken suggested and as we
discussed in December. In the database, this can easily be taken into
account by just giving these decompositions a separate label, instead of
the one we have currently.

Another thing that we have to consider in this context is the interaction
of modern hangul precompositions (johab) and ancient jamo, basically for
cases such as KAKR. The problem here is that KR is listed as a trailing
jamo (U+11C3), but it is not a modern one, and therefore KAKR is not in
the johab list. KAKR is representable at least as:

- K - A - KR   (favor two/three-code jamo sequences)
- KA - KR      (use precomposition, but keep individual syllable
                components together)
- KAK - R      (do greedy composition from the start)

The first variant is what I described in my early internet draft. It seems
to be the most natural solution in this case, but I know "natural" is very
subjective. Some of the current text about how to construct johab affects
this. We should try to make sure that:

- the text describing johab composition and the C form of Hangul match;
- the two things above match general expectations, if there is anything
  like that.

For the vowels, the same problem appears. For the leading component, e.g.
NKAK (NK is U+1113), there seems to be only one reasonable representation,
namely NK - A - K. Or is there another one?

We should also make clear that (or whether?) the insertion of fillers, as
described on page 3-12 of Unicode 2.0, is part of canonicalization.

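To make the johab arithmetic concrete, here is a minimal sketch (Python
and the function name are mine, purely for illustration; the base and
count constants are the usual ones from the standard) of the decomposition
that yields exactly these two-code LV and three-code LVT sequences for
modern syllables:

# Arithmetic decomposition of a precomposed (johab) Hangul syllable
# into L V or L V T conjoining jamo.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
N_COUNT = V_COUNT * T_COUNT          # 588
S_COUNT = L_COUNT * N_COUNT          # 11172

def decompose_hangul(cp: int) -> list:
    """Decompose a precomposed syllable into [L, V] or [L, V, T]."""
    s_index = cp - S_BASE
    if not (0 <= s_index < S_COUNT):
        return [cp]                  # not a precomposed syllable
    l = L_BASE + s_index // N_COUNT
    v = V_BASE + (s_index % N_COUNT) // T_COUNT
    t_index = s_index % T_COUNT
    return [l, v] if t_index == 0 else [l, v, T_BASE + t_index]

# U+AC01 ("KAK" in the romanization above) -> U+1100 U+1161 U+11A8 (L V T)
print([hex(c) for c in decompose_hangul(0xAC01)])

For KAKR there is no precomposed syllable to decompose, so the first
variant above simply spells it directly as U+1100 U+1161 U+11C3.
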
- The sections "labels", "notation", and "definitions" are difficult to distinguish. If one has a very close look, one sees why the distinction is made, but most readers won't go that deep, and will be rather confused. This can either be improved by changing the titles, or changing the structure even more. E.g. the section on labels could be integrated with the Introduction (which already mentions all four forms) by just saying that a table gives an overview of these four forms, the labels that are defined for them,... - The "table of contents" should be moved to before the Intro. - The specification section takes a first go ("Basically,..."), and then starts again. As this is the specification, I propose to really only just have the actual definitions. In general, beware of the word "basically". - In the specification section, change the subtitle e.g. to "Normalization Form C" only, then add a sentence such as "The Normalization Form C for a string S is obtained by applying the following process, or any other process that leads to the same result". - Part of the "notation" is only used in the examples section. Please move that there. Why use ND(), NC(),...? Why not just D()? The parentheses make clear whether it's the normalization form label or the function. - "Primary canonical compositions": This is an interesting concept. But I think we should turn things around. Instead of marking the usual case as "primary", we should mark the exception case as something like "non-recomposing". It turns out that this case at least contains the following: - One-to-one compositions such as Kelvin and Angstrom - Special cases such as Hebrew - Post v3.0 precomposed characters Because we need the second and the third case in the database, we don't need to introduce definitions for the first case. [In the database file, I propose to use a label such as ; this would make the v3.0 database file slightly incompatible with the v2.x database file, but the v2.x one could be produced easily with a sed/awk/perl expression such as "s///".] Single-character decompositions would then just be listed as , and wouldn't be a special case anymore. With this, the section on definitions could easily be integrated with the specification section. The specifications themselves would then become a bit longer, but I think this is desirable. Trying to factor out everything from the specification itself, and making the specification as short and crisp as possible, is an interesting intellectual exercise, but it means that an implementer has to assemble all the factored-out pieces, and will easily overlook one of these pieces. - Goals: Please say (ideally in the title) whether these are the goals of the algorithm design or the goals of the compositions, or what. Ideally, move that section to immediately after the intro, or to an appendix. Now it stands between sections that the reader needs to implement the algorithm. - I think the second design goal is difficult to understand. "stability" is a very general term. - Conformance should come after the specification itself. - In the actual specification, it is not clear whether once one has found a C-B pair that combines, one has to start searching again all the B between the C and B just found or not. Please make this explicit. (It also might depend on the tables used). 
General Considerations
======================

I have contacted several influential and knowledgeable people in the IETF
and presented them with the following list of ways to recombine after
decomposition and canonical ordering:

1) Stay with the decomposed form for everything.
2) Use a precomposed character if there is one that matches the full
   string; otherwise use the fully decomposed form.
3) Use a precomposed character for the longest possible initial match, and
   represent the rest with combining characters.
4) Pairwise absorb combining characters into the base character, checking
   the combining characters in order. Stop when all are checked. Leave
   those that have not been absorbed as combining characters.
5) Check all possible combinations of precomposed characters and combining
   characters that are equivalent; choose the shortest one; use some
   tiebreaker if there is more than one.

Everybody I presented this list to chose 3). At worst, this means that 4)
(the current proposal) will not be accepted by the IETF; that would be too
bad, because it would rule 4) out. At best, it is a strong indication that
4) is quite difficult to understand, and that we have to make a great
effort to describe it very clearly, to explain its advantages, and to
provide implementations. (A small sketch contrasting 3) and 4) follows at
the end of this mail.) I'll try to probe further to find out which of the
above applies. If you want to be part of the discussion, please tell me;
I guess that might help on both sides.

A completely different point, but nevertheless important: does the Unicode
Consortium have any policies about IPR? In particular, does it have any
disclosure policies for patent claims? This is a 120% hypothetical example
(I hope), but suppose somebody files a patent on a certain way to encode
Mongolian variants. That party collaborates in designing the Unicode
Mongolian solution and successfully manages to get their ideas adopted.
Later, the patent is granted, and this party starts asking everybody
implementing Unicode to pay them licence fees for Mongolian. Any
information on this is greatly appreciated.

Regards,   Martin.


#-#-#  Martin J. Du"rst, World Wide Web Consortium  #-#-#
mailto:duerst@w3.org   http://www.w3.org

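To make the difference between 3) and 4) concrete, a small sketch in
Python (purely illustrative: the pair table and combining-class table are
tiny hand-made excerpts, the function names are mine, and a real
implementation would derive its tables from the database). Both functions
assume one base character followed by combining marks that are already in
canonical order; the input is chosen so that the two strategies differ.

# Tiny excerpts: a + acute -> á (U+00E1), a + ogonek -> ą (U+0105);
# combining classes for acute, ogonek, and long solidus overlay.
PRECOMPOSED = {"a\u0301": "\u00e1", "a\u0328": "\u0105"}
CCC = {"\u0301": 230, "\u0328": 202, "\u0338": 1}

def strategy3(s: str) -> str:
    """3) Use a precomposed character for the longest initial match;
    leave the rest as combining characters."""
    for end in range(len(s), 1, -1):
        if s[:end] in PRECOMPOSED:
            return PRECOMPOSED[s[:end]] + s[end:]
    return s

def strategy4(s: str) -> str:
    """4) Pairwise absorb marks into the base, in order; marks blocked by
    an earlier kept mark of equal or higher class, or without a
    precomposed pair, stay as combining characters."""
    base, kept, highest_kept_ccc = s[0], [], 0
    for mark in s[1:]:
        if CCC[mark] > highest_kept_ccc and base + mark in PRECOMPOSED:
            base = PRECOMPOSED[base + mark]        # absorbed
        else:
            kept.append(mark)
            highest_kept_ccc = CCC[mark]
    return base + "".join(kept)

s = "a\u0338\u0301"      # a, long solidus overlay (class 1), acute (230)
print(strategy3(s))      # a + U+0338 + U+0301: no initial match exists
print(strategy4(s))      # á (U+00E1) + U+0338: the acute is still absorbed

With 3) nothing recomposes, because there is no precomposed character for
the initial a + U+0338; with 4), the acute is absorbed past the overlay
(which has a lower combining class), so the two strategies give different
results on exactly the kind of sequence where the debate matters.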