L2/99-174

From: Martin J. Duerst [duerst@w3.org]
Sent: Monday, June 07, 1999 5:25 AM
Subject: Unicode TR #15 at the next UTC

Hello Lisa, dear Arnold,

I am sending this note in my function as liaison person from the W3C I18N WG to the Unicode Consortium to you in your functions as chair of the Unicode Technical Committee, and as a chair of LC2 (where appropriate).

I would like to ask you to make this document available to all participants at the UTC next week. Please feel free also to send it to any relevant email lists beforehand if you consider that appropriate.

I am very sorry that I cannot participate in person due to various other duties. However, I would very much like to participate by telephone. If it is possible to arrange for the discussions on the topics mentioned below to take place after 2pm (which is 6am here in Japan), and if I know (a day in advance) when and where to dial, that would be great. I have copied Murray Sargent so that he may tell us whether teleconferencing equipment is available. Many thanks in advance.

This document contains requests from the I18N WG of the W3C (World Wide Web Consortium) regarding the current version of Draft Unicode Technical Report #15 (http://www.unicode.org/unicode/reports/tr15/tr15-14.html), based on the discussions at the last meeting of our WG. It also contains other requests and comments on that document, including editorial ones. Although these have not been discussed at the WG meeting, they are included here because we very much want to help make sure that TR #15 is clear and easy to understand, and that it fulfills our requirements as laid out in our Working Draft "Requirements for String Identity Matching and String Indexing" [publicly available at http://www.w3.org/TR/1998/WD-charreq-19980710#3].

At its last meeting the W3C I18N WG was strongly concerned about the algorithm used in Unicode DTR #15.
On behalf of the W3C I18N WG, I herewith request the UTC to reconsider this issue and to change the algorithm used for Canonical Composition (Normalization Form C) from the so-called "FineComposition" to "MiddleComposition". The main reasons for this request are as follows:

- The W3C I18N WG, for the W3C as a whole, is, as far as we know, the main "customer" for Normalization Form C (see our Working Draft "Character Model for the World Wide Web" at http://www.w3.org/TR/1999/WD-charmod-19990225).
- We want Normalization Form C to be used as widely as possible. The easier TR #15 is to understand, and the more straightforward and obvious the algorithm chosen, the higher the chance that it will be used.
- The fact that "FineComposition" in some cases composes better than "MiddleComposition" is irrelevant to us, because compaction as such was never a goal, and because the cases where there is a difference are extremely rare.
- We expect Normalization Form C to be implemented in various contexts, in particular:
  - Normalization Form D (canonically decomposed) => Normalization Form C
  - Normalization Form C => Normalization Form D
  - With only a subset of the Unicode characters as input
  - Independent of other Unicode processing
  In these cases, the simplicity of "MiddleComposition" is clearly a gain.
- "MiddleComposition" allows the same high-speed optimizations and other implementation benefits as "FineComposition", contrary to earlier assumptions.
- The exact conditions for composition in the case of "FineComposition" depend on the combining class of the characters (zero vs. non-zero). This may lead to undesirable dependencies between reordering behaviour and composition behaviour. Also, the exact conditions are somewhat arbitrary, and may lead to diverging implementations. "MiddleComposition" does not contain any such arbitrariness (see below).
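To make the simplicity argument concrete, the pairwise loop of Appendix A can be exercised with a toy composition table. This is only a sketch: the class name, the three-entry PAIRS table, and the use of U+FFFF for NOT_A_CHAR are assumptions made for illustration, and a real implementation would derive the pairs from the Unicode Character Database.

```java
import java.util.HashMap;
import java.util.Map;

class MiddleCompositionSketch {
    // Assumed sentinel; Appendix A does not say which value NOT_A_CHAR has.
    static final char NOT_A_CHAR = '\uFFFF';

    // Hypothetical toy excerpt of canonical pairs, for illustration only.
    static final Map<String, Character> PAIRS = new HashMap<>();
    static {
        PAIRS.put("a\u0308", '\u00E4');      // a + diaeresis
        PAIRS.put("\u00E4\u0304", '\u01DF'); // a-diaeresis + macron
        PAIRS.put("e\u0301", '\u00E9');      // e + acute
    }

    // Returns the composite of the two characters, or NOT_A_CHAR.
    static char pairwiseCombines(char first, char second) {
        Character composite = PAIRS.get("" + first + second);
        return composite == null ? NOT_A_CHAR : composite;
    }

    // Same loop as in Appendix A, returning the result as a String.
    static String middleCompose(String source) {
        if (source.isEmpty()) return source;
        StringBuilder target = new StringBuilder();
        char buf = source.charAt(0);
        for (int i = 1; i < source.length(); ++i) {
            char ch = source.charAt(i);
            char composite = pairwiseCombines(buf, ch);
            if (composite != NOT_A_CHAR) {
                buf = composite;    // keep composing into the buffer
            } else {
                target.append(buf); // flush the buffer
                buf = ch;
            }
        }
        return target.append(buf).toString();
    }

    public static void main(String[] args) {
        // a + diaeresis + macron collapses to one character in two pairwise steps.
        System.out.println(middleCompose("a\u0308\u0304").equals("\u01DF"));
    }
}
```

The loop needs no knowledge of combining classes, which is the property the last reason in the list relies on.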
Given the above reasons, we would very much appreciate it if the Unicode Technical Committee could choose "MiddleComposition" for TR #15 at its upcoming meeting.

Requests related to scripts and data
====================================

Thai/Lao: The treatment of THAI CHARACTER SARA AM and LAO VOWEL SIGN AM (in the database and the algorithm) should be revisited based on comments from specialists. In case they and their decompositions stay as canonical equivalents, whether or not they should be in the Composition Exclusion Table should be reconsidered based on actual practice. Furthermore, if they indeed stay in the Composition Exclusion Table, they should be included under script-specifics; there are too few of them to warrant a special category. Furthermore, if current practice has them precomposed, and in case "FineComposition" should be kept, changes to the "FineComposition" algorithm are needed to deal with these cases appropriately.

Vietnamese: At our WG meeting, Michel Suignard had some comments regarding current Vietnamese (de)composition practice. He promised to follow up on that, but I have not seen any mail yet. I have copied him on this mail, and I assume that he will be present at the UTC meeting.

Korean:
- Creation of a new category "Hangul Jamo Decomposition", see below (in case the next point is not accepted, the lines marked with *** would have to be changed).
- Inclusion of these decompositions in canonical equivalence, but with their canonical ("decomposed") form being composed.
- Addition of two codepoints for completeness/regularization
    *11A3 O-YEO ~= 1169 O + 1167 YEO
    *11A4 U-YEO ~= 116E U + 1167 YEO
- Change of the algorithm in TR #15 so that for Normalization C, the following steps are performed for Hangul (as always, any shortcuts allowed if they lead to the same result):
    1) Hangul Syllable Decomposition
    2) Decomposition according to "Hangul Jamo Decomposition"
    3) Cleanup of fillers according to result of discussion initiated by Mark Davis
    4) Recomposition using "Hangul Jamo Decomposition"
    5) Recomposition of Hangul Syllables
  These steps should be explicitly mentioned in the

There are some alternatives/details that should be considered for the last two steps. As written, it is a two-step process, which gives priority to recomposing syllable components (L/V/T) over trying to get as much as possible into a composed syllable. There are two alternatives (giving label 1) to the proposal above):

2) In the last step, only recompose syllables if everything belonging to a Hangul Syllable can be recomposed (i.e. no vowel or trailing consonant component is left alone). In other words, only use Hangul Syllables for modern Hangul; for older syllables, remain on the Jamo level for the whole syllable.
3) Do 4) and 5) in a single step. This is possible because the components of modern Hangul decompose regularly.

The differences and similarities between the three alternatives can be seen in the following examples (final kr is only available in ancient Hangul, but final ks is modern):

For the syllable kakr,
  Variant 1) would lead to ka + kr (the first being a syllable, the second a jamo)
  Variant 2) would lead to k + a + kr (all jamos)
  Variant 3) would lead to kak + r

In all cases, the syllable kaks would end up as one syllable, kaks, independent of the form in which it came in (e.g. kaks, kak + s, ka + ks, ka + k + s, k + a + ks, k + a + k + s).

***** That such syllables get normalized this way is the major aim of the whole exercise!
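The kaks and kakr examples above can be traced with a small sketch of steps 1), 2), 4) and 5) under Variant 1). Everything here is illustrative: the class and method names are invented, the jamo table is a two-entry excerpt of Appendix B (final ks and final kr only), and step 3) is left out because the filler cleanup was still under discussion. The syllable arithmetic is the regular one for modern Hangul syllables starting at U+AC00.

```java
import java.util.HashMap;
import java.util.Map;

class HangulNormalizationSketch {
    // Constants of the regular Hangul syllable arithmetic.
    static final int S_BASE = 0xAC00, L_BASE = 0x1100, V_BASE = 0x1161, T_BASE = 0x11A7;
    static final int L_COUNT = 19, V_COUNT = 21, T_COUNT = 28;
    static final int N_COUNT = V_COUNT * T_COUNT, S_COUNT = L_COUNT * N_COUNT;

    // Hypothetical two-entry excerpt of the Appendix B table.
    static final Map<Character, String> JAMO_DECOMP = new HashMap<>();
    static final Map<String, Character> JAMO_COMP = new HashMap<>();
    static {
        addJamo('\u11AA', "\u11A8\u11BA"); // final ks = final k + final s
        addJamo('\u11C3', "\u11A8\u11AF"); // final kr = final k + final r
    }
    static void addJamo(char composed, String parts) {
        JAMO_DECOMP.put(composed, parts);
        JAMO_COMP.put(parts, composed);
    }

    // Steps 1) and 2): split syllables into jamo, then split compound jamo.
    static String decompose(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char ch = s.charAt(i);
            int sIndex = ch - S_BASE;
            if (0 <= sIndex && sIndex < S_COUNT) {
                decomposeJamo((char) (L_BASE + sIndex / N_COUNT), out);
                decomposeJamo((char) (V_BASE + (sIndex % N_COUNT) / T_COUNT), out);
                int tIndex = sIndex % T_COUNT;
                if (tIndex != 0) decomposeJamo((char) (T_BASE + tIndex), out);
            } else {
                decomposeJamo(ch, out);
            }
        }
        return out.toString();
    }
    static void decomposeJamo(char ch, StringBuilder out) {
        String parts = JAMO_DECOMP.get(ch);
        if (parts == null) { out.append(ch); return; }
        for (int i = 0; i < parts.length(); i++) decomposeJamo(parts.charAt(i), out);
    }

    // Step 4): pairwise recomposition of compound jamo.
    static String composeJamo(String s) {
        if (s.isEmpty()) return s;
        StringBuilder out = new StringBuilder();
        char buf = s.charAt(0);
        for (int i = 1; i < s.length(); i++) {
            char ch = s.charAt(i);
            Character composed = JAMO_COMP.get("" + buf + ch);
            if (composed != null) { buf = composed; } else { out.append(buf); buf = ch; }
        }
        return out.append(buf).toString();
    }

    // Step 5): L + V -> LV syllable, LV + modern T -> LVT syllable.
    static String composeSyllables(String s) {
        if (s.isEmpty()) return s;
        StringBuilder out = new StringBuilder();
        char last = s.charAt(0);
        out.append(last);
        for (int i = 1; i < s.length(); i++) {
            char ch = s.charAt(i);
            int lIndex = last - L_BASE, vIndex = ch - V_BASE;
            int sIndex = last - S_BASE, tIndex = ch - T_BASE;
            if (0 <= lIndex && lIndex < L_COUNT && 0 <= vIndex && vIndex < V_COUNT) {
                last = (char) (S_BASE + (lIndex * V_COUNT + vIndex) * T_COUNT);
                out.setCharAt(out.length() - 1, last);
            } else if (0 <= sIndex && sIndex < S_COUNT && sIndex % T_COUNT == 0
                    && 0 < tIndex && tIndex < T_COUNT) {
                last = (char) (last + tIndex);
                out.setCharAt(out.length() - 1, last);
            } else {
                last = ch;
                out.append(ch);
            }
        }
        return out.toString();
    }

    static String normalizeHangul(String s) {
        return composeSyllables(composeJamo(decompose(s)));
    }

    public static void main(String[] args) {
        // k + a + k + s and kak + s both end up as the single syllable U+AC03.
        System.out.println(normalizeHangul("\u1100\u1161\u11A8\u11BA").equals("\uAC03"));
        System.out.println(normalizeHangul("\uAC01\u11BA").equals("\uAC03"));
    }
}
```

With this sketch, kak + r (U+AC01 U+11AF) comes out as ka + kr (U+AC00 U+11C3), which is the Variant 1) outcome; Variants 2) and 3) would need a different final step.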
I currently have no preference among variants 1), 2), and 3); more information on local practice seems necessary.

Non-zero - Non-zero canonical equivalences: The current composition algorithm ("FineComposition") does not recompose pairs of characters that both have a non-zero combining class. UnicodeData-3.0.0d8.txt contains the following five such cases:

  Non-zero composition: U+0344 from U+0308 U+0301 (COMBINING GREEK DIALYTIKA TONOS)
  Non-zero composition: U+0C48 from U+0C46 U+0C56 (TELUGU VOWEL SIGN AI)
  Non-zero composition: U+0F73 from U+0F71 U+0F72
  Non-zero composition: U+0F75 from U+0F71 U+0F74
  Non-zero composition: U+0F81 from U+0F71 U+0F80
  (three cases in TIBETAN)

These should be reexamined. At least the TELUGU case definitely should go precomposed, to be parallel with the other Indic scripts. The Tibetan cases probably also should go precomposed. The class assignment is made based on where one wants to reorder things (and where not), and this should not affect composition where not appropriate.

Editorial and other comments
============================

- The last paragraph of the Introduction says: "The decomposition normalization forms D and KD *are* closed under string concatenation and substringing." This is wrong and should be corrected (assume the first string contains a and a grave accent, the second string a cedilla; if they are concatenated, the grave accent and the cedilla have to be exchanged to get canonical ordering). On the other hand, both representations are closed under substringing, i.e. substrings of codepoints from a representation conform to the representation they originate from.
- The discussion on legacy character encodings, currently in the "Definition" section, should be moved to an appendix, and changed to be more advisory than defining. Also, the understanding of "unnormalizable" should be changed so that the "under common transcoders" is removed. Wherever possible, we want transcoders to do the right thing, rather than encodings to be labeled as unnormalizable.
- After changing from "FineComposition" to "MiddleComposition", the section on Definitions can be removed. [In case "FineComposition" should be kept, the definitions related to the specification of the algorithm should be integrated into the Specification section in order to make the document easier to read for implementers. Also, the difference between the Specification section and the Java code regarding sequences of more than two characters of combining class zero should be removed, most probably in favor of the description, so that cases with more than two class zero characters can be combined (there are currently no such cases, with the important exception of Korean).]
- In "Conformance", the last clause should be changed: "and then testing bit-for-bit identity." -> "and then testing code point for code point identity."
- The section "Composition Exclusion Table" should start with a short overview of the various categories, and then go on to give examples and discussions of the various categories.
- In the Examples section, more examples should be given for Hangul (including cases where full recomposition is not possible).
- The section on Hangul Character Names should be removed. Although the algorithm is very similar to Syllable Decomposition, this is completely unrelated to the topic of the TR, and may lead to confusion.
- Change as follows (Introduction): "it does not precisely specify the format" -> "it does not define any specific format"
- Add section/subsection numbers.
- Add a section with references. Change W3C references to exact title, and add "work in progress".
- Make sure that links can also be followed in the printed version (e.g. "More complete examples are provided below." in the Intro).
- In "Normalization Form KC", point 2, bullet, a "the" is missing.

Regards, Martin.
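The concatenation counterexample from the first editorial comment above (a with grave accent, followed by a string holding only a cedilla) can be checked mechanically. The sketch below uses java.text.Normalizer, which implements the normalization forms as later standardized rather than draft 14 of TR #15; the class and helper names are invented for illustration.

```java
import java.text.Normalizer;

class ConcatenationClosureSketch {
    // True if the string is already in Normalization Form D.
    static boolean isNfd(String s) {
        return Normalizer.isNormalized(s, Normalizer.Form.NFD);
    }

    // Converts the string to Normalization Form D.
    static String toNfd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    public static void main(String[] args) {
        String first = "a\u0300";  // a + COMBINING GRAVE ACCENT (class 230)
        String second = "\u0327";  // COMBINING CEDILLA (class 202)
        String joined = first + second;

        System.out.println(isNfd(first));   // each part is in Form D on its own
        System.out.println(isNfd(second));
        System.out.println(isNfd(joined));  // the concatenation is not
        // Canonical ordering moves the cedilla in front of the grave accent.
        System.out.println(toNfd(joined).equals("a\u0327\u0300"));
    }
}
```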
Appendix A
==========

Java algorithm for Middle Composition

    // pairwiseCombines() and NOT_A_CHAR are assumed to be defined elsewhere:
    // pairwiseCombines returns the composite of two characters, or
    // NOT_A_CHAR if the pair does not combine.
    static void middleCompose(String source, StringBuffer target) {
        if (source.length() == 0) return;  // nothing to compose
        char buf = source.charAt(0);
        for (int i = 1; i < source.length(); ++i) {
            char ch = source.charAt(i);
            // check if the new character combines with the
            // buffer character
            char composite = pairwiseCombines(buf, ch);
            if (composite != NOT_A_CHAR) {
                buf = composite;       // then replace
            } else {
                target.append(buf);    // add buffer to target
                buf = ch;              // put new character into buffer
            }
        }
        // add last buffer
        target.append(buf);
    }

Appendix B
==========

Proposed Hangul Jamo Decompositions

ᄁ decomposes to ᄀᄀ. ᄄ decomposes to ᄃᄃ. ᄈ decomposes to ᄇᄇ. ᄊ decomposes to ᄉᄉ. ᄍ decomposes to ᄌᄌ. ᄓ decomposes to ᄂᄀ. ᄔ decomposes to ᄂᄂ. ᄕ decomposes to ᄂᄃ. ᄖ decomposes to ᄂᄇ. ᄗ decomposes to ᄃᄀ. ᄘ decomposes to ᄅᄂ. ᄙ decomposes to ᄅᄅ. ᄚ decomposes to ᄅᄒ. ᄛ decomposes to ᄅᄋ. ᄜ decomposes to ᄆᄇ. ᄝ decomposes to ᄆᄋ. ᄞ decomposes to ᄇᄀ. ᄟ decomposes to ᄇᄂ. ᄠ decomposes to ᄇᄃ. ᄡ decomposes to ᄇᄉ. ᄢ decomposes to ᄡᄀ. ᄣ decomposes to ᄡᄃ. ᄤ decomposes to ᄡᄇ. ᄥ decomposes to ᄡᄉ. ᄦ decomposes to ᄡᄌ. ᄧ decomposes to ᄇᄌ. ᄨ decomposes to ᄇᄎ. ᄩ decomposes to ᄇᄐ. ᄪ decomposes to ᄇᄑ. ᄫ decomposes to ᄇᄋ. ᄬ decomposes to ᄈᄋ. ᄭ decomposes to ᄉᄀ. ᄮ decomposes to ᄉᄂ. ᄯ decomposes to ᄉᄃ. ᄰ decomposes to ᄉᄅ. ᄱ decomposes to ᄉᄆ. ᄲ decomposes to ᄉᄇ. ᄳ decomposes to ᄲᄀ. ᄴ decomposes to ᄊᄉ. ᄵ decomposes to ᄉᄋ. ᄶ decomposes to ᄉᄌ. ᄷ decomposes to ᄉᄎ. ᄸ decomposes to ᄉᄏ. ᄹ decomposes to ᄉᄐ. ᄺ decomposes to ᄉᄑ. ᄻ decomposes to ᄉᄒ. ᄽ decomposes to ᄼᄼ. ᄿ decomposes to ᄾᄾ. ᅁ decomposes to ᄋᄀ. ᅂ decomposes to ᄋᄃ. ᅃ decomposes to ᄋᄆ. ᅄ decomposes to ᄋᄇ. ᅅ decomposes to ᄋᄉ. ᅆ decomposes to ᄋᅀ. ᅇ decomposes to ᄋᄋ. ᅈ decomposes to ᄋᄌ. ᅉ decomposes to ᄋᄎ. ᅊ decomposes to ᄋᄐ. ᅋ decomposes to ᄋᄑ. ᅍ decomposes to ᄌᄋ. ᅏ decomposes to ᅎᅎ. ᅑ decomposes to ᅐᅐ. ᅒ decomposes to ᄎᄏ. ᅓ decomposes to ᄎᄒ. ᅖ decomposes to ᄑᄇ. ᅗ decomposes to ᄑᄋ. ᅘ decomposes to ᄒᄒ. ᅢ decomposes to ᅡᅵ. ᅤ decomposes to ᅣᅵ. ᅦ decomposes to ᅥᅵ.
ᅨ decomposes to ᅧᅵ. ᅪ decomposes to ᅩᅡ. ᅫ decomposes to ᅪᅵ. ᅬ decomposes to ᅩᅵ. ᅯ decomposes to ᅮᅥ. ᅰ decomposes to ᅯᅵ. ᅱ decomposes to ᅮᅵ. ᅴ decomposes to ᅳᅵ. ᅶ decomposes to ᅡᅩ. ᅷ decomposes to ᅡᅮ. ᅸ decomposes to ᅣᅩ. ᅹ decomposes to ᅣᅭ. ᅺ decomposes to ᅥᅩ. ᅻ decomposes to ᅥᅮ. ᅼ decomposes to ᅥᅳ. ᅽ decomposes to ᅧᅩ. ᅾ decomposes to ᅧᅮ. ᅿ decomposes to ᅩᅥ. ᆀ decomposes to ᅿᅵ. ***ᆁ decomposes to ᆣᅵ. ᆂ decomposes to ᅩᅩ. ᆃ decomposes to ᅩᅮ. ᆄ decomposes to ᅭᅣ. ᆅ decomposes to ᆄᅵ. ᆆ decomposes to ᅭᅧ. ᆇ decomposes to ᅭᅩ. ᆈ decomposes to ᅭᅵ. ᆉ decomposes to ᅮᅡ. ᆊ decomposes to ᆉᅵ. ᆋ decomposes to ᅯᅳ. ***ᆌ decomposes to ᆤᅵ. ᆍ decomposes to ᅮᅮ. ᆎ decomposes to ᅲᅡ. ᆏ decomposes to ᅲᅥ. ᆐ decomposes to ᆏᅵ. ᆑ decomposes to ᅲᅧ. ᆒ decomposes to ᆑᅵ. ᆓ decomposes to ᅲᅮ. ᆔ decomposes to ᅲᅵ. ᆕ decomposes to ᅳᅮ. ᆖ decomposes to ᅳᅳ. ᆗ decomposes to ᅴᅮ. ᆘ decomposes to ᅵᅡ. ᆙ decomposes to ᅵᅣ. ᆚ decomposes to ᅵᅩ. ᆛ decomposes to ᅵᅮ. ᆜ decomposes to ᅵᅳ. ᆝ decomposes to ᅵᆞ. ᆟ decomposes to ᆞᅥ. ᆠ decomposes to ᆞᅮ. ᆡ decomposes to ᆞᅵ. ᆢ decomposes to ᆞᆞ. ***ᆣ decomposes to ᅩᅧ. ***ᆤ decomposes to ᅮᅧ. ᆩ decomposes to ᆨᆨ. ᆪ decomposes to ᆨᆺ. ᆬ decomposes to ᆫᆽ. ᆭ decomposes to ᆫᇂ. ᆰ decomposes to ᆯᆨ. ᆱ decomposes to ᆯᆷ. ᆲ decomposes to ᆯᆸ. ᆳ decomposes to ᆯᆺ. ᆴ decomposes to ᆯᇀ. ᆵ decomposes to ᆯᇁ. ᆶ decomposes to ᆯᇂ. ᆹ decomposes to ᆸᆺ. ᆻ decomposes to ᆺᆺ. ᇃ decomposes to ᆨᆯ. ᇄ decomposes to ᆪᆨ. ᇅ decomposes to ᆫᆨ. ᇆ decomposes to ᆫᆮ. ᇇ decomposes to ᆫᆺ. ᇈ decomposes to ᆫᇫ. ᇉ decomposes to ᆫᇀ. ᇊ decomposes to ᆮᆨ. ᇋ decomposes to ᆮᆯ. ᇌ decomposes to ᆰᆺ. ᇍ decomposes to ᆯᆫ. ᇎ decomposes to ᆯᆮ. ᇏ decomposes to ᇎᇂ. ᇐ decomposes to ᆯᆯ. ᇑ decomposes to ᆱᆨ. ᇒ decomposes to ᆱᆺ. ᇓ decomposes to ᆲᆺ. ᇔ decomposes to ᆲᇂ. ᇕ decomposes to ᆲᆼ. ᇖ decomposes to ᆳᆺ. ᇗ decomposes to ᆯᇫ. ᇘ decomposes to ᆯᆿ. ᇙ decomposes to ᆯᇹ. ᇚ decomposes to ᆷᆨ. ᇛ decomposes to ᆷᆯ. ᇜ decomposes to ᆷᆸ. ᇝ decomposes to ᆷᆺ. ᇞ decomposes to ᇝᆺ. ᇟ decomposes to ᆷᇫ. ᇠ decomposes to ᆷᆾ. ᇡ decomposes to ᆷᇂ. ᇢ decomposes to ᆷᆼ. 
ᇣ decomposes to ᆸᆯ. ᇤ decomposes to ᆸᇁ. ᇥ decomposes to ᆸᇂ. ᇦ decomposes to ᆸᆼ. ᇧ decomposes to ᆺᆨ. ᇨ decomposes to ᆺᆮ. ᇩ decomposes to ᆺᆯ. ᇪ decomposes to ᆺᆸ. ᇬ decomposes to ᆼᆨ. ᇭ decomposes to ᇬᆨ. ᇮ decomposes to ᆼᆼ. ᇯ decomposes to ᆼᆿ. ᇱ decomposes to ᇰᆺ. ᇲ decomposes to ᇰᇫ. ᇳ decomposes to ᇁᆸ. ᇴ decomposes to ᇁᆼ. ᇵ decomposes to ᇂᆫ. ᇶ decomposes to ᇂᆯ. ᇷ decomposes to ᇂᆷ. ᇸ decomposes to ᇂᆸ.

#-#-# Martin J. Du"rst, World Wide Web Consortium
#-#-# mailto:duerst@w3.org http://www.w3.org