Title: Questions about Tailored Normalization
Source: Markus Scherer
Date: 2003/02/12

These are questions that came up while designing a working prototype of
"tailored normalization" as described in public issue 7. In addition to what
I sent to unicore on 2003-feb-11 (attached below) I would like to mention
another issue: test data.

NormalizationTest.txt is a large file with conformance test data. It proved
to be extremely useful and important for verifying efficient implementations
of Unicode normalization, especially for edge cases. Adding even a small set
of predefined tailorings should be accompanied by specific test data, which
will have to be large to be useful.

For arbitrary tailorings, if such are to be allowed, the UAX #15 sample code
should be extended to allow for such tailorings as well (which is in any case
a good idea). Extended sample code makes it possible to cross-check results
against production implementations.

-------- Original Message --------
Subject: questions about tailored normalization
Date: Tue, 11 Feb 2003 10:20:20 -0800

Hello,

I will try to prototype tailored normalization for evaluation of public
issue 7 (http://www.unicode.org/review/). Thinking about how this would fit
(or not) into ICU, I have some practical questions.

My current thinking is to take the ICU normalization code and add a
UnicodeSet pointer parameter to internal functions. The UnicodeSet pointer
would be NULL for untailored normalization, or else point to the set of code
points that are to be excluded from decomposition.

Questions:

First, the semantics. With a naive/intuitive implementation as mentioned
above, I believe that NF*C (=NFC or NFKC) with decomposition exclusions would
in fact still be a Normalization Form, i.e., it would produce unique-form
strings regardless of input.

However, NF*D with decomposition exclusions as above would not be an NF; it
would not produce unique strings from canonically equivalent input. If I
exclude Š from decomposition and perform NFD%DX[Š] on the strings <Š> and
<S, combining caron>, then neither string is modified. They remain different
although they are canonically equivalent.

In order to make NF*D%DX true normalization forms, they would have to
actually do _some_ composition work; in the example, they would have to
compose S + combining caron to Š. This is a significant, non-trivial
modification of an NF*D implementation. (I suppose that a slow implementation
could apply NF*C first, then decompose with exclusions. A faster
implementation would have to run a composition step where the inverse of the
decomposition exclusion set becomes a composition exclusion set... not sure
either would work...)

So the first question is really whether NF*D are to be tailored as well, and
if so, whether NF*D with decomposition exclusions are intended to be
Normalization Forms.

Of course, one way out of this question is to forbid decomposition exclusion
sets with code points that are composition targets (-> allow only code points
without decompositions and ones that have the Full_Composition_Exclusion
property [which is true for singleton decompositions]). For example, CJK
compatibility ideographs would be allowed, but Hangul syllables would not be.

----

In terms of API, as a developer I will eventually need some idea of the
variability of the tailoring. Is it expected that there will be very few
predefined decomposition exclusion sets? For example, if there will be no
more than, say, 4 sets that can also be combined with each other, I could use
a small bit set to specify them, and efficiently cache them and all unions of
them: one bit for Hangul, one bit for CJK compatibility characters, ...
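As a purely illustrative sketch of this bit-set idea in C++ against ICU4C
(the flag names, the cache, and buildPredefinedSet() below are hypothetical,
not existing or proposed ICU API):

    #include <unicode/uniset.h>
    #include <unicode/utypes.h>
    #include <map>

    // Hypothetical flag bits for a handful of predefined decomposition
    // exclusion sets.
    enum {
        EXCLUDE_HANGUL     = 1,  // keep precomposed Hangul syllables
        EXCLUDE_CJK_COMPAT = 2   // keep CJK compatibility ideographs
        // ...room for a few more predefined sets
    };

    // Placeholder: the real contents of each predefined set are exactly
    // what the tailoring definitions would have to specify.
    static icu::UnicodeSet buildPredefinedSet(int32_t /*bit*/) {
        return icu::UnicodeSet();
    }

    // Cache keyed by the flag combination, so each union is computed only
    // once. (Not thread-safe as written; a real implementation would have
    // to synchronize access.)
    static std::map<int32_t, icu::UnicodeSet> gExclusionCache;

    static const icu::UnicodeSet &getExclusionSet(int32_t flags) {
        std::map<int32_t, icu::UnicodeSet>::iterator it =
            gExclusionCache.find(flags);
        if (it == gExclusionCache.end()) {
            icu::UnicodeSet u;
            for (int32_t bit = 1; bit <= flags; bit <<= 1) {
                if (flags & bit) {
                    u.addAll(buildPredefinedSet(bit));  // union of selected sets
                }
            }
            it = gExclusionCache.insert(std::make_pair(flags, u)).first;
            it->second.freeze();  // frozen sets are immutable and cheap to query
        }
        return it->second;
    }

A caller would then pass a combination like
EXCLUDE_HANGUL | EXCLUDE_CJK_COMPAT, and the normalization code would look up
the cached union of the corresponding exclusion sets.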
Side-question: Is there, or will there be, a property that lists CJK
compatibility characters? Can I compute the set by
(Ideographic && hasCanonicalDecomposition)? (A rough sketch of this
computation is appended at the end of this message.)

If there will be many decomposition exclusion sets, or if this is expected to
be up to the user, then it might be better to provide an API that takes a set
parameter directly.

----

I would like to mention a concern that I have with tailored normalization:
API bloat. There are some not-so-obvious functions that use normalization in
one way or another, and many of them do not currently have any parameters
that could be co-opted for this. I guess that other libraries would have this
problem, too.

In ICU, many functions are not methods on any object, but are simple,
stateless C functions or static C++/Java methods. This has worked so far
because there was no state (or fancy options) to be kept. Examples of such
functions, other than the core normalize():

- normalization quick check
- comparing strings under canonical equivalence (not affected?)
- concatenating strings while preserving a given NF
- computing the canonical closure for a string

There are other APIs where we do have options-carrying service objects, like
collation. _If_ we decide to put this into ICU, my current guess is that we
will start with just a couple of core API functions.

Thanks,
markus
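----

A rough sketch of the (Ideographic && hasCanonicalDecomposition) computation
from the side question above, as an ICU4C UnicodeSet pattern. It assumes the
Ideographic and Decomposition_Type properties are available in UnicodeSet
pattern syntax; the function name is hypothetical, and the result is only an
approximation of "CJK compatibility characters", which is exactly what the
side question asks about:

    #include <unicode/uniset.h>
    #include <unicode/unistr.h>
    #include <unicode/utypes.h>

    // Approximation only: ideographs that have a canonical decomposition,
    // i.e. roughly the CJK compatibility ideographs.
    icu::UnicodeSet makeCjkCompatApproximation(UErrorCode &status) {
        icu::UnicodeString pattern(
            "[[:Ideographic:]&[:Decomposition_Type=Canonical:]]", -1, US_INV);
        return icu::UnicodeSet(pattern, status);
    }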