Title: Questions about Tailored Normalization
Source: Markus Scherer
Date: 2003/02/12

These are questions that came up while designing a working prototype of
"tailored normalization" as described in public issue 7. In addition to what
I sent to unicore on 2003-feb-11 (attached below) I would like to mention
another issue: test data.

NormalizationTest.txt is a large file with conformance test data. It proved
to be extremely useful and important for verifying efficient implementations
of Unicode normalization, especially for edge cases. Adding even a small set
of predefined tailorings should be accompanied by specific test data, which
will have to be large to be useful.

For arbitrary tailorings, if such are to be allowed, the UAX #15 sample code
should be extended to allow for such tailorings as well (which is in any case
a good idea). Extended sample code makes it possible to cross-check results
against production implementations.

-------- Original Message --------
Subject: questions about tailored normalization
Date: Tue, 11 Feb 2003 10:20:20 -0800

Hello,

I will try to prototype tailored normalization for evaluation of public
issue 7 (http://www.unicode.org/review/). Thinking about how this would fit
(or not) into ICU, I have some practical questions.

My current thinking is to take the ICU normalization code and add a
UnicodeSet pointer parameter to internal functions. The UnicodeSet pointer
would be NULL for untailored normalization, or else point to the set of code
points that are to be excluded from decomposition.

Questions:

First, the semantics. With a naive/intuitive implementation as mentioned
above, I believe that NF*C (=NFC or NFKC) with decomposition exclusions would
in fact still be a Normalization Form, i.e., it would produce unique-form
strings regardless of input.

However, NF*D with decomposition exclusions as above would not be an NF; it
would not produce unique strings from canonically equivalent input. If I
exclude Š from decomposition and perform NFD%DX[Š] on the strings <Š> and
<S, combining caron>, then neither string is modified. They remain different
although they are canonically equivalent.

In order to make NF*D%DX true normalization forms, they would have to
actually do _some_ composition work; in the example, they would have to
compose S + combining caron to Š. This is a significant, non-trivial
modification of an NF*D implementation. (I suppose that a slow implementation
could apply NF*C first, then decompose with exclusions. A faster
implementation would have to run a composition step where the inverse of the
decomposition exclusion set becomes a composition exclusion set... not sure
either would work...)

So the first question is really whether NF*D are to be tailored as well, and
if so, whether NF*D with decomposition exclusions are intended to be
Normalization Forms.

Of course, one way out of this question is to forbid decomposition exclusion
sets with code points that are composition targets (-> allow only code points
without decompositions and ones that have the Full_Composition_Exclusion
property [which is true for singleton decompositions]). For example, CJK
compatibility ideographs would be allowed, but Hangul syllables would not be.

----

In terms of API, as a developer I will eventually need some idea of the
variability of the tailoring. Is it expected that there will be very few
predefined decomposition exclusion sets? For example, if there will be no
more than, say, 4 sets that can also be combined with each other, I could use
a small bit set to specify them, and efficiently cache them and all unions of
them: one bit for Hangul, one bit for CJK compatibility characters, ...
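As a purely illustrative sketch of this bit-set idea in C++ against ICU4C
(the flag names, the cache, and buildPredefinedSet() below are hypothetical,
not existing or proposed ICU API):

    #include <unicode/uniset.h>
    #include <unicode/utypes.h>
    #include <map>

    // Hypothetical flag bits for a handful of predefined decomposition
    // exclusion sets.
    enum {
        EXCLUDE_HANGUL     = 1,  // keep precomposed Hangul syllables
        EXCLUDE_CJK_COMPAT = 2   // keep CJK compatibility ideographs
        // ...room for a few more predefined sets
    };

    // Placeholder: the real contents of each predefined set are exactly
    // what the tailoring definitions would have to specify.
    static icu::UnicodeSet buildPredefinedSet(int32_t /*bit*/) {
        return icu::UnicodeSet();
    }

    // Cache keyed by the flag combination, so each union is computed only
    // once. (Not thread-safe as written; a real implementation would have
    // to synchronize access.)
    static std::map<int32_t, icu::UnicodeSet> gExclusionCache;

    static const icu::UnicodeSet &getExclusionSet(int32_t flags) {
        std::map<int32_t, icu::UnicodeSet>::iterator it =
            gExclusionCache.find(flags);
        if (it == gExclusionCache.end()) {
            icu::UnicodeSet u;
            for (int32_t bit = 1; bit <= flags; bit <<= 1) {
                if (flags & bit) {
                    u.addAll(buildPredefinedSet(bit));  // union of selected sets
                }
            }
            it = gExclusionCache.insert(std::make_pair(flags, u)).first;
            it->second.freeze();  // frozen sets are immutable and cheap to query
        }
        return it->second;
    }

A caller would then pass a combination like
EXCLUDE_HANGUL | EXCLUDE_CJK_COMPAT, and the normalization code would look up
the cached union of the corresponding exclusion sets.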
Side-question: Is there, or will there be, a property that lists CJK
compatibility characters? Can I compute the set by
(Ideographic && hasCanonicalDecomposition)? (A rough sketch of this
computation is appended at the end of this message.)

If there will be many decomposition exclusion sets, or if this is expected to
be up to the user, then it might be better to provide an API that takes a set
parameter directly.

----

I would like to mention a concern that I have with tailored normalization:
API bloat. There are some not-so-obvious functions that use normalization in
one way or another, and many of them do not currently have any parameters
that could be co-opted for this. I guess that other libraries would have this
problem, too.

In ICU, many functions are not methods on any object, but are simple,
stateless C functions or static C++/Java methods. This has worked so far
because there was no state (or fancy options) to be kept. Examples of such
functions, other than the core normalize():

- normalization quick check
- comparing strings under canonical equivalence (not affected?)
- concatenating strings while preserving a given NF
- computing the canonical closure for a string

There are other APIs where we do have options-carrying service objects, like
collation. _If_ we decide to put this into ICU, my current guess is that we
will start with just a couple of core API functions.

Thanks,
markus
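----

A rough sketch of the (Ideographic && hasCanonicalDecomposition) computation
from the side question above, as an ICU4C UnicodeSet pattern. It assumes the
Ideographic and Decomposition_Type properties are available in UnicodeSet
pattern syntax; the function name is hypothetical, and the result is only an
approximation of "CJK compatibility characters", which is exactly what the
side question asks about:

    #include <unicode/uniset.h>
    #include <unicode/unistr.h>
    #include <unicode/utypes.h>

    // Approximation only: ideographs that have a canonical decomposition,
    // i.e. roughly the CJK compatibility ideographs.
    icu::UnicodeSet makeCjkCompatApproximation(UErrorCode &status) {
        icu::UnicodeString pattern(
            "[[:Ideographic:]&[:Decomposition_Type=Canonical:]]", -1, US_INV);
        return icu::UnicodeSet(pattern, status);
    }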