L2/03-058

Re: Variant Normalization Forms
From: Mark Davis
Date: 2003-02-19

This contains a draft proposal, as per Action 93-A61.

[93-A61] Action Item for Mark Davis, Editorial Committee: Prepare proposed update text for addition to Unicode Standard Annex #15 Unicode Normalization Forms to address the need for tailorings of normalization, for the next UTC meeting.

This turned out to be fairly tricky, when it came to following out all the ramifications in production software. I have to thank Markus Scherer, who did much of the heavy lifting: prototyping and testing both functional consistency and performance, and finding odd edge cases.

I am still uneasy about whether the whole concept of variant normalization forms is a good idea or not, but at least we have a coherent proposal to discuss at the meeting.

Proposal

  1. Insert the section below (Variant Normalization Forms) into UAX #15 at an appropriate point.
  2. Change the text in Annex 8: Detecting Normalization Forms so that it reflects the changes that need to be made for QuickCheck if it is to be used with Variant Normalization Forms.
  3. Modify all UAX #15 Conformance Clauses (and perhaps corresponding changes in Chapter 3) to change "Normalization Form" into "Normalization Form or Variant Normalization Form".

Variant Normalization Forms

The Unicode Technical Committee recognizes that some implementations may need to use variant normalization forms, ones that do not match the standard forms in some way. However, there is a significant danger that inconsistent normalization forms will lead to processing incompatibilities and security flaws. Thus only a small number of such variant normalization forms are defined, and their definition is carefully constrained. The two defined Variant Normalization Forms (VNF) in this version of the Unicode Standard are:

Name      Description
VNFC-CI   Identical to NFC, except excluding the decomposition of the CJK COMPATIBILITY IDEOGRAPH characters: F900..FA6A, 2F800..2FA1D
VNFD-CI   Identical to NFD, except excluding the decomposition of the CJK COMPATIBILITY IDEOGRAPH characters: F900..FA6A, 2F800..2FA1D

While the above are variants of NFC and NFD, there may in the future be variants of NFKC and NFKD.

Note: The range of compatibility characters above includes some characters that do not have decomposition mappings. This is simply to make the ranges more comprehensible; including such characters has no effect since they are already automatically excluded.
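As an illustration only, the two forms defined above might be approximated on top of an existing normalizer as in the following Python sketch. The helper name variant_normalize is hypothetical, the ranges are copied from the table above, and the approach (leaving runs of excluded characters untouched and normalizing everything between them) is an assumption that relies on the excluded characters being starters with singleton decompositions, so they never take part in composition or reordering. The standard library's unicodedata module uses whatever Unicode version ships with the interpreter, which may differ from the version this proposal targets.

```python
import re
import unicodedata

# Exclusion set for VNFC-CI / VNFD-CI, using the ranges given in the table:
# F900..FA6A and 2F800..2FA1D. The capturing group makes re.split keep the
# excluded runs in its result.
_EXCLUDED_RUN = re.compile("([\uF900-\uFA6A\U0002F800-\U0002FA1D]+)")


def variant_normalize(form, text):
    """Apply NFC or NFD to text, leaving excluded characters undecomposed.

    form is "NFC" (giving a VNFC-CI sketch) or "NFD" (giving VNFD-CI).
    This is a hypothetical helper, not part of any library.
    """
    parts = _EXCLUDED_RUN.split(text)
    # With one capturing group, odd-indexed parts are the excluded runs and
    # even-indexed parts are the text between them.
    return "".join(
        part if index % 2 else unicodedata.normalize(form, part)
        for index, part in enumerate(parts)
    )
```

With this sketch, variant_normalize("NFD", s) decomposes U+00C4 as NFD would, but leaves U+F900 in place where NFD would replace it with U+8C48.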

Constraints. The constraints on these (and any possible future) VNFs are that they are formed by excluding a standard, consistent, specified set of characters from decomposition and composition steps in Section 5 Specification of UAX #15 and in Chapter 3 of The Unicode Standard. This set is called the exclusion set. The following are the conditions on any Variant Normalization Form:

  1. Exclusion. A Variant Normalization Form V is defined by the combination of a Normalization Form NF and an exclusion set ES:

  2. Consistency. Variant Normalization Forms are defined in pairs, one for composition and one for decomposition. Each pair is consistent with the other in the following way. If VC is a composition Variant Normalization Form and VD is the corresponding decomposition Variant Normalization Form, then for all strings x and y:

    This implies that if a character A is excluded from being decomposed, then it is also excluded from being the result of any composition (and vice versa). The exclusion set is the same for both composition and decomposition.

  3. Closure. If a character c is in the exclusion set for a given Variant Normalization Form, then any character d must also be in the exclusion set whenever c and d satisfy the following conditions:
    1. c is the result of an NFC composition; i.e., there are some characters f and g such that c = NFC(fg)
    2. d is the result of an NFC composition from a leading c; i.e., there is some character e such that d = NFC(ce)

    Thus if U+00C4 (Ä) LATIN CAPITAL LETTER A WITH DIAERESIS is excluded, then U+01DE (Ǟ) LATIN CAPITAL LETTER A WITH DIAERESIS AND MACRON must also be excluded.
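The closure condition can be checked mechanically. The following Python sketch is one possible reading of it, not a definitive implementation: the function names are hypothetical, the condition is applied repeatedly until no more characters are added (an assumption about how the closure is meant to propagate), and Hangul syllables are not covered because the standard library's unicodedata.decomposition does not report their algorithmic decompositions.

```python
import sys
import unicodedata


def _canonical_pairs():
    """Yield (composite, first, second) for canonical pairs that NFC recomposes."""
    for cp in range(sys.maxunicode + 1):
        fields = unicodedata.decomposition(chr(cp)).split()
        # Keep only two-character canonical mappings; compatibility
        # mappings start with a tag such as <compat>.
        if len(fields) != 2 or fields[0].startswith("<"):
            continue
        first, second = (chr(int(field, 16)) for field in fields)
        # Skip pairs that NFC does not recompose (composition exclusions).
        if unicodedata.normalize("NFC", first + second) == chr(cp):
            yield chr(cp), first, second


def close_exclusion_set(exclusions):
    """Return exclusions extended so that it satisfies the closure condition."""
    pairs = list(_canonical_pairs())
    # Characters c for which c = NFC(fg) for some characters f and g.
    composed = {composite for composite, _, _ in pairs}
    closed = set(exclusions)
    changed = True
    while changed:
        changed = False
        for d, c, _e in pairs:
            # d = NFC(ce) with c excluded and c itself a composition result.
            if c in closed and c in composed and d not in closed:
                closed.add(d)
                changed = True
    return closed
```

Under these assumptions, calling close_exclusion_set({"\u00C4"}) adds U+01DE to the set, matching the example above.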