Draft: Mark Davis
Date: August 28, 2000
The Unicode Technical Committee discussed the issue of DNS names at its meeting of August 8-11, 2000, and had some recommendations. I will try to summarize the points brought out during the discussion. If you have any questions, please let me know.
1. The committee is in favor of the canonicalization model of:
Filter - Fold - NFKC - Serialize
· The filter step would reject certain characters, thus causing the name to be illegal.
· The fold step would fold characters together (e.g. case mapping), or fold characters away (e.g. delete a character by folding it to a null string).
· The NFKC step puts the string into normalized form, as defined by UTR #15.
· The Serialize step produces a reversible mapping to a sequence of bytes.
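The four steps above can be sketched in Python. This is only an illustration of the pipeline's shape, not a proposed profile: the rejected-character set here is a hypothetical placeholder, and UTF-8 stands in for whatever Serialize mapping is ultimately chosen.

```python
import unicodedata

# Hypothetical filter set for illustration only; a real profile would
# derive the rejected characters from Unicode properties.
REJECTED = {'\u2044'}  # U+2044 FRACTION SLASH, the memo's example

def canonicalize(name: str) -> bytes:
    # Filter: reject certain characters, making the name illegal.
    if any(c in REJECTED for c in name):
        raise ValueError('name contains a rejected character')
    # Fold: case folding; other folds could delete characters by
    # mapping them to the empty string.
    folded = name.casefold()
    # NFKC: normalize as defined by UTR #15.
    normalized = unicodedata.normalize('NFKC', folded)
    # Serialize: any reversible byte mapping; UTF-8 shown here.
    return normalized.encode('utf-8')
```

For example, `canonicalize('Example')` yields `b'example'`.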
2. Filter Closure
The Filter must be closed under folding and normalization. That is, suppose that
NFKC( Fold( Filter( x ) ) ) contains y. Then, if ( isFiltered( y ) == REJECT ), then
( isFiltered( x ) == REJECT ).
In the process of developing the filter, you start with an original filter, and programmatically add all characters that would be canonicalized into characters that would be rejected by the original filter. This is, of course, not done at runtime; it is just a formal constraint on the Filter.
Thus if the original filter rejects U+2044 FRACTION SLASH, then the closure of that filter must reject U+00BC VULGAR FRACTION ONE QUARTER (since the latter is canonicalized to a string that contains U+2044).
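The closure computation described above can be sketched as a brute-force scan (a minimal illustration, not a specification; it covers only the BMP, where a full implementation would scan all planes):

```python
import unicodedata

def close_filter(rejected, fold):
    """Return the characters that must be added to `rejected` so the
    filter is closed under folding and NFKC normalization."""
    extra = set()
    for cp in range(0x10000):  # BMP only, for brevity
        ch = chr(cp)
        if ch in rejected:
            continue
        # If any character of the canonicalized form is rejected,
        # the original character must be rejected too.
        if any(c in rejected
               for c in unicodedata.normalize('NFKC', fold(ch))):
            extra.add(ch)
    return extra

# The memo's example: rejecting U+2044 FRACTION SLASH forces the closed
# filter to also reject U+00BC VULGAR FRACTION ONE QUARTER, whose NFKC
# form contains U+2044.
extra = close_filter({'\u2044'}, str.casefold)
```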
3. For Fold, the candidates include:
Case: use case folding
Dashes: map characters with General Category Pd to U+002D
Spaces: map characters with General Category Zs to U+0020*
* Only necessary if these are not filtered.
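The three candidate folds can be combined in one per-character function, using the General Category as suggested above (a sketch only; it assumes spaces and dashes are not filtered out earlier):

```python
import unicodedata

def fold(ch: str) -> str:
    cat = unicodedata.category(ch)
    if cat == 'Pd':          # Dashes: any dash to U+002D HYPHEN-MINUS
        return '\u002d'
    if cat == 'Zs':          # Spaces: any space to U+0020 SPACE
        return '\u0020'     # (only needed if spaces are not filtered)
    return ch.casefold()     # Case: use case folding
```

For example, U+2013 EN DASH folds to "-", U+00A0 NO-BREAK SPACE folds to " ", and "A" folds to "a".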
(The UTC did not discuss whether it would be advisable to fold away Hebrew accents or point marks: characters such as
U+0591 HEBREW ACCENT ETNAHTA
U+05C4 HEBREW MARK UPPER DOT.)
The UTC did not discuss the Serialize phase in any detail. In general, the consortium does not favor new transformations (beyond UTF-8, UTF-16, and UTF-32). However, it recognizes that there may be additional constraints in particular environments such as for DNS names that warrant using a novel transformation, such as a base-36 approach.
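Purely to illustrate what "a base-36 approach" might look like (this is my own sketch, not anything the UTC endorsed): since traditional DNS labels allow only case-insensitive letters, digits, and hyphen, the serialized bytes could be re-expressed in the 36 symbols [a-z0-9]. One reversible scheme treats the bytes as a big integer, with a leading 0x01 byte so leading zeros survive the round trip:

```python
ALPHABET = '0123456789abcdefghijklmnopqrstuvwxyz'

def to_base36(data: bytes) -> str:
    # Prefix 0x01 so leading zero bytes survive the round trip.
    n = int.from_bytes(b'\x01' + data, 'big')
    digits = ''
    while n:
        n, r = divmod(n, 36)
        digits = ALPHABET[r] + digits
    return digits

def from_base36(s: str) -> bytes:
    n = 0
    for c in s:
        n = n * 36 + ALPHABET.index(c)
    b = n.to_bytes((n.bit_length() + 7) // 8, 'big')
    return b[1:]  # strip the 0x01 prefix
```

The mapping is reversible: `from_base36(to_base36(b))` returns `b` for any byte string.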
Based on the extensive internationalization experience of its membership, the technical committee believes strongly that having either language-dependent canonicalization or allowing multiple character encodings would be disastrous. The committee recognizes that the canonicalization may not be optimal for all languages, but
(a) the benefits of uniformity far outweigh the drawbacks in a few cases,
(b) there are work-arounds in many cases,
(c) users are already used to restrictions on DNS names that in most cases represent far more problems for legibility (e.g. the lack of space).
In the case of (b), for example, French users may be accustomed to seeing uppercase letters written either with or without accents. Yet in other languages the distinction between accented and unaccented letters must be maintained; folding accents away would be like folding all vowels to E in all English words. An acceptable work-around is to register both the name with all accents and the name with none.
A subcommittee was formed to look at this issue in more detail; in particular, to make recommendations for the Filter (and perhaps the Fold) step. The committee agreed that the inclusion of characters in these steps must be based on principles, so that as new characters are added to the standard, the principles can be applied to those new characters.
1. The whole canonicalization process, as outlined above, destroys information. Case folding (or folding dashes) destroys information: there is no way to recover the original case once folded. Filtering also "destroys" information in a sense, by disallowing certain characters. Disallowing spaces, for example, very much alters the allowable text.
2. The canonicalization process is not designed to be applied to arbitrary text. It is designed to be applied to identifiers, or similarly constrained environments where not all characters are allowed.
Superscript 2 is not a problem, because it would be filtered out before it is ever normalized (see earlier messages about NFKC with identifiers).
3. For arbitrary text — not identifiers — NFC is the correct normalization to use.
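The distinction between NFC for arbitrary text and NFKC for identifiers can be seen directly with the superscript-2 example:

```python
import unicodedata

# NFKC applies compatibility decompositions: superscript 2 becomes 2.
# In an identifier context, U+00B2 would already have been filtered
# out before normalization ever runs.
assert unicodedata.normalize('NFKC', '\u00b2') == '2'

# NFC leaves compatibility characters alone, so arbitrary text is safe.
assert unicodedata.normalize('NFC', '\u00b2') == '\u00b2'

# NFC still composes canonical equivalents: e + combining acute -> e-acute.
assert unicodedata.normalize('NFC', 'e\u0301') == '\u00e9'
```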