L2/00-191

From:  Asmus Freytag
Date:  June 20, 2000

Normalization and case folding for identifier matching

At 09:32 AM 6/17/00 -0800, Mark Davis wrote:
>My view is that NFKC is generally appropriate for cases where identifiers
>are case-insensitive, but otherwise reasonable people may disagree with me


The issue with the 'K' forms of normalization is twofold:

1) the set of compatibility mappings in Unicode 3.0 has 16 different
    sub-types, reflecting a wide variety of relations between characters and
    their 'compatibility equivalents'. Because of this wide range, it's
    harder for implementers to understand the consequences of applying
    forms K than those of, say, case folding.

2) some sub-types of compatibility mappings appear consistent in Version 3.0,
    but will look screwy once the imminent extensions are taken into account.
    The existing characters for mathematical variables would be folded, but
    the characters to be added would not. Black Letter H would be folded, but
    Fraktur D would not.
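
To make point 1 concrete, here is a minimal sketch that lists the
compatibility sub-types (tags) and how many characters carry each. It
assumes Python's standard unicodedata module, which reflects whatever
Unicode version ships with the interpreter rather than Unicode 3.0, so the
per-tag counts will differ from those of 3.0. The name compat_tag_counts is
made up for illustration. Note that the data file spells the no-break tag
as <noBreak>.

```python
import sys
import unicodedata
from collections import Counter

def compat_tag_counts():
    """Count characters per compatibility tag in this Python's Unicode data."""
    counts = Counter()
    for cp in range(sys.maxunicode + 1):
        decomp = unicodedata.decomposition(chr(cp))
        # Compatibility mappings start with a tag such as "<wide>";
        # canonical mappings carry no tag and are skipped here.
        if decomp.startswith("<"):
            tag = decomp.split(" ", 1)[0]
            counts[tag] += 1
    return counts

counts = compat_tag_counts()
for tag, n in sorted(counts.items()):
    print(f"{tag:12} {n}")
```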

However, there are some sub-types of compatibility mappings for which
Mark's oft-repeated "they are just formatting differences" would be quite
valid (half-width/full-width and no-break come to mind).

There are additional sub-types that have 'loss-less' compatibility
mappings, and therefore are best folded (I like to think of these as 'near
canonical equivalents'). I'm of course referring to the initial/medial/
final/isolated Arabic letter variants. One could argue that the <fraction>
mappings belong here as well.
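
As an illustration of the sub-types named in the last two paragraphs, the
following sketch prints the tag and mapping of one sample character each
for <wide>, <narrow>, <noBreak>, <initial> and <fraction>. It assumes
Python's unicodedata module; the helper name describe and the choice of
sample characters are mine, not part of any standard.

```python
import unicodedata

def describe(ch):
    """Return (tag, mapped string) for a character with a compatibility mapping."""
    tag, *mapping = unicodedata.decomposition(ch).split()
    target = "".join(chr(int(cp, 16)) for cp in mapping)
    return tag, target

# Fullwidth A, halfwidth katakana KA, no-break space,
# Arabic seen (initial form), vulgar fraction one half.
for ch in "\uFF21\uFF76\u00A0\uFEB3\u00BD":
    tag, target = describe(ch)
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {tag} -> {target!r}")
```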

The correct approach then would be to suggest the use of a different
normalization form, one that makes exceptions for some of the more
problematic sub-types of compatibility mappings. I like to call this form
"KR" for "Kompatibility with Restraint".

I'm not sure whether we can fix the existing forms K. I understand that the
*canonical* form C has been endorsed by the W3C and therefore needs to
adhere to the stability guarantee that was made at the time. I am not aware
that any such external normative reference to forms K exists. However,
nothing prevents UTC from doing the right thing: defining forms KR, if
necessary as new normalization forms, and ceasing to endorse or recommend
the problematic forms K in their existing blanket form.

Specifically:

Forms KR would include these compatibility sub-types:

<initial>
<medial>
<final>
<isolated>
<no-break>
<narrow>
<wide>
<vertical>
<small>
<square>
<fraction>


Forms KR would exclude these compatibility sub-types:
<font>
<super>
<sub>
<circle> (*) see footnote
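
The include/exclude split above can be sketched as follows. This is only
an illustration under stated assumptions, not a specification: it uses
Python's unicodedata module (and therefore its Unicode version), the names
KR_TAGS, kr_expand and nfkr are hypothetical, and the <compat> sub-type is
simply left unfolded because its treatment is not decided by the lists
above.

```python
import unicodedata

# Sub-types proposed for folding, spelled as in the Unicode data file
# (<noBreak> there corresponds to <no-break> in the list above).
KR_TAGS = {
    "<initial>", "<medial>", "<final>", "<isolated>", "<noBreak>",
    "<narrow>", "<wide>", "<vertical>", "<small>", "<square>", "<fraction>",
}

def kr_expand(ch):
    """Expand ch through canonical mappings and through KR_TAGS mappings only."""
    decomp = unicodedata.decomposition(ch)
    if not decomp:
        return ch  # no mapping recorded for this character
    parts = decomp.split()
    if parts[0].startswith("<"):
        if parts[0] not in KR_TAGS:
            return ch  # <font>, <super>, <sub>, <circle>, <compat>: kept as-is
        parts = parts[1:]
    return "".join(kr_expand(chr(int(cp, 16))) for cp in parts)

def nfkr(text):
    """Hypothetical composed 'KR' form: restricted expansion, then NFC."""
    return unicodedata.normalize("NFC", "".join(kr_expand(c) for c in text))

print(nfkr("\uFF21\u3392"))               # fullwidth A and SQUARE MHZ are folded
print(nfkr("\u2473") == "\u2473")         # circled twenty is left alone
```

A decomposed variant would end with NFD instead of NFC; the restricted
expansion step would stay the same.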

The <compat> sub-type, being the 'grab-bag' of characters
with compatibility relations that are not further
specified, and in some cases even questionable (U+2107), would need to be
analyzed once, case by case. Some examples:

Roman Numerals: KR
Parenthesized: KR
CJK and Radicals compats: KR

Dotted Alphanumerics: probably KR
Ligatures: probably KR
Telegraph symbols: probably KR

Euler Constant: not-KR
Alef Symbol, etc.: not-KR

Spacing accents (mapped to SP + combining accents): ??

etc, etc.

A./

(*) I thought about this one for some time. Dropping the circle, i.e.
mapping (20) to 20 as forms K do, can cause the suddenly 'bare' numbers
or letters to coalesce with adjacent words or numbers. That would be truly
counterintuitive to the user and is therefore best avoided. This issue
does not apply to the parenthesized composites.
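
The coalescing effect described in the footnote can be reproduced with a
short sketch, assuming Python's unicodedata module. The sample string is
invented for illustration: it places CIRCLED NUMBER TWENTY (U+2473), and
then PARENTHESIZED NUMBER TWENTY (U+2487), directly after the digit 1.

```python
import unicodedata

circled = "note1\u2473"        # "note1" followed by CIRCLED NUMBER TWENTY
parenthesized = "note1\u2487"  # "note1" followed by PARENTHESIZED NUMBER TWENTY

folded_circled = unicodedata.normalize("NFKC", circled)
folded_parenthesized = unicodedata.normalize("NFKC", parenthesized)

print(folded_circled)        # note120    (the bare 20 merges with the 1)
print(folded_parenthesized)  # note1(20)  (the parentheses keep it apart)
```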