L2/00-191
From: Asmus Freytag
Date: June 20, 2000

Normalization and case folding for identifier matching

At 09:32 AM 6/17/00 -0800, Mark Davis wrote:
>My view is that NFKC is generally appropriate for cases where identifiers
>are case-insensitive, but otherwise reasonable people may disagree with me

The issue with the 'K' forms of normalization is twofold:

1) The set of compatibility mappings in Unicode 3.0 has 16 different sub-types, reflecting a wide variety of relations between characters and their 'compatibility equivalents'. Because of this wide range, it is harder for implementers to understand the consequences of applying forms K than, say, the consequences of case folding.

2) Some sub-types of compatibility mappings appear consistent in Version 3.0, but will look screwy once the imminent extensions are taken into account. The existing characters for mathematical variables would be folded, but the characters to be added would not. Black Letter H would be folded, but Fraktur D would not.

However, there are some sub-types of compatibility mappings for which Mark's oft-repeated "they are just formatting differences" would be quite valid (half-width/full-width and no-break come to mind).

There are additional sub-types that have 'loss-less' compatibility mappings, and therefore are best folded (I like to think of these as 'near canonical equivalents'). I'm of course referring to the initial/medial/final/isolated Arabic letter variants. One could argue that the mappings belong here as well.

The correct approach then would be to suggest the use of a different normalization form, one that makes exceptions for some of the more problematic sub-types of compatibility mappings. I like to call this form "KR" for "Kompatibility with Restraint".

I'm not sure whether we can fix the existing forms K. I understand that the *canonical* form C has been endorsed by the W3C and therefore needs to adhere to the stability guarantee that was made at the time.
I am not aware that any such external normative reference to forms K exists.

However, nothing prevents UTC from doing the right thing: defining forms KR, if necessary as new normalization forms, and ceasing to endorse or recommend the problematic forms K in their existing blanket form.

Specifically:

Forms KR would include these compatibility sub-types:

Forms KR would exclude these compatibility sub-types:

(*) see footnote

The sub-type that is the 'grab-bag' of characters with compatibility relations that are not further specified, and in some cases even questionable (U+2107), would need to be analyzed once, in a case-by-case approach. Some examples:

Roman Numerals: KR
Parenthesized: KR
CJK and Radicals compats: KR
Dotted Alphanumerics: probably KR
Ligatures: probably KR
Telegraph symbols: probably KR
Euler Constant: not-KR
Alef Symbol, etc.: not-KR
Spacing accents (mapped to SP + combining accents): ??
etc, etc.

A./

(*) I thought about this one for some time. Dropping the circle, i.e. mapping (20) to 20 as forms K do, can cause the suddenly 'bare' numbers or letters to coalesce with adjacent words or numbers. That would be truly counterintuitive to the user and is therefore best avoided. This issue does not apply to the parenthesized composites.
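
To make the idea of a restrained form concrete, here is a minimal Python sketch of a tag-selective fold. It is not a definition of forms KR. The names kr_normalize and RESTRAINED_TAGS are hypothetical, and the set of folded tags is an assumption that covers only the sub-types described above as safe to fold (half-width/full-width, no-break, and the Arabic positional variants). The sketch relies on unicodedata.decomposition() from the Python standard library, which reports the compatibility tag of a character's mapping, and it reflects whatever Unicode version the Python build ships with, not Unicode 3.0.

```python
import unicodedata

# Assumption: only the sub-types the message calls safe to fold are listed.
# <font>, <circle> and the unspecified <compat> grab-bag are left alone here.
RESTRAINED_TAGS = frozenset({
    "<wide>", "<narrow>", "<noBreak>",
    "<initial>", "<medial>", "<final>", "<isolated>",
})


def _expand(ch, folded_tags):
    """Replace ch by its compatibility mapping only if its tag is selected."""
    decomp = unicodedata.decomposition(ch)
    if not decomp.startswith("<"):
        # No mapping at all, or a canonical one: leave it for NFD/NFC.
        return ch
    tag, *codes = decomp.split()
    if tag not in folded_tags:
        return ch
    # Mappings can nest, so expand the result recursively.
    return "".join(_expand(chr(int(c, 16)), folded_tags) for c in codes)


def kr_normalize(text, folded_tags=RESTRAINED_TAGS):
    """Canonical normalization plus folding of the selected sub-types only."""
    decomposed = unicodedata.normalize("NFD", text)
    expanded = "".join(_expand(ch, folded_tags) for ch in decomposed)
    return unicodedata.normalize("NFC", expanded)


if __name__ == "__main__":
    print(kr_normalize("\uFF21") == "A")       # FULLWIDTH LATIN CAPITAL LETTER A is folded
    print(kr_normalize("\u210C") == "\u210C")  # BLACK-LETTER CAPITAL H is kept
    print(kr_normalize("\u2473") == "\u2473")  # CIRCLED NUMBER TWENTY is kept
```

Passing a different tag set changes the behavior, so the case-by-case decisions discussed above would still have to be made per character for the grab-bag sub-type; a tag-level switch like this one cannot express them.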