Re: Programming language identifier normalization/casing

From: Kenneth Whistler (kenw@sybase.com)
Date: Tue May 29 2001 - 17:50:06 EDT


Achim Ruopp asked:

> Unicode 3.1 Technical report #15, Annex 7
> (http://www.unicode.org/unicode/reports/tr15/#Programming_Language_Identifiers)
> contains the following remark:
> "Generally if the programming language has case-sensitive identifiers
> then Normalization Form C may be used, while if the programming language
> has case-insensitive identifiers then Normalization Form KC may be more
> appropriate."
>
> If I'm not mistaken normalization and casing are two independent things.
> So what would be the reason to make this connection?

I think the basic concept is that case-folding is a form of folding.
Normalization is also a form of folding. However, NFC folds away
only canonical distinctions, i.e. those distinctions which are
generally not considered to constitute interpretive differences
in the characters. Case folding *does* fold away interpretive
differences, and in that sense is a *little* bit like the
compatibility folding that goes on in NFKC.
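
To make the three kinds of folding concrete, here is a minimal Python
sketch using the standard unicodedata module. The sample strings and
variable names are illustrative assumptions, not taken from TR15.

```python
import unicodedata


def nfc(s: str) -> str:
    """Apply Normalization Form C (canonical folding only)."""
    return unicodedata.normalize("NFC", s)


def nfkc(s: str) -> str:
    """Apply Normalization Form KC (canonical plus compatibility folding)."""
    return unicodedata.normalize("NFKC", s)


# Canonical distinction: precomposed e-acute vs. "e" + combining acute.
# NFC folds the two spellings together.
precomposed = "\u00e9"
decomposed = "e\u0301"
canonical_equal = nfc(precomposed) == nfc(decomposed)

# Compatibility distinction: the "fi" ligature (U+FB01) vs. plain "fi".
# NFC keeps the ligature; NFKC folds it away.
ligature = "\ufb01"
nfc_keeps_ligature = nfc(ligature) == ligature
nfkc_folds_ligature = nfkc(ligature) == "fi"

# Case distinction: neither normalization form touches case.
# Case folding is a separate operation (str.casefold in Python).
nfc_keeps_case = nfc("Foo") != nfc("foo")
nfkc_keeps_case = nfkc("Foo") != nfkc("foo")
casefold_equal = "Foo".casefold() == "foo".casefold()
```

Note that NFKC by itself does not fold case, so a case-insensitive
language needs a case-folding step whichever normalization form it picks.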

However, my personal opinion is that NFKC folds too many distinctions
for general use -- including distinctions that are contrary to
the expectations even of those who think they want to fold away
compatibility differences in characters. It is somewhat dangerous
in that regard, and is not a very good choice for folding
identifiers. If identifiers for a particular syntax need to fold
away particular distinctions (e.g. full-width versus normal ASCII
characters), they are probably better off specifying such folding
explicitly, rather than depending on NFKC to do it for them.
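
As a rough sketch of what "specifying such folding explicitly" could
look like, the Python below folds only the full-width ASCII variants
(U+FF01..U+FF5E) and otherwise applies NFC. The helper name
fold_fullwidth and the choice of superscript two as the counterexample
are assumptions made for illustration.

```python
import unicodedata

# The full-width ASCII variants U+FF01..U+FF5E sit at a fixed offset of
# 0xFEE0 from their ASCII counterparts U+0021..U+007E.
FULLWIDTH_TO_ASCII = {cp: cp - 0xFEE0 for cp in range(0xFF01, 0xFF5F)}


def fold_fullwidth(identifier: str) -> str:
    """Hypothetical helper: NFC, then fold full-width ASCII only."""
    return unicodedata.normalize("NFC", identifier).translate(FULLWIDTH_TO_ASCII)


# Full-width "foo" folds to ASCII "foo", as intended.
fullwidth_name = "\uff46\uff4f\uff4f"
folded_name = fold_fullwidth(fullwidth_name)

# A superscript two survives the explicit fold, whereas NFKC silently
# turns it into an ordinary digit.
superscript_name = "x\u00b2"
explicit_result = fold_fullwidth(superscript_name)
nfkc_result = unicodedata.normalize("NFKC", superscript_name)
```

Here the explicit fold leaves "x" followed by superscript two distinct
from "x2", while NFKC would make the two identifiers collide.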

--Ken


