Re: Rationale wanted for Unicode identifier rules

From: Kenneth Whistler (kenw@sybase.com)
Date: Wed Mar 01 2000 - 15:28:08 EST


Addison Phillips noted:

> The only things off-the-top-of-my-head that I can think of here
> is that we might want to prevent certain "equivalent"
> characters or compatibility characters from being used in identifiers.
> In other words, if you pass the code text through a normalization
> the identifiers should all be legal.
>
> The point here is that the use of combining characters versus
> precomposed characters should not result in *separate* identifiers:
> if it looks the same on the screen it should be the same to the
> compiler. This implies normalizing the text as a precondition to
> lexing and depending on which normalization form you choose the
> punctuation and other characters could be normalized into illegal
> sequences... so not everything above U+00A0 is legal.

This is a good point. Normalization will be required for lexers (either
early or late). And it would be inadvisable to allow identifier syntax
to run afoul of normalization, in cases where compatibility decompositions
could result in illegal sequences.
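
As a rough illustration of that failure mode, here is a minimal Python sketch using the standard unicodedata module. The helper name survives_normalization and the sample identifier are hypothetical, chosen only for this example. U+00A8 DIAERESIS is a character that a naive "everything above U+00A0" rule would accept, yet its compatibility decomposition is SPACE followed by U+0308.

```python
import unicodedata


def survives_normalization(identifier: str, form: str) -> bool:
    """Return True if normalizing does not introduce whitespace.

    Hypothetical helper: whitespace is used here as a stand-in for
    "characters that would end an identifier token".
    """
    normalized = unicodedata.normalize(form, identifier)
    return not any(ch.isspace() for ch in normalized)


# Hypothetical identifier containing U+00A8 DIAERESIS.
name = "x\u00a8y"

# Canonical normalization leaves U+00A8 alone.
print(survives_normalization(name, "NFC"))

# Compatibility normalization expands U+00A8 to SPACE + U+0308,
# which splits the would-be identifier in two.
print(survives_normalization(name, "NFKC"))
```

Under Form C the identifier passes through unchanged; under Form KC it acquires an embedded space, which is the kind of illegal sequence described above.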

In my opinion, the best results are obtained by choosing:

  A. Identifier syntax along the lines described in Unicode 3.0.

  B. Normalization Form C or Form D (no compatibility decompositions).

This combination of choices keeps many compatibility characters (of the
miscellaneous CJK symbol type, for example) out of identifiers, and keeps
*all* compatibility decompositions out of
the normalization. Since only canonical decomposition (and possibly
canonical recomposition, for Form C) is being done, I know of no
instances where the identifier syntax would fail to stay closed under
normalization.
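
To make the canonical-only choice concrete, here is a small Python sketch, again using the standard unicodedata module. The function identifier_key and the sample strings are assumptions made for illustration, not part of any particular lexer. It shows precomposed and combining spellings collapsing to one identifier under Form C or Form D, while a CJK compatibility symbol is left untouched by Form C.

```python
import unicodedata


def identifier_key(identifier: str, form: str = "NFC") -> str:
    """Hypothetical lexer step: map an identifier to its normalized
    spelling before it is compared or entered in a symbol table."""
    return unicodedata.normalize(form, identifier)


precomposed = "caf\u00e9"   # U+00E9 LATIN SMALL LETTER E WITH ACUTE
combining = "cafe\u0301"    # "e" followed by U+0301 COMBINING ACUTE ACCENT

# The raw code point sequences differ...
assert precomposed != combining

# ...but either canonical form maps both spellings to one identifier.
assert identifier_key(precomposed, "NFC") == identifier_key(combining, "NFC")
assert identifier_key(precomposed, "NFD") == identifier_key(combining, "NFD")

# Form C applies no compatibility decomposition, so U+3392 SQUARE MHZ
# is not rewritten; Form KC would expand it to the letters "MHz".
assert unicodedata.normalize("NFC", "\u3392") == "\u3392"
assert unicodedata.normalize("NFKC", "\u3392") == "MHz"
```

Whether the lexer normalizes early (on the whole source text) or late (per identifier, as sketched here) is a separate design decision; the comparison result is the same either way.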

--Ken


