But please don’t misunderstand; I am not — and have not been — arguing against non-ASCII identifiers. We were asked whether there were any problems. These are problems (or perhaps we might call them “trade-offs”). We can debate the severity of them, and whether, and what, it’s worthwhile doing anything to mitigate any of them. What we shouldn’t do is sweep them under the carpet.
Once you go beyond ASCII (or really any small, well-known set of shapes) to a very large universe like Unicode, you will lose something and gain something.
You will gain being able to express some things more like they would be written in ordinary text.
You will lose by having identifiers that can be more ambiguous, harder to recognize / replicate / display and so on.
Where identifiers are "private", say limited to the source code of a single application, the user/author is in control and can avoid problematic cases. However, even for source code, not all identifiers are truly private. Names for classes and modules get turned into filenames, modules may be shared, etc.
Code is also shared. If you use code that ostensibly calls on a public library or module, but your identifier system allows spoofing, your use of shared code may access malicious code hiding behind lookalike names.
Unicode has dozens of character pairs that look absolutely identical by design (http://www.unicode.org/Public/security/10.0.0/intentional.txt gives a subset of these), and many more combinations that could look identical in any given font (but are not necessarily identical in every font). Many of the latter are combining sequences that are not normalized.
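To make the two cases concrete, here is a minimal Python sketch using only the standard `unicodedata` module. The sample strings and the helper name `same_after_nfc` are illustrative choices of mine, not anything taken from the Unicode data files. It shows that normalization reconciles a combining sequence with its precomposed form, but does nothing for cross-script lookalikes.

```python
import unicodedata

# Two spellings that display the same: precomposed vs. combining sequence.
precomposed = "caf\u00e9"   # 'é' as the single code point U+00E9
decomposed = "cafe\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT


def same_after_nfc(a: str, b: str) -> bool:
    """Compare two identifiers after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# Latin, Greek and Cyrillic capital A: lookalikes from three scripts.
lookalikes = ["\u0041", "\u0391", "\u0410"]

raw_equal = precomposed == decomposed                # code points differ
nfc_equal = same_after_nfc(precomposed, decomposed)  # normalization unifies them

# Count how many distinct strings remain after NFKC; normalization
# does not merge letters across scripts.
distinct_after_nfkc = len({unicodedata.normalize("NFKC", c) for c in lookalikes})
```

So normalization addresses only part of the problem; the lookalike pairs need a separate policy (restricting scripts, or a confusables check).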
For many complex scripts, not all possible orderings of code points are well-behaved. Some orderings may fail to render on certain platforms / devices, while others render fine. Sometimes, two alternative orderings will look the same.
Not paying attention to these issues will cause your identifier system to be ill-behaved whenever it "leaks" into public identifier space, particularly when your identifiers become file names or names of network resources because you want to allow sharing of libraries or modules.
---
The main point about allowing identifiers to look like words is to make them mnemonic. Non-ASCII identifiers can be more mnemonic to those that use other scripts. However, one does not need to allow the full Unicode range unfiltered in order to achieve mnemonic labels. There are many things in Unicode needed for very specialized texts and while one can always imagine some specialist delighting in writing a program where some object is spelled precisely like it is in the "real" world, there really is no need to allow such edge cases to undermine the stability and security of a "reasonably" mnemonic system for the wider body of users.
To give an example: in Arabic, you can disallow *all* combining marks, and still get a strong identifier system. In fact, it will be stronger, because many accidental similarities to letter shapes will be eliminated (and it's not necessary to devise some complex folding).
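A rule like this is cheap to check. The following is a rough Python sketch, assuming that "combining mark" means any code point whose general category is Mn, Mc or Me; the function name `has_combining_mark` and the sample word are hypothetical. A real profile would probably also normalize to NFC first, so that precomposed letters are not split into base plus mark.

```python
import unicodedata


def has_combining_mark(identifier: str) -> bool:
    """True if any code point has general category Mn, Mc or Me."""
    return any(unicodedata.category(ch).startswith("M") for ch in identifier)


# kaf, teh, alef, beh: an unvocalized Arabic word
plain = "\u0643\u062a\u0627\u0628"
# the same word with U+0650 ARABIC KASRA after the kaf
vocalized = "\u0643\u0650\u062a\u0627\u0628"

plain_ok = not has_combining_mark(plain)          # would be accepted
vocalized_ok = not has_combining_mark(vocalized)  # would be rejected
```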
However, in Arabic again, there are pairs of digits that look identical (and have the same numeric value). Allowing these into identifiers without some folding would make it impossible for users to know (without looking at the underlying bits) how to type an identifier containing one of them.
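As an illustration, the pair below is U+0661 ARABIC-INDIC DIGIT ONE and U+06F1 EXTENDED ARABIC-INDIC DIGIT ONE. The folding shown (mapping every decimal digit to its ASCII equivalent) is just one possible policy, and `fold_digits` is a hypothetical helper, not part of any standard.

```python
import unicodedata


def fold_digits(identifier: str) -> str:
    """Replace every decimal digit (category Nd) with its ASCII digit."""
    return "".join(
        str(unicodedata.decimal(ch)) if unicodedata.category(ch) == "Nd" else ch
        for ch in identifier
    )


a = "x\u0661"  # x + ARABIC-INDIC DIGIT ONE
b = "x\u06f1"  # x + EXTENDED ARABIC-INDIC DIGIT ONE

raw_equal = a == b
# NFKC leaves both digits untouched, so normalization alone does not help.
nfkc_equal = unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b)
folded_equal = fold_digits(a) == fold_digits(b)
```

Folding to one digit set at comparison time means the user can type whichever digit their keyboard produces and still reach the same identifier.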
Further, in Arabic, several letters have different shapes in some positions in a word but identical shapes in others. Again, without some folding, there is no way to know which one was used.
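One such pair, as far as I know, is U+064A ARABIC LETTER YEH and U+06CC ARABIC LETTER FARSI YEH, which share their initial and medial shapes and differ only in final and isolated position. The sketch below folds one onto the other before comparison. The one-entry table `FOLD` and the helper `fold_letters` are purely illustrative; a real profile would need a reviewed table and a decision about which letter is the canonical target.

```python
# Hypothetical one-entry folding table: FARSI YEH -> ARABIC YEH.
FOLD = str.maketrans({"\u06cc": "\u064a"})


def fold_letters(identifier: str) -> str:
    """Apply the folding table before comparing or looking up identifiers."""
    return identifier.translate(FOLD)


# beh + yeh + teh, with the yeh in medial position in both spellings
with_arabic_yeh = "\u0628\u064a\u062a"
with_farsi_yeh = "\u0628\u06cc\u062a"

raw_equal = with_arabic_yeh == with_farsi_yeh
folded_equal = fold_letters(with_arabic_yeh) == fold_letters(with_farsi_yeh)
```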