Re: Can NFKC turn valid UAX 31 identifiers into non-identifiers?

From: Asmus Freytag via Unicode <unicode_at_unicode.org>
Date: Thu, 7 Jun 2018 10:38:50 -0700

On 6/7/2018 9:01 AM, Alastair Houghton via Unicode wrote:
> But please don’t misunderstand; I am not (and have not been) arguing against non-ASCII identifiers. We were asked whether there were any problems. These are problems (or perhaps we might call them “trade-offs”). We can debate the severity of them, and whether, and what, it’s worthwhile doing anything to mitigate any of them. What we shouldn’t do is sweep them under the carpet.


Once you go beyond ASCII (or really any small well-known set of shapes) to a very large universe like Unicode, you will lose something and gain something.

You will gain being able to express some things more like they would be written in ordinary text.

You will lose by having identifiers that can be more ambiguous, harder to recognize / replicate / display and so on.

Where identifiers are "private", say limited to the source code of a single application, the user/author is in control and can avoid problematic cases. However, even for source code, not all identifiers are truly private. Names for classes and modules get turned into filenames, modules may be shared, etc.

Code is also shared. If you use code that ostensibly calls on a public library or module, but your identifier system allows spoofing, your use of shared code may access malicious code hiding behind lookalike names.

Unicode has dozens of character pairs that look absolutely identical by design (http://www.unicode.org/Public/security/10.0.0/intentional.txt gives a subset of these), and many more combinations that could look identical in any given font (but are not necessarily identical in every font). Many of the latter are combining sequences that are not normalized.
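
To make the two cases concrete, here is a minimal Python sketch using only the standard `unicodedata` module. The sample characters and the helper name `same_after` are chosen for illustration only; they are not quoted from intentional.txt.

```python
import unicodedata

def same_after(form: str, s: str, t: str) -> bool:
    """Return True if s and t are equal once both are normalized to `form`."""
    return unicodedata.normalize(form, s) == unicodedata.normalize(form, t)

# Case 1: a precomposed letter versus a base letter plus combining mark.
# The two render alike, and normalization does unify them.
precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

# Case 2: letters from different scripts that look alike by design.
# Normalization leaves them distinct.
latin_a = "a"            # U+0061 LATIN SMALL LETTER A
cyrillic_a = "\u0430"    # U+0430 CYRILLIC SMALL LETTER A

print(same_after("NFC", precomposed, decomposed))   # True
print(same_after("NFKC", latin_a, cyrillic_a))      # False
```

Normalizing identifiers handles the first case, but the second needs a separate policy, such as restricting script mixing or applying a confusables mapping.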

For many complex scripts, not all possible orderings of code points are well-behaved. Some may not render on certain platforms / devices, while others do. Sometimes, two alternative orderings will look the same.

Not paying attention to these issues will cause your identifier system to be ill-behaved whenever it "leaks" into public identifier space, particularly when your identifiers become file names or names of network resources because you want to allow sharing of libraries or modules.

---

The main point about allowing identifiers to look like words is to make them mnemonic. Non-ASCII identifiers can be more mnemonic to those who use other scripts. However, one does not need to allow the full Unicode range unfiltered in order to achieve mnemonic labels. Unicode contains many things needed only for very specialized texts. One can always imagine some specialist delighting in writing a program where some object is spelled precisely as it is in the "real" world, but there really is no need to allow such edge cases to undermine the stability and security of a "reasonably" mnemonic system for the wider body of users.

To give an example: in Arabic, you can disallow *all* combining marks, and still get a strong identifier system. In fact, it will be stronger, because many accidental similarities to letter shapes will be eliminated (and it's not necessary to devise some complex folding).
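
A rule of this kind could be sketched in Python as follows. The function names are hypothetical, and `str.isidentifier()` is used only as a convenient stand-in for a UAX 31 style check; a real language would apply its own identifier definition.

```python
import unicodedata

def has_combining_mark(label: str) -> bool:
    """True if any character has general category Mn, Mc or Me."""
    return any(unicodedata.category(ch).startswith("M") for ch in label)

def is_allowed(label: str) -> bool:
    """Accept identifier-shaped labels that contain no combining marks."""
    return label.isidentifier() and not has_combining_mark(label)

plain = "\u0643\u062a\u0628"          # KAF, TEH, BEH
vocalized = "\u0643\u064e\u062a\u0628"  # same letters with FATHA after KAF

print(is_allowed(plain))      # True
print(is_allowed(vocalized))  # False
```

Because the marks are rejected outright, no folding table is needed to compare a vocalized spelling with an unvocalized one.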

However, in Arabic again, there are pairs of digits that look identical (and have the same numeric value). Allowing these into identifiers without some folding would make it impossible for users to know (without looking at the underlying bits) how to type an identifier containing one of them.
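
One possible folding for digits, sketched in Python: map the Extended Arabic-Indic digits (U+06F0..U+06F9) onto the Arabic-Indic digits (U+0660..U+0669) by numeric value. The choice of which set is the target, and the name `fold_digits`, are assumptions made for the example.

```python
import unicodedata

def fold_digits(label: str) -> str:
    """Replace U+06F0..U+06F9 with the U+0660..U+0669 digit of equal value."""
    out = []
    for ch in label:
        if "\u06f0" <= ch <= "\u06f9":
            out.append(chr(0x0660 + unicodedata.digit(ch)))
        else:
            out.append(ch)
    return "".join(out)

a = "\u0633\u0663"  # SEEN + ARABIC-INDIC DIGIT THREE
b = "\u0633\u06f3"  # SEEN + EXTENDED ARABIC-INDIC DIGIT THREE

print(a == b)                            # False
print(fold_digits(a) == fold_digits(b))  # True
```

Comparing identifiers on their folded form means a user can type either digit and still match.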

Further, in Arabic, several letters have distinct shapes in some positions in a word but identical shapes in other positions. Again, without some folding there is no way to know which letter was used.
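
A folding for such letters can be as simple as a lookup table. The sketch below uses one pair, ARABIC LETTER YEH (U+064A) and ARABIC LETTER FARSI YEH (U+06CC), which share their initial and medial shapes. The table and the name `fold_letters` are illustrative assumptions, and a usable table would be much larger.

```python
# Hypothetical folding table for str.translate(): source code point -> target.
LETTER_FOLD = {
    0x06CC: 0x064A,  # FARSI YEH -> YEH
}

def fold_letters(label: str) -> str:
    """Map each letter listed in LETTER_FOLD to its chosen representative."""
    return label.translate(LETTER_FOLD)

with_yeh = "\u064a\u0648\u0645"        # YEH, WAW, MEEM
with_farsi_yeh = "\u06cc\u0648\u0645"  # FARSI YEH, WAW, MEEM

print(with_yeh == with_farsi_yeh)                              # False
print(fold_letters(with_yeh) == fold_letters(with_farsi_yeh))  # True
```

The two strings differ only in a word-initial letter, which is a position where the two shapes coincide.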

A./

PS: for the past several years, I've been part of a project that seeks to extend top-level domain names beyond ASCII. To get an idea of what that entails, check out https://icann.org/idn and look for "Root Zone Label Generation Rules", for example Arabic (https://www.icann.org/sites/default/files/lgr/lgr-2-arabic-script-26jul17-en.html). For a detailed discussion of the design, see https://www.icann.org/en/system/files/files/arabic-lgr-proposal-18nov15-en.pdf.

These are for Root Zone identifiers, which exclude digits for example, so you won't find discussion of digit-related issues. You also won't find mention of any "foldings", but that is because the Root Zone uses a related concept of "variant". For a programming language, some custom folding would probably be a better approach as implementation of variants implicitly requires a registry.
Received on Thu Jun 07 2018 - 12:38:54 CDT
