Re: Can NFKC turn valid UAX 31 identifiers into non-identifiers?

From: Frédéric Grosshans via Unicode
Date: Thu, 7 Jun 2018 19:47:02 +0200
On 07/06/2018 at 18:01, Alastair Houghton wrote:

I appreciate that the upshot of the Anglicised world of software engineering is that native English speakers have an advantage, and those for whom Latin isn’t their usual script are at a particular disadvantage, and I’m sure that seems unfair to many of us — but that doesn’t mean that allowing the use of other scripts everywhere, desirable as it is, is entirely unproblematic.
It depends on what one means by “allowing”, and it clearly can be problematic to use non-ASCII characters. Restriction to (a subset of) ASCII is indeed often the most reasonable choice, but when one writes a specification for something which can be used in many contexts (like URLs or a programming language), not allowing it means forbidding it, even in contexts where it makes sense.
If I understand you correctly, an Arabic speaker should always transliterate the function name to ASCII,

That’s one option; or they could write it in Arabic, but they need to be aware of the consequences of doing so (and those they are working for or with also need to understand that) [...];
We agree on this: they should be aware of the consequences. I think these consequences should be essentially societal (as in the example you give), but not technical, since the former are supposed to be well understood by everyone.


 UAX #31 also manages (I suspect unintentionally?) to give a good example of a pair of Farsi identifiers that might be awkward to tell apart in certain fonts, namely نامهای and نامه‌ای; I think those are OK in monospaced fonts, where the join is reasonably wide, but at small point sizes in proportional fonts the difference in appearance is very subtle, particularly for a non-Arabic speaker.
In ASCII, identifiers with I, l, and 1 can be difficult to tell apart. And it is not an artificial problem: I once had some difficulty with an automatically generated login which was do11y, which I kept trying to type as dolly, despite my familiarity with ASCII. So I guess this problem is not specific to the ASCII vs non-ASCII debate.

It isn’t, though fonts used by programmers typically emphasise the differences between I, l and 1 as well as 0 and O, 5 and S and so on specifically to avoid this problem.
In your example, you specifically mentioned that it “might be awkward to tell apart in certain fonts” but “OK in monospaced fonts”, so there is nothing ASCII-specific here.
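
To make the Farsi pair concrete, here is a minimal Python sketch that lists the code points of the two identifiers. The escapes below are my assumption about how the UAX #31 example is encoded (in particular that the final letter is U+06CC); the only intended difference between the two strings is a single U+200C ZERO WIDTH NON-JOINER. The helper name `describe` is purely illustrative.

```python
import unicodedata

ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER

# Assumed code points for the two identifiers discussed above.
joined = "\u0646\u0627\u0645\u0647\u0627\u06cc"
separated = "\u0646\u0627\u0645\u0647" + ZWNJ + "\u0627\u06cc"

def describe(identifier):
    """Return one 'U+XXXX NAME' string per code point."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in identifier]

for label, ident in (("joined", joined), ("separated", separated)):
    print(label)
    for line in describe(ident):
        print("   ", line)

# The strings are distinct, and NFKC does not remove the ZWNJ,
# so normalisation alone does not merge the two identifiers.
print(joined == separated)
print(unicodedata.normalize("NFKC", separated) == separated)
```

So the two names differ by one invisible character, which is the same kind of problem as do11y versus dolly: distinct code point sequences that a reader can easily mistake for each other.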

But please don’t misunderstand; I am not — and have not been — arguing against non-ASCII identifiers. We were asked whether there were any problems. These are problems (or perhaps we might call them “trade-offs”). We can debate the severity of them, and whether, and what, it’s worthwhile doing anything to mitigate any of them. What we shouldn’t do is sweep them under the carpet.

I totally agree. (And I misunderstood you in the first place, probably because “non-ASCII is bad, whatever the context” is a common attitude among programmers, even those whose native script is not Latin.)

Personally I think a combination of documentation to explain that it’s worth thinking carefully about which script(s) to use, and some steps to consider certain characters to be equivalent even though they aren’t the same (and shouldn’t be the same even when normalised) might be a good idea. Is that really so controversial a position?

Not at all. I misread “for reasonably wide values of ‘everyone’, at any rate…” as saying “it is unreasonable to think of people not comfortable with ASCII”, but that is clearly not what you intended to say.
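
The idea of considering certain characters to be equivalent even though they are not the same can be sketched as a check that flags identifiers which collide after folding look-alike characters. The fold table below is a hypothetical, deliberately tiny ASCII-only illustration built from the pairs mentioned in this thread (I, l, 1; 0, O; 5, S). It is not the confusables data of UTS #39, and the function names are made up for this sketch.

```python
from collections import defaultdict

# Hypothetical fold table: each look-alike maps to one representative.
FOLD = str.maketrans({"1": "l", "I": "l", "0": "O", "5": "S"})

def skeleton(identifier):
    """Fold look-alike characters so confusable names compare equal."""
    return identifier.translate(FOLD)

def collisions(identifiers):
    """Group distinct identifiers that share the same skeleton."""
    groups = defaultdict(set)
    for name in identifiers:
        groups[skeleton(name)].add(name)
    return [sorted(group) for group in groups.values() if len(group) > 1]

print(collisions(["do11y", "dolly", "total"]))
# [['do11y', 'dolly']]
```

A tool could use such a check to warn without forbidding anything: both names remain valid identifiers, and they are only reported as easy to confuse.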

We both agree that:


Received on Thu Jun 07 2018 - 12:47:24 CDT