Re: AW: Security concerns: OGHAM SPACE MARK from Asmus Freytag (t) on 2015-07-21 (Unicode Mail List Archive)

From: Asmus Freytag (t) <asmus-inc_at_ix.netcom.com>
Date: Tue, 21 Jul 2015 16:06:45 -0700

On 7/21/2015 2:55 PM, Dreiheller, Albrecht wrote:

Of course, there are confusables within the ASCII range, but they have been well known for years and are thus more likely to be detected.

Regarding your other example, some compilers warn if you have an assignment within an if-clause.

I used the term "exclusion rules" to mean a ruleset based on the confusables list.

For example the following code sequence

int a; { int а; a = 5; } (N.B. the second "а" is Cyrillic)

could be banned by a rule saying

"It's not allowed to declare a variable that is DISTINCT from others (thus not hiding them) but which is CONFUSABLY SIMILAR to another variable in the same scope."

Another rule could demand "It's not allowed to mix two alphabets within one name".

This would not ban Cyrillic identifiers in general.

This situation also exists with certain internet zones, where domain names can exist in multiple scripts.

One solution that implements the scheme you suggest is to partition the set of code points into sets of "equivalent" code points - within each set, all code points are confusable (or strongly confusable, whatever your preference).

Then, once an identifier uses a code point from one of these sets in a given position, it "blocks" every later identifier that differs from it only by having a different member of the same set in that position.

Done.

Creating the partition requires some care, but implementing the blocking is quite fast. Each set has one "index" element (the one with the lowest code point value), so all you need to do is translate every identifier (internally) into the form that uses the "index" element in each position. That translated form is the key you enter into your symbol table. (It's not actually a hash, but it works similarly.) Any other identifier that should be blocked will attempt to go into the same slot in your symbol table; because you have retained the original spelling as well, you can see that the two are not in fact equal, and reject the second one.
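
To make the blocking scheme concrete, here is a minimal sketch in Python. The three confusable sets are hand-picked for illustration only (a real partition would be derived, with care, from the confusables data), and the names CONFUSABLE_SETS, skeleton and SymbolTable are assumptions of this sketch, not anything from an existing compiler.

```python
# Hypothetical, hand-picked confusable sets for illustration only.
# A real partition would be built carefully from the confusables data.
CONFUSABLE_SETS = [
    {"a", "\u0430"},            # Latin a, Cyrillic a
    {"e", "\u0435"},            # Latin e, Cyrillic ie
    {"o", "\u043e", "\u03bf"},  # Latin o, Cyrillic o, Greek omicron
]

# Map every member of a set to its "index" element:
# the member with the lowest code point value.
INDEX_OF = {ch: min(members) for members in CONFUSABLE_SETS for ch in members}


def skeleton(identifier):
    """Replace each code point by the index element of its set, if any."""
    return "".join(INDEX_OF.get(ch, ch) for ch in identifier)


class SymbolTable:
    """Keyed by skeleton; keeps the original spelling for comparison."""

    def __init__(self):
        self._original_by_skeleton = {}

    def declare(self, identifier):
        key = skeleton(identifier)
        existing = self._original_by_skeleton.get(key)
        if existing is not None and existing != identifier:
            # Same slot, different spelling: a confusable, so reject it.
            raise ValueError(
                f"{identifier!r} is confusable with existing {existing!r}"
            )
        # First use of this slot, or the very same identifier again.
        self._original_by_skeleton[key] = identifier
```

With this table, declaring "a" and then the Cyrillic "\u0430" raises ValueError, while declaring "a" twice is accepted because the retained spelling matches. Scoping is left out here; a real symbol table would keep one such map per scope.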

On the same principles you could write a validator for source code using non-ASCII labels to make sure there's no abuse.

An additional restriction, as you suggest, would be to force each identifier to be in a single script (not a single alphabet; that restriction isn't very workable). You still need to deal with confusables as outlined above, but the chance that an identifier is unique is greater.
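
A rough sketch of the single-script check, again in Python. The standard unicodedata module does not expose the Script property, so this sketch assumes, as a crude stand-in, that the first word of a letter's character name identifies its script; a real validator would use the actual Script property data. The function names are made up for this example.

```python
import unicodedata


def script_guess(ch):
    """Crude proxy for the script: first word of the character name.

    This is an assumption of the sketch, not the real Script property.
    """
    return unicodedata.name(ch, "UNKNOWN").split()[0]


def scripts_used(identifier):
    """Collect the guessed scripts of all letters in the identifier.

    Digits, underscores and other non-letters are ignored, since they
    are shared across scripts.
    """
    return {
        script_guess(ch)
        for ch in identifier
        if unicodedata.category(ch).startswith("L")
    }


def is_single_script(identifier):
    """True if all letters appear to come from one script."""
    return len(scripts_used(identifier)) <= 1
```

An all-Latin or all-Cyrillic identifier passes this check; "c\u043eunt", with a Cyrillic "о" among Latin letters, does not. The check complements the blocking scheme rather than replacing it, since two whole-script identifiers can still be confusable with each other.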

A./
Received on Tue Jul 21 2015 - 18:08:13 CDT
