Re: Mixed-Script confusables in prog.languages

From: gfb hjjhjh <c933103_at_gmail.com>
Date: Mon, 5 Dec 2016 20:51:56 +0800

How about package names like ロシアМС21 (note that the МС is Cyrillic), or
πr²の秘密, or エリ_хорошо_μ'sic_4⃣ever? Although they aren't names that people
would usually choose for packages or variables, they are meaningful names...

On 5 Dec 2016 at 16:39, "Reini Urban" <reini_at_cpanel.net> wrote:

>
> > On Dec 4, 2016, at 11:45 PM, Richard Wordingham <
> richard.wordingham_at_ntlworld.com> wrote:
> >
> > On Sun, 4 Dec 2016 12:09:36 +0100
> > Reini Urban <reini_at_cpanel.net> wrote:
> >
> >> * normalize identifiers (NFC) and only store normalized variants.
> >> this should catch bidi spoofs, combining characters and such.
> >
> > That doesn't catch bidi spoofs.
>
> Right. Bidi spoofs are already caught by the ID_Start/ID_Continue rule.
>
> i.e. ‮goog‬le <U+202E (right-to-left override), g, o, o, g, U+202C (pop
> directional formatting), l, e>
> is already caught as illegal.
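
A small sketch of both points above. It uses Python only because its
str.isidentifier() follows the XID_Start/XID_Continue properties; it is an
illustration, not cperl's implementation, and the variable names are made up.
NFC leaves the override character in place, while the identifier rule rejects
the spoofed name.

```python
import unicodedata

# The spoof from the example: RLO, "goog", PDF, "le".
spoof = "\u202egoog\u202cle"

# NFC does not strip bidi format characters, so normalization alone
# does not catch this spoof.
nfc = unicodedata.normalize("NFC", spoof)
has_override = "\u202e" in nfc

# U+202E and U+202C are format characters (category Cf) and are not
# XID_Continue, so the identifier check rejects the name.
is_ident = spoof.isidentifier()
plain = "google".isidentifier()

print(has_override, is_ident, plain)  # True False True
```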
>
> Mixing RTL scripts, such as Arabic, with Latin is not caught by the
> mixed-script rule per se.
>
> >> * check each unicode code point for its Script property and besides
> >> Latin, Common and Inherited only allow the first script, but error on
> >> any other mixed script. Additional scripts need to be declared.
> >> https://github.com/perl11/cperl/issues/229
> >>
> >> in perl like this:
> >> use utf8 'Greek', 'Cyrillic';
> >
> > Your rule isn't clear. Would an identifier like ψ_S be automatically
> > allowed?
>
> ψ_S contains Greek U+03C8, Common and Latin. Since Latin and Common are
> always allowed, the only new script is Greek. The first non-default script
> is automatically and silently allowed; only a mix with another non-default
> script, such as Cyrillic, would error or need an explicit declaration.
>
> So ψ_S alone is fine, if everything else is Greek.
> But mixing with the Cyrillic version would lead to an error.
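
The rule as described can be sketched roughly as follows. This is a
hypothetical illustration in Python, not the cperl code. Python's standard
library does not expose the Script property, so the script is guessed from
the first word of the character name (an assumption that is wrong for some
characters, and that will not match multi-word script names such as
Canadian_Aboriginal). Non-letters are treated as Common/Inherited. The names
ScriptChecker and letter_scripts are made up for this sketch.

```python
import unicodedata

def letter_scripts(identifier):
    """Approximate scripts of the letters in identifier, in order of appearance."""
    scripts = []
    for ch in identifier:
        # Assumption: digits, punctuation and marks count as Common/Inherited.
        if not unicodedata.category(ch).startswith("L"):
            continue
        # Heuristic stand-in for the Script property: first word of the name.
        script = unicodedata.name(ch, "UNKNOWN").split()[0]
        if script not in scripts:
            scripts.append(script)
    return scripts

class ScriptChecker:
    """Tracks the first non-default script seen across all identifiers."""

    def __init__(self, declared=()):
        # Latin is always allowed; declared scripts may be mixed freely.
        self.allowed = {"LATIN"} | {s.upper() for s in declared}
        self.first = None

    def check(self, identifier):
        for script in letter_scripts(identifier):
            if script in self.allowed:
                continue
            if self.first is None:
                self.first = script      # first new script: silently allowed
            elif script != self.first:
                return False             # a second undeclared script: error
        return True

checker = ScriptChecker()
greek_ok = checker.check("\u03c8_S")      # Greek psi, then Common and Latin
cyrillic_ok = checker.check("\u0471_S")   # Cyrillic psi after Greek was seen

declared = ScriptChecker(declared=("Greek", "Cyrillic"))
both_ok = declared.check("\u03c8_S") and declared.check("\u0471_S")

print(greek_ok, cyrillic_ok, both_ok)  # True False True
```

Keeping the "first script" state in the checker, rather than per identifier,
reflects the "if everything else is Greek" condition above.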
>
> > I presume you're handling the spoofing of the SMALL PHI characters by
> > other means.
>
> The spoof attempt would be ѱ_S with Cyrillic U+0471, Common, Latin.
> That mixes two non-default scripts, which is illegal if undeclared.
> The same holds for PHI, which exists as Greek or Cyrillic. Most Greek
> characters have confusable Cyrillic counterparts, which is why a
> declaration such as
> use utf8 'Greek', 'Cyrillic';
> i.e. mixing those two, sounds highly dangerous.
> With the UCD confusables table this would be an error. Under my rule it
> is not, since the user declared that those two scripts may be mixed.
>
> > For multilingual support, you would want rules more like
> >
> > 'After script X, allow script Y’.
>
> Can you expand on that with an example? I’m no expert on this.
>
> Like after Hangul, allow Han? After Hiragana, allow Katakana?
>
> >> Of course there exist several languages which require more than one
> >> script,
> > <snip>
> >> or african languages as some have other than Latin roots, e.g.
> >> Ethiopian from Semitic.
> >
> > I don't see your problem here. What problem do you see with Amharic?
>
> Amharic is not defined as a value of the UCD Script property. Its alphabet
> is called Ge’ez, which the UCD calls Ethiopic. But that’s all I know. I’m
> not a domain expert. Does Ethiopic use other Semitic scripts in its
> alphabet, or is it complete? I learned some CJK languages, where mixed
> scripts are historically allowed. But for other scripts I’m clueless.
> The examples I got mix it with Runic. Valid or nonsense?
>
> The problem is to decide which scripts are commonly mixed in which
> languages, so that such mixes can be allowed in valid identifiers.
>
> How about the many Indian scripts? Do they mix?
> Being an Indian movie expert tells me that Indian languages usually don’t
> mix. They make Tamil and Bengali versions of Hindi movies, and usually fall
> back to English to get common points across the barrier. But their scripts?
> No idea.
>
> >
> >> Indian languages also sound problematic,
> >
> > Is this the ZWJ/ZWNJ issue? That surely is a problem within a script.
> >
> >> and
> >> all the Old_<script>
> >
> > Now I am confused. What problem do you see that you don't have in the
> > Latin script?
>
> I have no idea whether those Old_<script> alphabets are still in use, and
> so whether it makes sense to create aliases for them.
> In the examples in perl, which partially came from parrot, there’s a wildly
> eclectic mix of various scripts which make no sense at all. So I don’t know
> whether I can trust those tests, that they make sense and are readable at
> all. My guess is that the authors just liked code golfing and picked random
> Unicode characters. It’s from perl after all.
>
> Such as this perl test t/mro/isa_c3_utf8.t
>
> use utf8 qw( Hangul Cyrillic Ethiopic Canadian_Aboriginal Malayalam
> Hiragana );
>
> ...
> package 캎oẃ;
> package urḲḵk;
> @urḲḵk::ISA = 'kഌoんḰ';
> package к;
> @urḲḵk::ISA = ('kഌoんḰ', '캎oẃ');
> package ṭ화ckэ;
> ...
>
> These identifiers are unreadable, because I don’t assume that anybody will
> be able to understand Hangul, Cyrillic, Ethiopic, Canadian_Aboriginal,
> Malayalam and Hiragana at once.
> I understand a bit of Hangul, Cyrillic and Hiragana, but the mix sounds
> highly illegal to me.
>
> So my rule makes sense. You need to declare non-default scripts used in
> your identifiers if mixed.
> (Not in strings; those can contain anything, even illegal UTF-8.)
>
>
>
Received on Mon Dec 05 2016 - 06:52:38 CST