Re: CLDR ExemplarCharacters Data for Identity Management Data Validation Rules

From: Thierry Moreau (thierry.moreau@connotech.com)
Date: Wed May 13 2009 - 22:02:21 CDT

Next message: Joó Ádám: "Re: So much for accuracy!"

Previous message: Behnam: "Fwd: So much for accuracy!"
In reply to: Thierry Moreau: "CLDR ExemplarCharacters Data for Identity Management Data Validation Rules"

See below for an update ...

I wrote:

>
>
> This post is a general question about the design of validation logic for
> an identity management application.
>
> This is somewhat related to IDN validity rules, but with slightly
> different application requirements.
>
> UTR#36 (Unicode Security Considerations) Annex G (Language-Based
> Security) was published a few days after CLDR version 1.4 was released,
> in 2006-07. The text of this annex recommends the use of Unicode scripts
> as a basis for name validation rules, and recommends writing systems
> instead of languages as a refined strategy.
>
> In the meantime, the CLDR project moved to version 1.6 (and 1.6.1) and
> improved "data on language and script usage" (presumably this covers
> exemplarCharacters).
>
> The main question is whether the UTR#36 / Annex G advice *against*
> using CLDR data for validation rules (e.g. for security-aware
> applications where identity spoofing is a threat) has been revisited
> by anyone.
>
> So far, my investigations along these lines indicate that it should be
> feasible to combine Unicode script information and CLDR
> exemplarCharacters data with a lot of adjustments (e.g. to remove
> historic or phonetic scripts) to come up with language-specific rules
> for what is an acceptable identity in a given language (actually the
> rules may apply to personal identification data elements such as place
> of birth). Obviously, such validation applies to normalized strings.
>
> Any comment or suggestion?
>
> Thanks in advance.
>

In the absence of feedback, here is an update.

CLDR data (exemplarCharacters) is useful as a source of knowledge for
automatic validation rules for data strings in the context of an
internationalized identity management application.

In the course of this investigation, I merged the CLDR data with the
Unicode Character Database in a MySQL database where reconciliation was
possible. I ignored complex languages (CJK, i.e. ja, ko, and zh, and
also mn in the Mongolian script), and languages or scripts of historic,
phonetic, artificial, or very minor interest (cop, el_POLYTON, en_Dsrt,
en_Shaw, cch, eo, gv, ia, and trv).

This leaves 126 languages using 27 scripts. Six languages use more than
one script according to CLDR data (many more such languages exist
without being reflected in the CLDR). The Latin script is used by 77
languages.
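The kind of reconciliation described above can be sketched as a join
between an exemplar-character table and a per-code-point script table.
This is only an illustration under stated assumptions: it uses Python's
built-in sqlite3 in place of MySQL, the table and column names (ucd,
exemplar, cp, lang, kind) are hypothetical, and the handful of rows are
toy data, not actual CLDR content.

```python
import sqlite3

# In-memory SQLite stands in for the MySQL database (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ucd (cp INTEGER PRIMARY KEY, script TEXT);
    CREATE TABLE exemplar (lang TEXT, cp INTEGER, kind TEXT);
""")

# Toy rows: a few code points with their Unicode script.
conn.executemany("INSERT INTO ucd VALUES (?, ?)", [
    (ord("a"), "Latin"),
    (ord("\u00e9"), "Latin"),      # e with acute
    (ord("\u0161"), "Latin"),      # s with caron
    (ord("\u0431"), "Cyrillic"),   # Cyrillic be
    (ord("\u0448"), "Cyrillic"),   # Cyrillic sha
])

# Toy exemplar rows (hypothetical, not copied from CLDR).
conn.executemany("INSERT INTO exemplar VALUES (?, ?, ?)", [
    ("fr", ord("a"), "main"), ("fr", ord("\u00e9"), "main"),
    ("sr", ord("\u0431"), "main"), ("sr", ord("\u0448"), "main"),
    ("sr", ord("\u0161"), "main"),
    ("eo", ord("a"), "main"),
])

# Locale IDs left out of the study. The case of mn in the Mongolian
# script would need a script-level filter and is omitted here.
EXCLUDED = ("ja", "ko", "zh", "cop", "el_POLYTON", "en_Dsrt",
            "en_Shaw", "cch", "eo", "gv", "ia", "trv")
placeholders = ",".join("?" * len(EXCLUDED))

# Number of retained languages per script.
languages_per_script = dict(conn.execute(f"""
    SELECT u.script, COUNT(DISTINCT e.lang)
    FROM exemplar e JOIN ucd u ON u.cp = e.cp
    WHERE e.lang NOT IN ({placeholders})
    GROUP BY u.script
""", EXCLUDED))

# Retained languages whose exemplar characters span several scripts.
multi_script_languages = [row[0] for row in conn.execute(f"""
    SELECT e.lang
    FROM exemplar e JOIN ucd u ON u.cp = e.cp
    WHERE e.lang NOT IN ({placeholders})
    GROUP BY e.lang
    HAVING COUNT(DISTINCT u.script) > 1
    ORDER BY e.lang
""", EXCLUDED)]
```

With real data loaded, the first query would give the per-script
language counts and the second the list of multi-script languages.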

One source of uncertainty lies with the "auxiliary" exemplarCharacters.
The most paradoxical case is the Serbian language using the Cyrillic
script: including the "auxiliary" collection in the allowed characters,
there are more diacritical marks allowed than if Serbian is written
using the Latin script! The lesson I draw is that "auxiliary"
collections might be left out (as invalid characters for personal
identification data strings). After all, erring on the stricter side is
perhaps not so bad in an identity management application.
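The strict rule discussed above (main collection only, "auxiliary" left
out unless explicitly requested) might look like the following sketch.
The character sets are hand-written placeholders for illustration, not
actual CLDR exemplarCharacters values, and the function name, the
EXEMPLARS table and the small set of tolerated name punctuation are all
assumptions.

```python
import unicodedata

def _chars(text):
    # Store exemplar characters in NFC so comparisons are consistent.
    return set(unicodedata.normalize("NFC", text))

# Hypothetical sets for illustration only; real values would be taken
# from the CLDR exemplarCharacters data for each locale.
EXEMPLARS = {
    "fr": {
        "main": _chars("abcdefghijklmnopqrstuvwxyz"
                       "\u00e0\u00e2\u00e6\u00e7\u00e9\u00e8\u00ea\u00eb"
                       "\u00ee\u00ef\u00f4\u0153\u00f9\u00fb\u00fc\u00ff"),
        "auxiliary": _chars("\u00e1\u00ed\u00f1\u00f3\u00fa"),
    },
}

# Assumed extra characters tolerated inside names (not from CLDR).
NAME_PUNCTUATION = set(" -'")

def invalid_characters(text, locale, allow_auxiliary=False):
    """Return the characters of `text` that are not allowed for `locale`."""
    sets = EXEMPLARS[locale]
    allowed = sets["main"] | NAME_PUNCTUATION
    if allow_auxiliary:
        allowed = allowed | sets["auxiliary"]
    # Validation applies to normalized strings: NFC composes base + mark.
    normalized = unicodedata.normalize("NFC", text)
    # The sets hold lowercase letters, so compare on the lowercased form.
    return [ch for ch in normalized if ch.lower() not in allowed]
```

For example, with these placeholder sets a name containing n with tilde
is rejected by default and accepted only when allow_auxiliary=True.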

Incidentally, "Internationalization in the courtroom" occurred in my
jurisdiction: a judge ordered that a Portuguese diacritical mark be
accepted for an official birth registration. But the ruling got
reversed by the legislative assembly, and now only French diacritical
marks are allowed by law on proper names of persons in the province of
Québec (this amendment to the law must have been brought by civil
servants without debate by members of parliament, as a trivial
technicality). Therefore, there might be some wisdom in the restriction
of valid characters to CLDR exemplarCharacters without the "auxiliary"
collection.

The next issue I face is the normalization of punctuation marks and digits.

I haven't checked how this relates to IDN issues from a solution design
perspective.

You are welcome to provide feedback!

Regards,

-- 
- Thierry Moreau
CONNOTECH Experts-conseils inc.
9130 Place de Montgolfier
Montreal, Qc
Canada   H2M 2A1
Tel.: (514)385-5691
Fax:  (514)385-5900
web site: http://www.connotech.com
e-mail: thierry.moreau@connotech.com

