Re: CLDR ExemplarCharacters Data for Identity Management Data Validation Rules

From: Thierry Moreau (thierry.moreau@connotech.com)
Date: Wed May 13 2009 - 22:02:21 CDT


    See below for an update ...

    I wrote:

    >
    >
    > This post is a general question about the design of validation logic for
    > an identity management application.
    >
    > This is somewhat related to IDN validity rules, but with slightly
    > different application requirements.
    >
    > UTR#36 (Unicode Security Considerations) Annex G (Language-Based
    > Security) was published a few days after CLDR version 1.4 was released,
    > in 2006-07. The text of this annex recommends the use of Unicode scripts
    > as a basis for name validation rules, and recommends writing systems
    > instead of languages as a refined strategy.
    >
    > In the meantime, the CLDR project moved to version 1.6 (and 1.6.1) and
    > improved "data on language and script usage" (presumably this covers
    > exemplarCharacters).
    >
    > The main question is whether the UTR#36 / Annex G advice *against* using
    > CLDR data for validation rules (e.g. for security-aware applications
    > where identity spoofing is a threat) has been revisited by anyone.
    >
    > So far, my investigations along these lines indicate that it should be
    > feasible to combine Unicode script information and CLDR
    > exemplarCharacters data with a lot of adjustments (e.g. to remove
    > historic or phonetic scripts) to come up with language-specific rules
    > for what is an acceptable identity in a given language (actually the
    > rules may apply to personal identification data elements such as place
    > of birth). Obviously, such validation applies to normalized strings.
    >
    > Any comment or suggestion?
    >
    > Thanks in advance.
    >

    In the absence of feedback, here is an update.

    CLDR data (exemplarCharacters) is useful as a source of knowledge for
    automatic validation rules for data strings in the context of an
    internationalized identity management application.

    In the course of this investigation, I merged the CLDR data with the
    Unicode Character Database in a MySQL database where reconciliation was
    possible. I ignored complex languages (CJK, i.e. ja, ko, and zh, and
    also mn in the Mongolian script), and languages or scripts of historic,
    phonetic, artificial, or very minor interest (cop, el_POLYTON, en_Dsrt,
    en_Shaw, cch, eo, gv, ia, and trv).
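
    The exclusion step above can be sketched as a simple filter over
    locale identifiers. This is only an illustration, not the queries
    actually run against the MySQL database: the function name
    is_retained and the two sets are hypothetical, and "mn_Mong" is an
    assumed identifier for "mn in the Mongolian script".

```python
# Languages set aside entirely (CJK, plus historic, artificial or very minor ones).
EXCLUDED_LANGUAGES = {"ja", "ko", "zh", "cop", "cch", "eo", "gv", "ia", "trv"}

# Script or variant forms set aside while the base language is kept.
# "mn_Mong" is an assumed spelling for Mongolian in the Mongolian script.
EXCLUDED_VARIANTS = {"el_POLYTON", "en_Dsrt", "en_Shaw", "mn_Mong"}


def is_retained(locale_id: str) -> bool:
    """Return True if the locale is kept for building validation rules."""
    parts = locale_id.split("_")
    if parts[0] in EXCLUDED_LANGUAGES:
        return False
    # Compare language plus first subtag, so "en_Dsrt_US" is excluded as well.
    return "_".join(parts[:2]) not in EXCLUDED_VARIANTS


if __name__ == "__main__":
    for loc in ["fr", "en", "en_Dsrt", "zh_Hant", "mn", "mn_Mong", "el_POLYTON"]:
        print(loc, is_retained(loc))
```

    Note that the base languages en, el and mn stay in scope; only the
    listed script or variant forms are dropped.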

    This leaves 126 languages using 27 scripts. According to CLDR data, six
    languages use more than one script (many more exist without being
    reflected in the CLDR). The Latin script is used by 77 languages.

    One source of uncertainty lies with the "auxiliary" exemplarCharacters.
    The most paradoxical case is the Serbian language using the Cyrillic
    script: when the "auxiliary" collection is included in the allowed
    characters, more diacritical marks are allowed than if Serbian is
    written using the Latin script! The lesson I draw is that "auxiliary"
    collections might be left out (i.e. treated as invalid characters for
    personal identification data strings). After all, erring on the
    stricter side is perhaps not so bad in an identity management
    application.
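
    A minimal sketch of the strict versus lenient choice, assuming the
    exemplar sets have already been extracted into plain Python sets.
    The function invalid_characters and the toy sets below are
    hypothetical: the sets are hand-typed for illustration and are NOT
    actual CLDR data for any locale. Comparing on the lowercased
    character is also an assumption about how the sets are stored.

```python
import unicodedata

# Toy data for illustration only; not copied from CLDR.
TOY_MAIN = set("abcdefghijklmnopqrstuvwxyz") | {"é", "è", "ê"}
TOY_AUXILIARY = {"ã", "ñ"}


def invalid_characters(text, main, auxiliary=frozenset(), allow_auxiliary=False):
    """List the characters of text that fall outside the allowed set.

    The string is put in NFC first, since validation is meant to apply
    to normalized strings.
    """
    allowed = set(main)
    if allow_auxiliary:
        allowed |= set(auxiliary)
    normalized = unicodedata.normalize("NFC", text)
    return [ch for ch in normalized if ch.lower() not in allowed]


if __name__ == "__main__":
    name = "Joa\u0303o"  # decomposed form: "a" followed by a combining tilde
    print(invalid_characters(name, TOY_MAIN, TOY_AUXILIARY))
    print(invalid_characters(name, TOY_MAIN, TOY_AUXILIARY, allow_auxiliary=True))
```

    With allow_auxiliary left at False the check errs on the stricter
    side, as suggested above. Spaces, hyphens and other punctuation
    would also be rejected by this sketch unless added to the sets.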

    Incidentally, "Internationalization in the courtroom" occurred in my
    jurisdiction: a judge ordered that a Portuguese diacritical mark be
    used in an official birth registration. But the ruling was reversed by
    the legislative assembly, and now only French diacritical marks are
    allowed by law in proper names of persons in the province of Québec
    (this amendment to the law must have been brought by civil servants
    without debate by members of parliament, as a trivial technicality).
    Therefore, there might be some wisdom in restricting valid characters
    to CLDR exemplarCharacters without the "auxiliary" collection.

    The next issue I face is the normalization of punctuation marks and digits.
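
    For the digit part, one possible starting point (an assumption, not
    a settled design) is to fold every Unicode decimal digit to its
    ASCII equivalent using Python's standard unicodedata module. The
    function name normalize_digits is hypothetical, and punctuation
    marks are not handled here.

```python
import unicodedata


def normalize_digits(text: str) -> str:
    """Replace any Unicode decimal digit with the matching ASCII digit."""
    out = []
    for ch in text:
        # unicodedata.decimal returns the default (None) for non-digits.
        value = unicodedata.decimal(ch, None)
        out.append(str(value) if value is not None else ch)
    return "".join(out)


if __name__ == "__main__":
    # Arabic-Indic digits for 2009, written as escapes for clarity.
    print(normalize_digits("\u0662\u0660\u0660\u0669"))
```

    Whether such folding is acceptable depends on the data element: it
    may suit a date of birth but not a free-form name field.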

    I didn't check how this relates to IDN issues from a solution design
    perspective.

    You are welcome to provide feedback!

    Regards,

    -- 
    - Thierry Moreau
    CONNOTECH Experts-conseils inc.
    9130 Place de Montgolfier
    Montreal, Qc
    Canada   H2M 2A1
    Tel.: (514)385-5691
    Fax:  (514)385-5900
    web site: http://www.connotech.com
    e-mail: thierry.moreau@connotech.com
    

