IDN and Missed Normalisations

From: Richard Wordingham (richard.wordingham@ntlworld.com)
Date: Mon May 07 2007 - 09:56:25 CDT

    The present standard for Internationalized Domain Name processing
    (nameprep, RFC 3491, and stringprep, RFC 3454) operates in four steps:
    mapping, normalisation (NFKC), prohibition and bidi checking. Mapping
    replaces single characters by sequences, which may be empty. It is composed
    of two elements: deletion of default ignorables, and full case-folding,
    which is complicated because it is done before compatibility decomposition.
    (I may have missed some minor wrinkles in mapping.)
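
    As a rough illustration of the four steps, here is a minimal sketch built
    on Python's standard stringprep and unicodedata modules. The function name
    nameprep_sketch is hypothetical, the choice of prohibition tables follows
    my reading of RFC 3491, and the normalisation step uses whatever Unicode
    data the interpreter ships rather than the Unicode 3.2 data the RFCs
    assume, so treat it as an approximation and not a conforming
    implementation.

```python
import stringprep
import unicodedata


def nameprep_sketch(label: str) -> str:
    """Illustrative four-step pass: map, normalise, prohibit, check bidi."""
    # Step 1: mapping. Table B.1 characters are mapped to nothing (deleted);
    # table B.2 applies the case folding intended for use with NFKC.
    mapped = "".join(
        stringprep.map_table_b2(ch)
        for ch in label
        if not stringprep.in_table_b1(ch)
    )

    # Step 2: normalisation (NFKC). Assumption: current Unicode data is used
    # here, not the Unicode 3.2 tables that RFC 3454 is based on.
    normalised = unicodedata.normalize("NFKC", mapped)

    # Step 3: prohibition. Reject any character found in these tables.
    prohibited = (
        stringprep.in_table_c12, stringprep.in_table_c22,
        stringprep.in_table_c3, stringprep.in_table_c4,
        stringprep.in_table_c5, stringprep.in_table_c6,
        stringprep.in_table_c7, stringprep.in_table_c8,
        stringprep.in_table_c9,
    )
    for ch in normalised:
        if any(check(ch) for check in prohibited):
            raise ValueError(f"prohibited character U+{ord(ch):04X}")

    # Step 4: bidi checking. If any R/AL character (table D.1) is present,
    # no L character (table D.2) may appear, and the label must both start
    # and end with an R/AL character.
    if any(stringprep.in_table_d1(ch) for ch in normalised):
        if any(stringprep.in_table_d2(ch) for ch in normalised):
            raise ValueError("label mixes R/AL and L characters")
        if not (stringprep.in_table_d1(normalised[0])
                and stringprep.in_table_d1(normalised[-1])):
            raise ValueError("R/AL label must start and end with R/AL")

    return normalised
```

    For example, nameprep_sketch("Stra\u00DFe") folds the sharp s during
    mapping, and a soft hyphen (U+00AD) is simply deleted.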

    The purpose of normalisation here is to remove homographs. In general this
    only works within a script - confusion caused by mixing scripts has to be
    handled by other means. However, there appear to be gaps in Unicode
    normalisation which cannot now be corrected in the standard normalisation
    forms, since those are frozen by the stability policy.
    Some of these may be genuine omissions - in other cases there may be valid
    disputes as to whether some sequences should be equivalent. There is also a
    normalisation problem with combining characters of class 0, partly dealt
    with by the Unicode standard defining the 'proper' sequencing in common
    cases.
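
    The class 0 problem can be seen with a small sketch, assuming Python's
    unicodedata module. Marks with different non-zero combining classes are
    reordered by normalisation, so either typing order ends up the same;
    marks of class 0 are never reordered, so two orders of the same marks
    stay distinct. The helper name same_after_nfkc and the sample strings are
    my own illustrative choices.

```python
import unicodedata


def same_after_nfkc(a: str, b: str) -> bool:
    """True if the two strings are identical once NFKC has been applied."""
    return unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b)


# Latin: acute (class 230) and dot below (class 220) in both orders.
latin_1 = "a\u0301\u0323"
latin_2 = "a\u0323\u0301"

# Devanagari: KA with vowel sign U and anusvara, both of class 0,
# in both orders.
deva_1 = "\u0915\u0941\u0902"
deva_2 = "\u0915\u0902\u0941"

print(unicodedata.combining("\u0301"), unicodedata.combining("\u0323"))
print(unicodedata.combining("\u0941"), unicodedata.combining("\u0902"))
print(same_after_nfkc(latin_1, latin_2))  # canonical reordering applies
print(same_after_nfkc(deva_1, deva_2))    # class 0 marks are not reordered
```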

    Who is keeping track of these omissions for the purposes of IDN? Known
    examples include decompositions of Devanagari independent vowels (Unicode
    does not define any such decompositions) and unligated Latin digraphs.
    Conjuncts in Indic scripts (both Indian and non-Indian) are another
    potential problem area. Solutions may range from banning combinations (not
    currently a stringprep option) to customising Unicode normalisation (also
    not currently a stringprep option). Formally there is little difference,
    since the step that follows normalisation in the processing is prohibition.
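
    To make the two kinds of solution concrete, here is a minimal sketch in
    Python, assuming the unicodedata module. It takes one Devanagari case,
    the independent vowel AA (U+0906) against A followed by vowel sign AA
    (U+0905 U+093E), which NFKC leaves distinct. The table EXTRA_FOLDINGS and
    the functions fold_custom and check_banned are hypothetical names with a
    single illustrative entry; nothing like them exists in stringprep.

```python
import unicodedata

# Hypothetical table of sequences that NFKC leaves alone, each paired with
# the spelling one might want to treat as equivalent. One sample entry only.
EXTRA_FOLDINGS = {
    "\u0905\u093E": "\u0906",  # A + vowel sign AA -> independent vowel AA
}


def nfkc(text: str) -> str:
    """Standard NFKC, as used in the normalisation step."""
    return unicodedata.normalize("NFKC", text)


def fold_custom(text: str) -> str:
    """Option 1: customised normalisation, i.e. NFKC plus extra foldings."""
    out = nfkc(text)
    for sequence, replacement in EXTRA_FOLDINGS.items():
        out = out.replace(sequence, replacement)
    return out


def check_banned(text: str) -> str:
    """Option 2: ban the combinations instead of folding them."""
    out = nfkc(text)
    for sequence in EXTRA_FOLDINGS:
        if sequence in out:
            raise ValueError("banned sequence in label")
    return out


# NFKC alone does not unify the two spellings.
print(nfkc("\u0905\u093E") == nfkc("\u0906"))
# The customised folding does.
print(fold_custom("\u0905\u093E\u092E") == fold_custom("\u0906\u092E"))
```

    Either way the extra table does the real work; the only difference is
    whether a match is rewritten or rejected.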

    Richard.

    P.S. I understand that a 'customised Unicode normalisation' is just another
    string folding.


