From: Richard Wordingham (richard.wordingham@ntlworld.com)
Date: Mon May 07 2007 - 09:56:25 CDT
The current standard for Internationalized Domain Name (IDN) processing (nameprep,
RFC 3491, a profile of stringprep, RFC 3454) operates in four steps:
mapping, normalisation (NFKC), prohibition and bidi checking. Mapping
replaces single characters with sequences, which may be empty. It is composed
of two elements - deletion of default ignorables, and full case-folding, which
is complicated because it is applied before compatibility decomposition. (I may
have missed some minor wrinkles in mapping.)
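To make the steps concrete, here is a minimal sketch in Python (assuming
Python 3). Note that str.casefold() only approximates stringprep's full
case-folding tables (RFC 3454 Table B.2), and the deletion set below is a
small illustrative subset of the characters mapped to nothing (Table B.1):

    import unicodedata

    # Illustrative subset of RFC 3454 Table B.1 (mapped to nothing).
    MAPPED_TO_NOTHING = {'\u00ad', '\u200b', '\u200c', '\u200d', '\ufeff'}

    def nameprep_sketch(label: str) -> str:
        # Step 1: mapping - delete, then case-fold.  The real tables
        # fold before NFKC, so str.casefold() is only an approximation.
        mapped = ''.join(c for c in label if c not in MAPPED_TO_NOTHING)
        mapped = mapped.casefold()
        # Step 2: normalisation to form NFKC.
        normalised = unicodedata.normalize('NFKC', mapped)
        # Steps 3 and 4, prohibition and bidi checking, would scan the
        # result against the RFC 3454 tables; they are omitted here.
        return normalised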
The purpose of normalisation here is to remove homographs. In general this
only works within a script - confusion caused by mixing scripts has to be
handled by other means. However, there appear to be gaps in Unicode
normalisation which, under the normalisation stability policy, can no longer
be corrected in the standard normalisation forms. Some of these may be genuine
omissions - in other cases there may be valid disputes as to whether some
sequences should be equivalent. There is also a normalisation problem with
combining characters of class 0: canonical ordering never reorders them, so
sequences of such marks written in different orders remain distinct even when
they render identically. The Unicode standard deals with this partly by
defining the 'proper' sequencing in common cases.
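For illustration, a few lines of Python (assuming the standard unicodedata
module) show NFKC folding homographs together within a script while leaving
a cross-script confusable alone:

    import unicodedata

    # Within-script homographs that NFKC removes:
    print(unicodedata.normalize('NFKC', '\uFF41'))  # FULLWIDTH 'a' -> 'a'
    print(unicodedata.normalize('NFKC', '\uFB01'))  # ligature 'fi' -> 'fi'

    # Cross-script confusion is untouched: Cyrillic 'a' (U+0430) does
    # not normalise to Latin 'a'.
    print(unicodedata.normalize('NFKC', '\u0430') == 'a')  # False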
Who is keeping track of these omissions for the purposes of IDN? Known
examples include decompositions of Devanagari independent vowels (Unicode
does not define any such decompositions) and unligated Latin digraphs.
Conjuncts in Indic scripts (both Indian and non-Indian) are another
potential problem area. Solutions may range from banning combinations (not
currently a stringprep option, as stringprep prohibits only single
characters) to customising Unicode normalisation (also not currently a
stringprep option). Formally there is little difference between the two,
since the step that follows normalisation in the processing is prohibition.
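The Devanagari case can be checked directly, and a ban on the sequence
sketched, in Python; the banned-sequence list here is hypothetical, not
taken from any RFC:

    import unicodedata

    # U+0906 (letter AA) and U+0905 U+093E (letter A + vowel sign AA)
    # can render identically, but no normalisation form relates them:
    a = unicodedata.normalize('NFKC', '\u0906')
    b = unicodedata.normalize('NFKC', '\u0905\u093E')
    assert a != b

    # A hypothetical ban on the combination; stringprep as it stands
    # prohibits only single characters, not sequences.
    BANNED_SEQUENCES = ['\u0905\u093E']

    def check_sequences(label: str) -> None:
        for seq in BANNED_SEQUENCES:
            if seq in label:
                raise ValueError('prohibited sequence in label')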
Richard.
P.S. I understand that a 'customised Unicode normalisation' is just another
string folding.
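A customised normalisation in that sense might look like the following
sketch, again in Python; the extra fold is hypothetical and simply composes
a sequence mapping with standard NFKC:

    import unicodedata

    # Hypothetical extra fold: letter A + vowel sign AA -> letter AA.
    EXTRA_FOLDS = {'\u0905\u093E': '\u0906'}

    def custom_fold(label: str) -> str:
        folded = unicodedata.normalize('NFKC', label)
        for seq, repl in EXTRA_FOLDS.items():
            folded = folded.replace(seq, repl)
        # This particular mapping is stable under re-normalisation,
        # since U+0906 has no decomposition.
        return folded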