Re: Unicode abuse

From: Erik van der Poel (
Date: Sun Mar 06 2005 - 23:29:44 CST

    I'm sorry. Maybe I just confused people by bringing HTML into the
    discussion. So let me talk about Nameprep itself. For the sake of
    typeability, Nameprep normalizes compatibility characters via
    Unicode's Normalization Form KC (NFKC). The example that everyone always
    seems to use is the set of "wide" characters used in Japan, etc. The
    claim is that those wide characters are very easy to type in Japanese
    input methods, and that it would be nice if Nameprep automatically
    mapped those characters to the "real" characters, i.e. the normal width
    versions. E.g. wide 'a' becomes regular 'a'.
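
    To make the wide-character case concrete, here is a minimal sketch
    using Python's standard unicodedata module. It only shows the NFKC
    step, not the rest of Nameprep, and it assumes the interpreter's
    current Unicode tables rather than the Unicode 3.2 tables that
    Nameprep is defined on. The helper name nfkc is just for illustration.

```python
import unicodedata


def nfkc(text):
    """Apply Unicode Normalization Form KC to text."""
    return unicodedata.normalize("NFKC", text)


wide_a = "\uFF41"  # FULLWIDTH LATIN SMALL LETTER A
mapped = nfkc(wide_a)

# Show the code point and character name before and after normalization.
print(f"U+{ord(wide_a):04X} {unicodedata.name(wide_a)}")
print(f"U+{ord(mapped):04X} {unicodedata.name(mapped)}")
```

    The same call folds a whole fullwidth label, which is the
    typeability argument in a nutshell: whatever the Japanese input
    method produced, the normal width letters come out.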

    Now, instead of adopting bits and pieces of Unicode and NFKC, Nameprep
    kept things simple and adopted the whole process. The result
    was that a number of not-so-easily typed characters, such as
    double-struck C, also got included. There is no good reason to map
    double-struck C to regular 'c' because the regular 'c' is far easier to
    type at the keyboard, unlike the Japanese wide character case.
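
    The difference between the two cases is visible in the Unicode
    character data itself. The sketch below (again plain Python with
    unicodedata, on the interpreter's current Unicode tables, with a
    hypothetical helper compat_tag) prints the compatibility tag and the
    NFKC result for wide 'a' and for double-struck C. Note that NFKC
    alone yields capital 'C'; the lowercase 'c' comes from the case
    folding that Nameprep applies in its mapping step.

```python
import unicodedata


def compat_tag(ch):
    """Return the compatibility tag of ch's decomposition (e.g. '<wide>'),
    or None if ch has no compatibility decomposition."""
    decomp = unicodedata.decomposition(ch)
    if decomp.startswith("<"):
        return decomp.split()[0]
    return None


for ch in ("\uFF41", "\u2102"):  # wide 'a', double-struck C
    folded = unicodedata.normalize("NFKC", ch)
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: "
          f"tag={compat_tag(ch)} NFKC={folded!r}")
```

    A profile that wanted only the width mappings could, in principle,
    key off tags like these instead of running every compatibility
    character through NFKC. That is an illustration of the idea, not
    something Nameprep does.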

    So, all I'm saying is that by adopting basically all of Unicode 3.2 and
    the whole NFKC process for those characters (followed by some
    prohibitions after those steps), Nameprep ended up allowing such
    inappropriate characters as double-struck C to be fed into the mapping
    process. I believe this was unnecessary. Nameprep could instead have
    chosen to return an error upon encountering double-struck C before


