Re: Canonical equivalence in rendering: mandatory or recommended?

From: John Cowan
Date: Wed Oct 15 2003 - 11:00:17 CST

Jill Ramonsky scripsit:

> I had to write an API for my employer last year to handle some aspects
> of Unicode. We normalised everything to NFD, not NFC (but that's easier,
> not harder). Nonetheless, all the string handling routines were not
> allowed to /assume/ that the input was in NFD, but they had to guarantee
> that the output was. These routines, therefore, had to do a "convert to
> NFD" on every input, even if the input were already in NFD. This did
> have a significant performance hit, since we were handling (Unicode)
> strings throughout the app.

Indeed it would. However, checking for normalization is cheaper than
normalizing, and Unicode makes properties available (the Quick_Check
property of UAX #15) that allow a streamlined but incomplete check that
returns "not normalized" or "maybe normalized".
So input can be handled as follows:

        if maybeNormalized(input)
        then if normalized(input)
                then doTheWork(input)
                else doTheWork(normalize(input))
        else doTheWork(normalize(input))
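The logic above can be sketched in Python. This is my sketch, not the
poster's API: the function name is made up, the cheap "maybe normalized"
screen here is the ASCII fast path (ASCII text is normalized under every
form), and `unicodedata.is_normalized` (Python 3.8+) plays the role of
the full check.

```python
import unicodedata

def to_nfd(s: str) -> str:
    """Return s in NFD, skipping the conversion when checks show it
    is already normalized (a sketch, not the original API)."""
    # Cheap, incomplete screen: pure-ASCII text is already in NFD
    # (and every other normalization form), so nothing more to do.
    if s.isascii():
        return s
    # Full check: cheaper than normalizing when the string passes,
    # since it avoids allocating and building a new string.
    if unicodedata.is_normalized("NFD", s):
        return s
    return unicodedata.normalize("NFD", s)
```

The point of the two-tier structure is that most real-world input
passes one of the cheap checks, so the expensive `normalize` call runs
only on the strings that actually need it.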

The W3C recommends, however, that non-normalized input be rejected rather
than forcibly normalized, on the ground that the supplier of the input
is not meeting his contract.

> I think that next time I write a similar API, I will deal with
> (string+bool) pairs, instead of plain strings, with the bool meaning
> "already normalised". This would definitely speed things up. Of course,
> for any strings coming in from "outside", I'd still have to assume they
> were not normalised, just in case.

W3C refers to this concept as "certified text". It's a good idea.
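A minimal sketch of such a (string+bool) pair in Python; the class and
function names are mine, invented for illustration, not anything from
W3C or the poster's API:

```python
from dataclasses import dataclass
import unicodedata

@dataclass(frozen=True)
class CertifiedText:
    # (string + bool) pair: nfd=True certifies "already in NFD",
    # so downstream routines may skip renormalizing.
    text: str
    nfd: bool = False

def certify(s: str) -> CertifiedText:
    """Normalize once at the boundary; internal code trusts the flag."""
    return CertifiedText(unicodedata.normalize("NFD", s), nfd=True)

def from_outside(s: str) -> CertifiedText:
    """Strings from 'outside' are assumed not normalized, just in case."""
    return CertifiedText(s, nfd=False)
```

The design mirrors the quoted idea: pay for normalization once at the
edges of the system, and let the flag carry that guarantee inward.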

> Jill

Verbogeny is one of the pleasurettes    John Cowan <>
of a creatific thinkerizer.   
   -- Peter da Silva          

This archive was generated by hypermail 2.1.5 : Thu Jan 18 2007 - 15:54:24 CST