RE: Unicode Search Engines

From: Marco Cimarosti (marco.cimarosti@essetre.it)
Date: Wed Feb 20 2002 - 13:29:09 EST


John Cowan wrote:
> Marco Cimarosti scripsit:
> > Why? Isn't that what W3C asked?
>
> No. The W3C CharMod wants receivers to check normalization and
> reject unnormalized documents, *not* to normalize input. Silently
> normalizing input can conceal the existence of a security-related
> spoof that is NFC-equivalent to a genuine document.
> It is essentially the same reason that broken HTML or broken UTF-8
> should not be silently repaired.
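
If I understand the CharMod requirement correctly, the receiver only
checks and never rewrites. Something along these lines, sketched in
Python with the standard unicodedata module (the function name is mine):

    import unicodedata

    def accept_text(text):
        # Accept only text that is already in NFC; do not normalize it
        # ourselves, so an NFC-equivalent spoof cannot slip through.
        return unicodedata.normalize("NFC", text) == text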

How does this happen in practice? Is it the authors (or their authoring
tools) of HTML documents, or the web servers, that do the normalization?

And what about documents not in Unicode? Should they be converted into
Unicode, normalized, and then converted back into the original encoding?
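
I imagine that round trip would look roughly like this (the charset name
is only an example):

    import unicodedata

    def normalize_legacy(raw_bytes, charset="iso-8859-1"):
        # Decode the legacy document to Unicode, apply NFC, then encode
        # it back; the re-encoding may fail if NFC produces characters
        # that the original charset cannot represent.
        text = raw_bytes.decode(charset)
        return unicodedata.normalize("NFC", text).encode(charset)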

> > BTW, are you sure that it is NFKC? My understanding is that
> > it was NFC + some extra passages.
>
> It is NFC, with the additional proviso that n11n must be done even
> if characters appear as character references (&#xnnnn;) rather than
> actual characters.
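
So, if I follow, a document whose raw markup is pure ASCII can still be
unnormalized once the references are expanded. A quick sketch (the
example string is mine):

    import html
    import unicodedata

    markup = "caf&#x0065;&#x0301;"    # "e" plus combining acute, as references
    expanded = html.unescape(markup)  # yields a decomposed e-acute
    print(unicodedata.normalize("NFC", expanded) == expanded)  # False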

I should not have said "extra passages" but rather "extra rules". I think
that there is a list of Unicode characters which are not allowed (forbidden?
deprecated?) in the HTML specs. E.g., I think that the bidi controls are not
allowed because they duplicate what the markup already provides.
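
What I have in mind is something like this check; the list of code
points is from memory, not taken from any spec:

    # Explicit bidi embedding/override controls, which markup such as
    # <bdo dir="..."> is meant to replace.
    BIDI_CONTROLS = {"\u202A", "\u202B", "\u202C", "\u202D", "\u202E"}

    def uses_bidi_controls(text):
        return any(ch in BIDI_CONTROLS for ch in text)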

_ Marco
