Re: Diacritic and similar foldings and spam filtering

From: Peter Kirk
Date: Thu Jul 08 2004 - 17:46:31 CDT

    On 08/07/2004 23:22, Doug Ewell wrote:

    >Peter Kirk <peterkirk at qaya dot org> wrote:
    >>António suggested a serious point that for more comprehensive spam
    >>filtering an enhanced folding might be useful, including such foldings
    >>as | > I (capital i) and l (small L), 0 (zero) > O, |\/| > M. Would
    >>such foldings in fact be feasible and useful? They would have to be
    >>part of a general similar shapes folding.
    >They might be useful for certain applications, in specific situations,
    >but Unicode should not ever try to get entangled in this business of
    >mapping unrelated characters on the basis of glyph similarity alone.
    >It's just too font-dependent and subjective.
    >See the sub-heading "Spoofing" in TUS 4.0, Section 5.19 "Unicode
    >Security," pp. 141-142 for more information.
    Thank you for pointing me to this section. This is a useful discussion
    which shows clearly why spoofing cannot be avoided by identical encoding
    of confusables. (And I am glad to see some clearer terminology than I
    had been using.) But it doesn't address my point that UTR #30 folding
    can be useful in this area, in providing a framework for what might be
    called "confusable folding".

    But I think I agree with you that Unicode should not get into detailed
    listing of confusables, because it is too font-dependent and subjective.
    This kind of thing is best left as a user definable folding.
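
    To make "user definable folding" concrete, here is a minimal sketch of how
    a spam filter might chain a diacritic folding with a locally maintained
    confusable folding. The table contents and the function names
    (fold_diacritics, fold_confusables, spam_key) are purely illustrative
    assumptions, not anything defined by UTR #30 or any Unicode data file.

```python
import unicodedata

# Hypothetical, user-maintained tables (an assumption for illustration only).
# Multi-character look-alikes are listed separately so they can be replaced
# before the single-character pass.
CONFUSABLE_SEQUENCES = {"|\\/|": "M"}
CONFUSABLE_CHARS = {"|": "l", "I": "l", "0": "O"}


def fold_diacritics(text: str) -> str:
    """Decompose to NFD and drop combining marks (a simple diacritic folding)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def fold_confusables(text: str) -> str:
    """Apply the user-defined look-alike table: longest sequences first."""
    for seq in sorted(CONFUSABLE_SEQUENCES, key=len, reverse=True):
        text = text.replace(seq, CONFUSABLE_SEQUENCES[seq])
    return "".join(CONFUSABLE_CHARS.get(ch, ch) for ch in text)


def spam_key(text: str) -> str:
    """Matching key only; the folded text is not meant to be displayed."""
    return fold_confusables(fold_diacritics(text)).casefold()


if __name__ == "__main__":
    for sample in ("|\\/|0RTGAGE", "Mórtgage", "mortgage"):
        print(repr(sample), "->", repr(spam_key(sample)))
```

    Because the table lives with the filter rather than in the standard, each
    site can tune it to the fonts and tricks it actually sees, which is the
    point of leaving it user definable.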

    Actually I am unclear from UTR #30 whether it is meant to be a
    framework for user definable foldings or is restricted to the
    defined list of foldings. The existence of "Foldings based on tailored
    collation data" suggests that foldings can at least be tailored, but
    there are no further details of how such foldings are covered by the UTR.

    Peter Kirk
