From: Asmus Freytag
Date: Mon Feb 23 2009 - 01:55:11 CST


    I think Michael D. Adams has a point, which is worth considering.

    NFC is clearly an important format and part of the reason to have
    this FAQ is to convince people that normalization to NFC is not

    The full argument has two prongs. You've delivered the first one,
    that is, "If I had to normalize the web, how complex is the task?"
    Your answer, of course, is that most of the web is already close to
    NFC, so on average little work remains beyond verification
    (quick check).
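
    As a rough illustration of that verify-first approach, here is a
    minimal sketch in Python (an assumption of this example, not
    something the FAQ prescribes). It relies on the standard library's
    unicodedata.is_normalized, available from Python 3.8; the helper
    name to_nfc is hypothetical.

```python
import unicodedata


def to_nfc(text: str) -> str:
    """Return text in NFC, doing the full rewrite only when needed."""
    # Verification step: data that is already NFC is returned untouched,
    # so no new string is built for it.
    if unicodedata.is_normalized("NFC", text):
        return text
    # Slow path: only data that fails verification is actually normalized.
    return unicodedata.normalize("NFC", text)


if __name__ == "__main__":
    already_nfc = "caf\u00e9"      # precomposed e-acute
    decomposed = "cafe\u0301"      # e followed by combining acute accent
    print(to_nfc(already_nfc) is already_nfc)
    print(to_nfc(decomposed) == already_nfc)
```

    For text that is mostly NFC already, nearly every call ends at the
    check, which is the situation the first prong describes.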

    The second question on which people want to be reassured is as follows:
    "Is there some known data format that, for data in some language,
    forces NFC to be unacceptably slow, if I have to predominantly
    process data from that language?"

    I believe that the answer to that question is also largely reassuring,
    because most languages (or data formats) don't produce infinitely
    long runs of combining characters that need composition or reordering.

    European data in NFD, which I suspect is not an actual worst
    case in that light, has about 10% of its characters in need of
    composition, but, as doubly accented characters are not
    part of the usual alphabets, there's little scope for reordering.

    Any implementation that fast-tracks the remaining 90% of
    characters in such data is still going to be fast. And any
    dilution of such data with HTML/XML keys is going to
    improve matters.
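
    One way to put numbers on a given sample is to measure, after
    decomposing it to NFD, what share of its code points are combining
    marks and how long the longest run of marks is. The sketch below
    does this in Python with the standard unicodedata module; the
    function name combining_stats and the sample strings are
    hypothetical, and real measurements would need a representative
    corpus.

```python
import unicodedata


def combining_stats(text: str) -> tuple:
    """Return (fraction of combining marks, longest run of marks) for text in NFD."""
    nfd = unicodedata.normalize("NFD", text)
    marks = 0      # code points with a non-zero canonical combining class
    longest = 0    # longest run of consecutive combining marks
    run = 0
    for ch in nfd:
        if unicodedata.combining(ch):
            marks += 1
            run += 1
            longest = max(longest, run)
        else:
            run = 0
    fraction = marks / len(nfd) if nfd else 0.0
    return fraction, longest


if __name__ == "__main__":
    # Plain markup contains no combining marks, so it dilutes the data.
    print(combining_stats("<p>r\u00e9sum\u00e9</p>"))
    # A doubly accented letter yields a run of two marks, where reordering can matter.
    print(combining_stats("\u1ec7"))
```

    A run length of 1 means there is nothing to reorder; only runs of
    two or more marks give canonical reordering any work to do.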

    However, in order to win over people who harbor doubts,
    it would be useful if you (or people with experience of
    challenging combinations of language and data format)
    could describe what "realistic" worst cases might look like
    and discuss how they would affect performance in
    situations where such data dominates.

    I suspect that the results would still point to
    encouragingly low upper bounds, but at the moment
    the argument's second prong is not finished.

