Re: UTS#10 (collation) : French backwards level 2, and word-breakers.

From: Kenneth Whistler (kenw@sybase.com)
Date: Thu Jul 29 2010 - 17:52:59 CDT


    A couple of weeks ago, in this thread Philippe Verdy said:

    > Breaking on words, even if it requires a very modest buffering,
    > will significantly improve the processing time,
    > because each word in the long texts will be scanned only
    > once, and all the rest will occur within the small and
    > constantly reused buffer.
    ...
    > I don't forget that in most practical cases, sorts will operate
    > on texts whose collation keys have been only partly
    > generated and truncated, because they really speed up and
    > reduce the number of compares to perform ...

    and so on.

    Rather than continuing the discussion with a back and forth in
    email, I decided to write a Unicode Technical Note
    on the general topic, including a case study of alternative
    orderings for a French topic list.
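
    As a rough illustration of the "French backwards level 2" in the
    subject line, here is a toy sketch. It is not the UCA algorithm and
    is not taken from the Technical Note: the function name toy_key,
    the use of combining-mark code points as secondary weights, and the
    four-word list are all assumptions made for the example. It only
    shows how reversing the accent (secondary) weights changes the
    order of words that are identical at the primary level.

```python
import unicodedata


def toy_key(word, backwards_secondary=False):
    """Build a two-level toy sort key (hypothetical, not real UCA weights).

    Level 1: base letters, lowercased.
    Level 2: one weight per base letter, 0 if unaccented, otherwise the
             code point of the last combining mark attached to it.
    """
    primary = []
    secondary = []
    # NFD splits accented letters into base letter + combining mark(s).
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch):
            if secondary:
                secondary[-1] = ord(ch)
        else:
            primary.append(ch.lower())
            secondary.append(0)
    if backwards_secondary:
        # "French" ordering: compare accents starting from the end of the string.
        secondary.reverse()
    return (tuple(primary), tuple(secondary))


words = ["côté", "coté", "côte", "cote"]

forward = sorted(words, key=toy_key)
backward = sorted(words, key=lambda w: toy_key(w, backwards_secondary=True))

print(forward)
print(backward)
```

    With forward accent comparison the list comes out as cote, coté,
    côte, côté; with the secondary level reversed it comes out as
    cote, côte, coté, côté, because the accent nearest the end of the
    word now decides first.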

    Those who are interested in collation and in the particular issues
    that were discussed in this thread may wish to take a look:

    http://www.unicode.org/notes/tn34/

    --Ken


