Re: UTS#10 (collation) : French backwards level 2, and word-breakers.

From: verdy_p (verdy_p@wanadoo.fr)
Date: Mon Jul 12 2010 - 22:08:08 CDT

  • Next message: Philippe Verdy: "RE: UTS#10 (collation) : French backwards level 2, and word-breakers."

    "Kenneth Whistler"
    > To: verdy_p@wanadoo.fr
    > Cc: unicode@unicode.org, kenw@sybase.com
    > Subject: Re: UTS#10 (collation) : French backwards level 2, and word-breakers.
    >
    >
    > Philippe Verdy said:
    >
    > > A basic word-breaker using only the space separator would marvelously
    > > improve the speed of French sorting even if backwards ordering occurs,
    > > just because it would significantly improve the data locality in the
    > > implementation and would considerably reduce the reallocations and use
    > > of large buffers.
    >
    > I disagree. Any implementation of an incremental comparison based on
    > UCA which is making significant use of reallocations and large buffers
    > isn't optimized in the first place.

    You are speaking of comparison of text pairs. That's not the same as computing sort keys.

    And you had already brought up the case of sort keys in a prior message.

    > An optimized incremental comparison walks down both string 1 and
    > string 2, pulling up the relevant comparable collation elements
    > and tracking state for primary, secondary, and tertiary weights
    > as it goes, bailing with an answer at the first available point
    > that the comparison becomes definitively non-equal. That can be
    > done without *any* buffer allocation or reallocation, using
    > relatively tiny buffers allocated on the stack, and definitely
    > has better "data locality" than first trying to chunk strings at
    > word boundaries.
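
    For reference, the forward-only incremental compare you describe looks roughly like the sketch below (Python;
    collation_elements() is a hypothetical callable yielding (L1, L2, L3) weight triples, the two element sequences
    are assumed to stay aligned, and of course this gives no backwards level 2, which is the case that started this
    thread):

        from itertools import zip_longest

        def incremental_compare(s1, s2, collation_elements):
            # Walk both strings in parallel, pulling collation elements lazily
            # and bailing out at the first definitive (primary) difference.
            first_l2 = 0   # sign of the first secondary difference seen so far
            first_l3 = 0   # sign of the first tertiary difference seen so far
            pad = (0, 0, 0)
            for ce1, ce2 in zip_longest(collation_elements(s1),
                                        collation_elements(s2), fillvalue=pad):
                if ce1[0] != ce2[0]:                     # primary difference decides
                    return -1 if ce1[0] < ce2[0] else 1
                if first_l2 == 0 and ce1[1] != ce2[1]:   # remember first L2 difference
                    first_l2 = -1 if ce1[1] < ce2[1] else 1
                if first_l3 == 0 and ce1[2] != ce2[2]:   # remember first L3 difference
                    first_l3 = -1 if ce1[2] < ce2[2] else 1
            return first_l2 or first_l3                  # 0 if fully equal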

    Regarding data locality you're still wrong, because you've not understood what I said:

    Consider the case where the relative order of two strings is fully determined after scanning just their beginnings,
    as soon as ANY difference at any level is found in the first word that differs.

    This case is really very common. For data locality, it means that the ends of BOTH strings, which can be thousands
    of bytes long, will never be scanned at all to make the decision. There will be a SINGLE pass over each word of
    each string, even if, I DO admit it, each word may itself need to be scanned several times.

    But consider as well the need for breaking texts into collation elements: this is really the complex area of UCA,
    because it has to be performed several times and involves quite complex decisions. These decisions have an even
    higher complexity than word-breaking.

    If you have to scan each word several times, you'll have to repeat this complex logic, which must also take into
    account the default grapheme cluster boundaries for a correct determination of normalization. In all cases, this
    requires multiple bufferings and rebufferings: accumulating the default grapheme clusters, normalizing them, then
    looking up the matching 1-to-N and M-to-N collation mappings to get the collation weights to compare at some level.

    And all this work that occurs within one word will have to be performed once again: it will be much slower if these
    multiple passes over default grapheme boundaries are separated in time by having scanned BOTH texts entirely at one
    level just to see that they still compare equal, when you could have stopped early by detecting ANY difference
    within the same word.

    Breaking on words, even if it requires a very modest amount of buffering, will significantly improve the processing
    time, because each word in the long texts will be scanned only once, and all the rest will occur within a small,
    constantly reused buffer.

    And the same implementation can generate the collation string of each word individually within this small reused
    buffer, both when you are about to generate and store sort keys and when comparing string pairs (see the sketch
    below).
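
    A rough sketch of what I mean, assuming the trivial space breaker and a hypothetical word_collation_key() that
    builds the full multi-level key of one word inside the small reused buffer (note that this gives a word-by-word
    ordering rather than the whole-string ordering, which is exactly the behaviour under discussion):

        from itertools import zip_longest

        def compare_by_words(s1, s2, word_collation_key):
            # Each word of each text is scanned once, through all levels,
            # and the comparison stops at the first word that differs.
            for w1, w2 in zip_longest(s1.split(' '), s2.split(' '), fillvalue=''):
                k1 = word_collation_key(w1)     # single pass over this word only
                k2 = word_collation_key(w2)
                if k1 != k2:                    # decided: later words are never scanned
                    return -1 if k1 < k2 else 1
            return 0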

    I don't forget that in most practical cases, sorts will operate on texts whose collation keys have been only partly
    generated and truncated, because truncated keys really do speed things up and reduce the number of compares to
    perform: when you are sorting texts, any fast algorithm will still need to compare the same text against a variable
    number of other candidates for swapping (on average this number of compares per string in a list of size N grows in
    O(log N) with QuickSort, but may still have a worst case that grows in O(N)).

    So the cumulative effects of low data locality when scanning any "long" text made of more than one word (e.g. a
    bibliographic entry, with author name, book title, subtitle, chapter title...) will be multiplied by O(log N), up
    to O(N). And data locality will be worse, because each of those many text-pair compares will have to rescan the
    source texts, which sit in very different locations in memory (possibly even in external storage requiring I/O with
    its own internal buffering, or in swapped-out memory pages), instead of performing the collation level passes only
    within the small local buffers for each word.
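
    A sketch of what I mean by sorting with truncated stored keys (Python; sort_key() and compare_full() are
    hypothetical helpers standing for the key generator and the full comparison):

        import functools

        def sort_with_truncated_keys(texts, sort_key, compare_full, prefix_len=32):
            # Precompute short truncated keys once; the O(N log N) compares run
            # on those, and the full comparison is only needed for the rare
            # pairs whose truncated keys collide.
            prefixes = {t: sort_key(t)[:prefix_len] for t in texts}

            def cmp(a, b):
                ka, kb = prefixes[a], prefixes[b]
                if ka != kb:
                    return -1 if ka < kb else 1
                return compare_full(a, b)       # fall back to the texts themselves

            return sorted(texts, key=functools.cmp_to_key(cmp))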

    > > A word breaker anyway has a very tiny impact on performance, compared
    > > to the gain you can have by using smaller data buffers: most of the
    > > cost of UCA falls within the huge cost of normalizations and complex
    > > analysis of M-to-N mappings of collation elements, and in
    > > interpreting/locating the DUCET table entries.
    >
    > That much I *do* agree with, however. If you have to normalize
    > entire strings before starting comparison, that is a significant
    > processing cost. The answer to that is architectural -- if you
    > have to repeatedly normalize strings before comparing them,
    > then store the normalized strings and compare *those* instead.
    > Also, even in the case where dealing with "wild" data input that
    > cannot be assumed to be pre-normalized, a test to check whether
    > a string is in NFC is *much* quicker than actually doing the
    > normalization, and for most data the answer is yes, anyway.
    >
    > As for the M-to-N mappings and the collation element lookup,
    > that is unavoidable, however else you build the comparison, so
    > you just do the best job to make the lookup of collation
    > elements fast.
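
    (In Python terms, that "check first, normalize only if needed" point is simply the following;
    unicodedata.is_normalized() exists since Python 3.8:)

        import unicodedata

        def ensure_nfc(s):
            # The check is much cheaper than actually normalizing, and most
            # real-world data already passes it.
            if unicodedata.is_normalized('NFC', s):
                return s
            return unicodedata.normalize('NFC', s)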

    And admit it, these mappings, which require determining conditional breaks between one or more default grapheme
    clusters (each of which only needs a single-state finite automaton), are more complex than word-breaking, which
    just adds a single state to the scanning automaton. It is best to avoid performing all these scans multiple times,
    and best if you can perform the DUCET lookups only once (see the sketch after this list):

    - you'll need a recyclable input buffer, whose small size is bounded by word sizes, for scanning default grapheme
    clusters and performing normalization only once per cluster;

    - you'll need a few weight buffers (one per collation level), whose sizes are also more or less proportional to
    word sizes, so that all the M-to-N lookups in the DUCET (or its tailoring) are performed only once, during the
    first pass for level 1;

    - no extra buffering is needed for level 2 and beyond: everything is already in these small buffers, which can
    simply be scanned linearly without any breakers or any additional lookup.
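
    A sketch of that single per-word pass (Python; collation_elements() is again a hypothetical iterator of
    (L1, L2, L3) triples looked up once in the DUCET or its tailoring; zero weights are ignorable at their level):

        def word_weights(word, collation_elements, french_backwards=True):
            # One pass over one word: the lookups happen once, and the small
            # per-word level buffers are filled as a side effect, so the later
            # levels need no further breaking or lookups.
            l1, l2, l3 = [], [], []
            for p, s, t in collation_elements(word):
                if p:
                    l1.append(p)
                if s:
                    l2.append(s)
                if t:
                    l3.append(t)
            if french_backwards:
                l2.reverse()   # backwards level 2 confined to this word
                               # (or just scan this buffer from the end in pass 2)
            return l1, l2, l3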

    > But your argument that a word breaker "has a very tiny impact
    > on performance" compared to normalization or collation element
    > lookup is moot, if the addition of the word breaker isn't
    > going to improve the incremental comparison, anyway -- which
    > is my argument. And unless your word breaker really is
    > a trivial one (i.e. break on spaces), it isn't correct to
    > maintain that it is a "tiny impact" compared to normalization,
    > for example.

    Yes, I maintain it: this is an EXTREMELY TINY impact, as the word breaker CANNOT add more complexity than the
    following (a sketch follows this list):

    - the combining sequence breaker used for normalization substitutions and reorderings, and needed for Unicode
    conformance (this requires an input buffer anyway, which you can keep until the last collation level for the binary
    compares needed if all collation weights compare equal within the same word);

    - the default grapheme cluster breaker needed for UCA conformance (it is also at this level that a word breaker
    works, adding at most one additional state to the finite state automaton: no extra buffering, just a single
    enumerable state shared by both breakers);

    - the collation element breakers needed to perform substitutions and/or lookups in the table of collation weights:
    it is at this level that you need small buffers during the level 1 pass, for accumulating the collation weights
    that ''may'' be used later; but if your collation is limited to level 1, you don't even need these buffers, and the
    first input buffer already used for normalization is enough to perform the final binary compares.
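
    To make that concrete, here is how little the trivial breaker adds on top of the combining-sequence scan (a Python
    sketch only, not a UAX #29 word breaker):

        import unicodedata

        def word_chunks(text):
            # One extra test ("is this starter a space?") on top of the
            # starter/mark distinction the normalizer already tracks.
            word = []
            for ch in text:
                starter = unicodedata.combining(ch) == 0   # state already needed
                if starter and ch == ' ':                  # the single extra test
                    if word:
                        yield ''.join(word)
                        word = []
                else:
                    word.append(ch)
            if word:
                yield ''.join(word)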

    For French, for example, as all words are at most 25 characters long, and given that only 3 levels are needed for
    the most precise case, this just requires:

    - a small buffer of 25 code points (or UTF-16 code units) for input and normalization. This buffer may be resizable
    in exceptional cases for "pathological words" (in fact codes or numbers) like AAAAAAAAAAAAAAAAAAAAAAA.

    - 2 small buffers for 25 secondary weights and 25 tertiary weights. Only the collation weight buffer for level 2
    needs to be processed in backwards order, and there's actually no need to flip it for pass 2: you can simply scan
    it from the end.

    - So the total buffering needed is 75 integers (at most) per string for non-pathological cases, and all texts will
    be processed only once: data locality will be excellent, even if, for safety on edge cases, those buffers remain
    resizable.
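
    The whole per-word working set then fits in something as small as this (a sketch of the sizing argument only; the
    capacity and the grow() fallback are just the assumptions stated above):

        WORD_CAPACITY = 25   # covers ordinary French words; grown only if exceeded

        class WordBuffers:
            # About 75 integers in total, reused for every word of every text.
            def __init__(self, capacity=WORD_CAPACITY):
                self.codepoints = [0] * capacity   # normalized input of the word
                self.level2 = [0] * capacity       # secondary weights (read backwards)
                self.level3 = [0] * capacity       # tertiary weights

            def grow(self):
                # Rare path for pathological "words" (codes, numbers, AAAA...).
                for buf in (self.codepoints, self.level2, self.level3):
                    buf.extend([0] * len(buf))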

    - Yes, you may also stop buffering if these buffers reach their maximum size on pathological cases; in that case
    you'll have to re-perform the scans, but you can very easily avoid the most common worst case and the cost of these
    additional passes, which occurs when the normalized inputs are binary-equal.

    And with exactly the same shared source code above, you can either return each generated collation weight
    immediately, for building a function that computes a collation string for storage, or use it for comparing pairs of
    texts: you'll have built a collation weight enumerator for level 1, a separate enumerator for the other levels, and
    separate code for returning the normalized code points. All three of these enumerators can be used to compare
    strings directly within those small buffers; you may then (a sketch follows this list):

    - flush all 3 buffers entirely between words when just performing text compares;

    - feed a collation string / sort key output buffer, if you want to store it (which you may also truncate
    immediately to the maximum sort key length that your storage will accept).
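
    For instance, feeding the stored key from the same per-word enumerators might look like this (Python; word_weights
    is the hypothetical per-word pass sketched earlier, and the 16-bit weight encoding is only an assumption for
    illustration):

        def word_sort_key(word, word_weights, max_len=None):
            # The same per-word weight passes feed either a direct compare or a
            # stored sort key; here they feed the key, optionally truncated to
            # the maximum length the storage accepts.
            l1, l2, l3 = word_weights(word)
            key = bytearray()
            for level in (l1, l2, l3):
                for w in level:
                    key += w.to_bytes(2, 'big')    # append this level's weights
                key += b'\x00\x00'                 # level separator
            return bytes(key) if max_len is None else bytes(key[:max_len])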

    Your approach, on the opposite, will still use the input buffer for normalization (my approach does not change it)
    and an output buffer for the collation string (so my approach makes no difference there), but it will perform many
    more lookups, as the complex collation element breaking and the complex lookups of collation weights will be
    repeated on each pass (even if you too can shortcut the case where two binary-identical strings compare equal, to
    avoid the extra passes).



    This archive was generated by hypermail 2.1.5 : Mon Jul 12 2010 - 22:10:23 CDT