Re: data for collation tests

From: Alain LaBonté (alb@sct.gouv.qc.ca)
Date: Fri Feb 07 1997 - 16:17:24 EST


At 12:51 97-02-07 -0800, unicode@Unicode.ORG wrote:
>A word containing spaces I would personally call a 'set phrase' in
>English. Is this term acceptable for use in iso10646/unicode
>discussions?
>mg
>
>Marion Gunn
>mgunn@egt.ie
>
>
>unicode@Unicode.ORG wrote:
>>...
>> examples... This notion of word that you use is Germanic. Good in this
>> context where spaces would be removed to form a new "word". OED has tens of
>> pages of definitions for "word", including one for which it is precisely
>> stated that a word may contain spaces.
>>
>> Alain LaBonté
>> Québec

As far as sorting is concerned, it would be preferable in information
technology to use the term "field", which is traditional in computer
programming. A "cultural sort" of a field should not care about spaces (or,
if spaces really are meant to count, a special space should be used for that
purpose, as in the Canadian standard CAN/CSA Z243.4.1). If a space serves as
a delimiter between two entities that you call a "set phrase", then another
field should be used rather than an artificial delimiter.

Typically we compare corresponding fields with each other. Only when they
are equal do we go on to compare the next fields.
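
As a rough illustration of this field-by-field comparison, here is a minimal
Python sketch. The function name compare_records and the sample records are
hypothetical, and the comparison inside each field is plain code-point order,
standing in for whatever cultural ordering would really be applied.

```python
from functools import cmp_to_key
from typing import Sequence


def compare_records(a: Sequence[str], b: Sequence[str]) -> int:
    """Compare two records field by field; return -1, 0, or 1."""
    for field_a, field_b in zip(a, b):
        if field_a != field_b:
            # The first unequal pair of fields decides the order.
            # Assumption: code-point order stands in for a cultural sort.
            return -1 if field_a < field_b else 1
    # Every shared field is equal: the record with fewer fields sorts first.
    return (len(a) > len(b)) - (len(a) < len(b))


# Hypothetical records: (family name, given name)
people = [("Smith", "John"), ("La Bonté", "Alain"), ("Smith", "Anne")]
ordered = sorted(people, key=cmp_to_key(compare_records))
```

The second field is consulted only for the two "Smith" records, because
their first fields are equal.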

This is the traditional way to sort in computer technology, and it allows
one to do everything.

But the TC37 terminology of "word by word" and "character by character" is
wrong as far as an actual understanding of what is going on is concerned. I
can assure you that what they call "word by word" is more "character by
character" than the other method. It is a purely positional,
character-by-character sort, in which even spaces are counted as characters,
while the other method ignores these characters at the first level of
comparison, as human beings do when they search in a dictionary (telephone
directories sort by fields, first names first, second names after, and
within a name, sorting is done as it would be done in a dictionary). So their
(TC37's) terminology has to be changed. I've been saying this for years, but
the ISO/TC37ers do not seem to agree. With what they're doing, nobody is
going to find me in a telephone book! They should revisit their method, and
they will see that not all spaces were created equal.
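
To make the difference between the two methods concrete, here is a small
Python sketch under stated assumptions. The key functions positional_key and
dictionary_key are hypothetical names; case folding is added so that only the
treatment of spaces differs, and the tie-break on the spaced form is an
illustrative choice, not the rule of any particular standard. Accents and
other levels of a real cultural sort are not handled.

```python
def positional_key(s: str) -> str:
    # Purely positional: the space is a character like any other,
    # and it sorts before every letter.
    return s.casefold()


def dictionary_key(s: str) -> tuple[str, str]:
    # First level ignores spaces; the spaced form is used only
    # to break ties between otherwise equal entries (an assumption).
    folded = s.casefold()
    return (folded.replace(" ", ""), folded)


names = ["Newton", "New York", "Newark", "New Zealand"]

positional = sorted(names, key=positional_key)
dictionary = sorted(names, key=dictionary_key)
```

With the positional key, both "New ..." entries come before "Newark" and
"Newton", because the space is compared as a character. With the dictionary
key, the order is "Newark", "Newton", "New York", "New Zealand", which is
where a person scanning letter by letter would look for them.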

Alain La Bonté
Québec


