Multiscript sorting and language tags

From: Alain LaBonté - SCT (alb@riq.qc.ca)
Date: Fri Jun 20 1997 - 09:33:07 EDT


At 07:34 97-06-20 -0400, Winkler, Arnold F wrote:
>Alain,
>
>I am sure you have been following the discussion on the Unicode list
>about language tags. One of the reasons for tagging languages in plain
>text files is always the need for culturally correct sorting.
>
>Would your method work in a case as follows:
>
>In a 10646 plain text file you have 2 tables of names, one German, the
>other one Swedish. Each table needs to be sorted culturally correct.
>
>Is this possible with 14651, and what would that look like (in human
>terms, not C code)?
>
>I guess that this requirement will be valid for multilingual
>databases, etc.
>
>If you have time, please let me have your ideas.
>
>(Glad you found the book)
>
>Regards
>Arnold F. Winkler - Standards Management
>Tel: 610-993-7305, (Unisys NET-322-7305)
>Fax: 610-695-5473
>mailto:Arnold.Winkler@unisys.com

[Alain] :
The answer is YES, it will work. However, I have to clarify one important
point.

Sorting is always done on a set of records, according to user expectations.
Therefore, *for sorting purposes*, the language tag (for one given script)
has to be overridden by the user's language expectations, which are not
necessarily the same as the language tagged in a record or in a sub-record
(or field). Such tagging is of course, imho, necessary for other purposes.
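
As a rough illustration of this point, here is a minimal Python sketch in
which the ordering is chosen from the user's language and the per-record
tags are ignored for sorting. The two key functions and their tables are
toy assumptions (Swedish puts å, ä, ö after z; German folds ä, ö, ü onto
a, o, u); they are not the ISO/IEC 14651 tables, and the names
swedish_key, german_key and sort_for_user are hypothetical.

```python
import unicodedata

# Toy tables for illustration only; these are NOT the ISO/IEC 14651 tables.
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"  # å, ä, ö follow z
GERMAN_FOLD = str.maketrans(
    {"ä": "a", "ö": "o", "ü": "u", "å": "a", "ß": "ss"}
)


def swedish_key(name):
    """Key for a Swedish user: å, ä, ö are separate letters after z."""
    text = unicodedata.normalize("NFC", name).lower()
    unknown = len(SWEDISH_ALPHABET)  # any other character sorts last here
    return [
        SWEDISH_ALPHABET.index(ch) if ch in SWEDISH_ALPHABET else unknown
        for ch in text
    ]


def german_key(name):
    """Key for a German user: umlauts sort with their base letters."""
    text = unicodedata.normalize("NFC", name).lower()
    # The unfolded text is only a tie-breaker between equal folded forms.
    return (text.translate(GERMAN_FOLD), text)


COLLATION_KEYS = {"sv": swedish_key, "de": german_key}


def sort_for_user(records, user_lang):
    """Sort names by the user's language; the record tags are not consulted."""
    key = COLLATION_KEYS[user_lang]
    return sorted((name for _tag, name in records), key=key)


# Each record keeps its own language tag, which stays useful for other purposes.
records = [
    ("sv", "Öberg"),
    ("de", "Zimmer"),
    ("de", "Äpfel"),
    ("sv", "Olsson"),
    ("de", "Adler"),
]

print(sort_for_user(records, "de"))
# ['Adler', 'Äpfel', 'Öberg', 'Olsson', 'Zimmer']
print(sort_for_user(records, "sv"))
# ['Adler', 'Olsson', 'Zimmer', 'Äpfel', 'Öberg']
```

The same mixed set of records comes out in two different orders, depending
only on who is asking.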

A Swedish user will expect Latin-script data to be sorted in the Swedish
way, and a German user in the German way. In general, for a given script,
the differences are slight, so only a slight correction should be required:
a file sorted beforehand for another language will not be completely wrong.
When choosing the re-sorting algorithm for a previously sorted file, one
can therefore assume that the data is already almost totally sorted. This
affects performance, as some algorithms are better suited to input that is
known to be nearly sorted. In fact, a file tag giving a clue that the data
is sorted according to a given set of tables would be helpful, although it
is of course never absolutely necessary; it is just a matter of performance
gain.
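
To make the performance remark concrete, here is a small Python sketch of
one algorithm that benefits from nearly sorted input: a plain insertion
sort that also counts its comparisons. This is only an assumed example of
such an algorithm, not something prescribed by ISO/IEC 14651, and the
integers below merely stand in for collation keys.

```python
def insertion_sort(items, key=lambda x: x):
    """Sort items in place and return the number of key comparisons made."""
    comparisons = 0
    for i in range(1, len(items)):
        current = items[i]
        current_key = key(current)
        j = i - 1
        # Shift larger elements one slot right until current fits.
        while j >= 0:
            comparisons += 1
            if key(items[j]) <= current_key:
                break
            items[j + 1] = items[j]
            j -= 1
        items[j + 1] = current
    return comparisons


# Integers stand in for collation keys in this sketch.
already_sorted = list(range(100))
nearly_sorted = list(range(100))
nearly_sorted[10], nearly_sorted[11] = nearly_sorted[11], nearly_sorted[10]
reversed_input = list(range(99, -1, -1))

print(insertion_sort(already_sorted))   # 99 comparisons (n - 1)
print(insertion_sort(nearly_sorted))    # 100 comparisons
print(insertion_sort(reversed_input))   # 4950 comparisons (n * (n - 1) / 2)
```

On input that is already in order the loop does one comparison per element,
while fully reversed input costs the quadratic worst case. Python's own
sorted() is likewise adaptive, so in practice it also gains from data that
is mostly in order.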

For scripts other than those known by a user, no modification needs to be
done at all. That's the spirit of the ISO/IEC 14651 International String
Ordering and Comparison project.

I am posting this to the Unicode list, as it is of general interest.

Alain LaBonté
Québec


