From: Philippe Verdy (email@example.com)
Date: Fri May 07 2004 - 18:49:38 CDT
From: "E. Keown" <firstname.lastname@example.org>
To: <email@example.com>; "Jony Rosenne" <firstname.lastname@example.org>
> > This could be solved by making Phoenician and Hebrew
> > base characters equivalent
> > at the first level of collation.
> Could this be translated and expanded into Basic
> Not-so-Geeky English???---Elaine
Collation is the process of converting strings into binary-comparable "collation
keys" (also known as "sort keys"). These are used to match words or sort them
according to a linguistic rule.
Unicode defines such a rule in a table of default collation keys (known as DUCET,
or "Default Unicode Collation Element Table"), which can be used to sort ALL
Unicode characters in a consistent way, and also as a base for tailoring the
collation order to specific languages, without needing to recreate the whole
collation table for all defined characters.
A collation key can be thought of, as a first approximation, as another code
substituted for each character. This works for some languages, but in fact many
languages need further refinements to control how elements collate relative to
each other. This first level allows sorting: A < B < C, or a < b < c, while also
grouping together related characters: a ~ A, b ~ B, and c ~ C.
This means that "AB" will sort after "aa" and in the same group as "Ab", by
ignoring ALL case differences in ALL characters.
However, for strings that sort in the same group, case distinction comes into
effect at a second level, after comparing all characters at the first level,
rather than while comparing characters one at a time.
To make this possible, characters are given collation keys whose first item is
the relative (numeric) order of groups at the first level, and next item is the
relative order of characters in that group.
So for example:
'a' => [1; 10], 'A' => [1; 11],
'b' => [2; 10], 'B' => [2; 11],
'c' => [3; 10], 'C' => [3; 11].
Sorting "aa", "AB", "Ab" and "Aba" means sorting strings of collation keys,
considering each dimension separately in successive passes:
"aa" => [1; 10],[1; 10] => (1, 1); (10, 10)
"AB" => [1; 11],[2; 11] => (1, 2); (11, 11)
"Ab" => [1; 11],[2; 10] => (1, 2); (11, 10)
"Aba" => [1; 11],[2; 10],[1; 10] => (1, 2, 1); (11, 10, 10)
Above, the second and third strings collate equally at the first level, with equal
keys (1, 2), but are distinct at the second level, with keys (11, 11) and (11, 10).
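The successive passes described above can be sketched in a few lines of Python. This is only an illustration using the toy weights of this example; the names WEIGHTS, level_keys and compare are made up for the sketch and are not part of any Unicode API.

```python
# Toy two-level weights from the example: (primary, secondary) per character.
# These values are illustrative only, not real DUCET weights.
WEIGHTS = {
    "a": (1, 10), "A": (1, 11),
    "b": (2, 10), "B": (2, 11),
    "c": (3, 10), "C": (3, 11),
}

def level_keys(s):
    """Return (primary weights, secondary weights) for the string s."""
    elements = [WEIGHTS[ch] for ch in s]
    primary = tuple(e[0] for e in elements)
    secondary = tuple(e[1] for e in elements)
    return primary, secondary

def compare(x, y):
    """Return (level, sign): the first level at which x and y differ
    (0 if they never differ), and -1, 0 or 1 for the ordering."""
    pairs = zip(level_keys(x), level_keys(y))
    for level, (kx, ky) in enumerate(pairs, start=1):
        if kx != ky:
            return level, (-1 if kx < ky else 1)
    return 0, 0

# Python compares tuples element by element, so sorting on
# (primary, secondary) performs the two passes in the right order.
ordered = sorted(["AB", "Aba", "aa", "Ab"], key=level_keys)
print(ordered)               # ['aa', 'Ab', 'AB', 'Aba']
print(compare("AB", "Ab"))   # (2, 1): equal at level 1, "AB" after "Ab" at level 2
```

Note that "Ab" sorts before "AB" here only because the toy weights give lowercase the smaller secondary value.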
To make things simpler, introduce a special collation key value that is lower
than all others (0 in the example below). Used as a terminator after each level
of the resulting collation string, it gives a simpler view of the collation
elements as a single vector of numeric values:
"aa" => (1, 1, 0, 10, 10, 0)
"AB" => (1, 2, 0, 11, 11, 0)
"Ab" => (1, 2, 0, 11, 10, 0)
"Aba" => (1, 2, 1, 0, 11, 10, 10, 0)
This simplifies things to get binary comparable vectors of numeric values. The
length of the vector depends on the length (in characters or collation elements)
of input strings, and on the number of levels considered.
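As a rough sketch of this flattening (again with the toy weights of the example, and with sort_key and LEVEL_END as names invented for the illustration), the single vector can be built like this in Python:

```python
# Toy two-level weights from the example (illustrative, not real DUCET values).
WEIGHTS = {
    "a": (1, 10), "A": (1, 11),
    "b": (2, 10), "B": (2, 11),
    "c": (3, 10), "C": (3, 11),
}

LEVEL_END = 0  # terminator: must be lower than every real weight

def sort_key(s, levels=2):
    """Flatten the per-level weights of s into one comparable vector,
    writing LEVEL_END after each level."""
    elements = [WEIGHTS[ch] for ch in s]
    key = []
    for level in range(levels):
        key.extend(e[level] for e in elements)
        key.append(LEVEL_END)
    return tuple(key)

print(sort_key("aa"))   # (1, 1, 0, 10, 10, 0)
print(sort_key("Ab"))   # (1, 2, 0, 11, 10, 0)
ordered = sorted(["AB", "Aba", "aa", "Ab"], key=sort_key)
print(ordered)          # ['aa', 'Ab', 'AB', 'Aba']
```

Because the terminator is lower than any weight, a shorter string whose first level is a prefix of a longer one ("Ab" against "Aba") still sorts first, before any second-level weight is looked at.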
Understand here that these collation keys are coordinates in a 2-dimensional
space, instead of just one unique code like code points. Some items may still
have the same coordinates (if considering only these two dimensions), for
example 'a' and 'à', or 'A' and 'À':
'à' => [1; 10], 'À' => [1; 11]
If you limit the collation to 2 levels, then there is no way to distinguish
'a' from 'à', which may be a problem if you want to get a stable sort,
because with only these keys they would be considered fully equal. So a
Unicode collation will append a final key element that just consists of the code
point value of each character in the source string (independently of the
collation elements considered). This is arbitrary (from a linguistic point of
view), but still respects the 2-level collation order by adding a pseudo third
level, so that the sort order of strings in random initial order becomes stable,
whatever the order in which they are presented to the sort algorithm.
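A minimal sketch of this tie-breaker, assuming the same toy weights extended with 'à' and 'À' (the function names are invented for the illustration, and the accented letters are written as precomposed code points):

```python
# Toy weights: the accented letters share the coordinates of the plain ones.
WEIGHTS = {
    "a": (1, 10), "A": (1, 11),
    "\u00e0": (1, 10),  # à
    "\u00c0": (1, 11),  # À
    "b": (2, 10), "B": (2, 11),
}

def two_level_key(s):
    """Primary weights, a 0 terminator, secondary weights, a 0 terminator."""
    elements = [WEIGHTS[ch] for ch in s]
    primary = tuple(e[0] for e in elements)
    secondary = tuple(e[1] for e in elements)
    return primary + (0,) + secondary + (0,)

def full_key(s):
    """Two-level key followed by the code points as a pseudo third level."""
    return two_level_key(s) + tuple(ord(ch) for ch in s)

# With two levels only, 'a' and 'à' are indistinguishable.
print(two_level_key("a") == two_level_key("\u00e0"))   # True

# With the code point level, the result no longer depends on input order.
first = sorted(["\u00e0b", "ab"], key=full_key)
second = sorted(["ab", "\u00e0b"], key=full_key)
print(first == second)                                  # True
```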
These collation rules can be given with some basic syntax, without specifying
the exact collation key values (count the number of "<" symbols to determine the
collation level):
a < b < c;
a << A;
b << B;
c << C;
a = à
which are easily combined into a single rule:
a = à << A < b << B < c << C
Read it arithmetically, with implied grouping as if these were operators with
priorities, where the lowest priority is for the primary collation level
indicated by "<" and the highest priority is for the last collation level set by
"=":
((a = à) << A) < (b << B) < (c << C)
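To show how such a rule maps back to numeric keys, here is a small hypothetical parser in Python. It assumes tokens separated by spaces, a first primary weight of 1 and a first secondary weight of 10 (as in the example keys above); parse_rule is a name made up for this sketch, not a real tailoring API.

```python
def parse_rule(rule):
    """Turn a rule such as 'a << A < b << B' into {char: (primary, secondary)}.

    '<'  starts a new primary group (and resets the secondary weight),
    '<<' adds a secondary difference inside the current group,
    '='  gives the character the same weights as the previous one.
    """
    tokens = rule.split()
    primary, secondary = 1, 10
    weights = {tokens[0]: (primary, secondary)}
    # After the first character, tokens alternate: operator, character.
    for op, ch in zip(tokens[1::2], tokens[2::2]):
        if op == "<":
            primary += 1
            secondary = 10
        elif op == "<<":
            secondary += 1
        elif op != "=":
            raise ValueError("unknown operator: " + op)
        weights[ch] = (primary, secondary)
    return weights

weights = parse_rule("a = \u00e0 << A < b << B < c << C")
print(weights["A"], weights["b"], weights["C"])   # (1, 11) (2, 10) (3, 11)
```

The result reproduces the key table given earlier, with 'à' receiving exactly the same pair as 'a'.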
-- Now, back to your initial question about the Geeky terms.
What was said above is that the 22 letters of Phoenician would compare equally
at first collation level with the corresponding 22 base letters of Hebrew,
because these 22 letters in Hebrew are comparable at this level (the 5 final
letter forms could be compared at this level too or at a secondary level,
depending on tailored linguistic rules).
So at first level, 'HEBREW ALEF' = 'PHOENICIAN ALEF' < 'HEBREW BET' =
'PHOENICIAN BET' < ...
This could be defined in the DUCET as the default collation order (and this
would be enough to make Hebrew readers of Phoenician happy.) Greek readers of
Phoenician could as well tailor their collation to match ALEF with ALPHA...
It is possible to do that without affecting the relative collation order of ANY
Hebrew-only string, by giving the Phoenician letters a secondary or tertiary
difference from their Hebrew counterparts rather than a primary difference, so
that a collation performed only at first level would group together the same
Phoenician words written either in the Phoenician script or in the Hebrew script
(provided that no additional Hebrew combining points or final forms are used in
the Hebrew transliteration of Phoenician).
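Here is a small Python sketch of that idea for the first two letters only. The weights are invented for the illustration, and the Phoenician code points used (U+10900 and U+10901) are those of the Phoenician block in current Unicode, which is an assumption relative to this discussion since the script was still only a proposal at the time.

```python
# First two letters of each script (Phoenician code points are an assumption,
# taken from the block as encoded in current Unicode).
HEBREW_ALEF, HEBREW_BET = "\u05d0", "\u05d1"
PHOENICIAN_ALF, PHOENICIAN_BET = "\U00010900", "\U00010901"

# Hypothetical weights: same primary weight per letter, and a secondary
# difference that separates the two scripts.
WEIGHTS = {
    HEBREW_ALEF: (1, 10), PHOENICIAN_ALF: (1, 11),
    HEBREW_BET: (2, 10), PHOENICIAN_BET: (2, 11),
}

def primary_key(s):
    """First-level key only: the script difference is ignored."""
    return tuple(WEIGHTS[ch][0] for ch in s)

def full_key(s):
    """First-level key, then second-level key."""
    return primary_key(s), tuple(WEIGHTS[ch][1] for ch in s)

hebrew_word = HEBREW_ALEF + HEBREW_BET
phoenician_word = PHOENICIAN_ALF + PHOENICIAN_BET

# At first level, the same word in either script gets the same key.
print(primary_key(hebrew_word) == primary_key(phoenician_word))   # True

# With the second level, the scripts are distinguished, and the relative
# order of the Hebrew letters among themselves does not change.
letters = [PHOENICIAN_BET, HEBREW_BET, PHOENICIAN_ALF, HEBREW_ALEF]
ordered = sorted(letters, key=full_key)
print(ordered == [HEBREW_ALEF, PHOENICIAN_ALF, HEBREW_BET, PHOENICIAN_BET])  # True
```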
Hope this helps.
This archive was generated by hypermail 2.1.5 : Fri May 07 2004 - 19:10:48 CDT