Re: Phoenician

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Fri May 07 2004 - 18:49:38 CDT

    From: "E. Keown" <k_isoetc@yahoo.com>
    To: <jcowan@reutershealth.com>; "Jony Rosenne" <rosennej@qsm.co.il>
    > > This could be solved by making Phoenician and Hebrew
    > > base characters equivalent
    > > at the first level of collation.
    >
    > Could this be translated and expanded into Basic
    > Not-so-Geeky English???---Elaine

    Collation is the process of converting strings into binary-comparable "collation
    keys" (also known as "sort keys"). These keys are used to match words or to sort
    them according to a linguistic rule.

    Unicode defines such a rule in a table of default collation elements (known as
    DUCET, the "Default Unicode Collation Element Table"). It can be used to sort ALL
    Unicode characters in a consistent way, and also as a base for tailoring the
    collation order to specific languages, without needing to recreate the whole
    collation table for all defined characters.

    A collation key can be thought of, as a first approximation, as another code
    substituted for each character. This works for some languages, but in fact many
    languages need further refinements to control how elements collate relative to
    each other. This first level allows sorting: A < B < C, or a < b < c, while also
    grouping together related characters: a ~ A, b ~ B, and c ~ C.
    This means that at this level "AB" sorts after "aa" and ties with "Ab", because
    ALL case differences in ALL characters are ignored.
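
As a rough illustration of this first level only, here is a minimal Python
sketch. The PRIMARY table and the primary_key function are hypothetical names
with invented weights; they are not taken from the DUCET.

```python
# Hypothetical first-level weights: one number per letter, case ignored.
PRIMARY = {"a": 1, "b": 2, "c": 3}

def primary_key(text):
    """Return the tuple of first-level weights for text, ignoring case."""
    return tuple(PRIMARY[ch.lower()] for ch in text)

# "AB" and "Ab" get the same key, and both come after "aa".
print(primary_key("aa"))  # (1, 1)
print(primary_key("AB"))  # (1, 2)
print(primary_key("Ab"))  # (1, 2)
```

With only this key, nothing decides between "AB" and "Ab"; that is what the
next level is for.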

    However, for strings that sort in the same group, case distinction comes into
    effect at a second level, which is consulted only after all characters have been
    compared at the first level, instead of comparing characters one by one.
    To make this possible, characters are given collation keys whose first item is
    the relative (numeric) order of their group at the first level, and whose next
    item is the relative order of the character within that group.

    So for example:

    'a' => [1; 10], 'A' => [1; 11],
    'b' => [2; 10], 'B' => [2; 11],
    'c' => [3; 10], 'C' => [3; 11].

    Sorting "aa", "AB", "Ab" means sorting strings of collation keys, considering
    each dimension separately in successive passes:
    "aa" => [1; 10],[1; 10] => (1, 1); (10, 10)
    "AB" => [1; 11],[2; 11] => (1, 2); (11, 11)
    "Ab" => [1; 11],[2; 10] => (1, 2); (11, 10)
    "Aba" => [1; 11],[2; 10],[1; 10] => (1, 2, 1); (11, 10, 10)
    Above, the second and third strings collate equally at the first level, with
    equal keys (1, 2), but are distinct at the second level, with keys (11, 11) and
    (11, 10).
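
The two passes can be sketched in Python by comparing a pair of tuples: all
first-level weights first, then all second-level weights. The WEIGHTS table
simply copies the example values given above; two_level_key is a hypothetical
helper name.

```python
# Example weights from the text: (first level, second level).
WEIGHTS = {
    "a": (1, 10), "A": (1, 11),
    "b": (2, 10), "B": (2, 11),
    "c": (3, 10), "C": (3, 11),
}

def two_level_key(text):
    """Return (all first-level weights, all second-level weights)."""
    primaries = tuple(WEIGHTS[ch][0] for ch in text)
    secondaries = tuple(WEIGHTS[ch][1] for ch in text)
    return (primaries, secondaries)

words = ["AB", "Ab", "aa"]
print(sorted(words, key=two_level_key))  # ['aa', 'Ab', 'AB']
```

Python compares the first tuple completely before looking at the second one,
which matches the idea that case only matters once the whole string ties at the
first level.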

    To make things simpler, introduce a special collation key value which is lower
    than all others (0 in the example below), and you get a simpler view of
    collation elements as a single vector of numeric values, if you use it as a
    terminator after each level in the resulting collation string:
    "aa" => (1, 1, 0, 10, 10, 0)
    "AB" => (1, 2, 0, 11, 11, 0)
    "Ab" => (1, 2, 0, 11, 10, 0)
    "Aba" => (1, 2, 1, 0, 11, 10, 10, 0)
    This simplifies things to get binary-comparable vectors of numeric values. The
    length of the vector depends on the length (in characters or collation elements)
    of the input strings, and on the number of levels considered.
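
A minimal Python sketch of this flattening follows, assuming every weight fits
in one byte so that the vector can be packed into a bytes object and compared
byte by byte. WEIGHTS repeats the example values; LEVEL_END and sort_key are
hypothetical names.

```python
# Example weights from the text: (first level, second level).
WEIGHTS = {
    "a": (1, 10), "A": (1, 11),
    "b": (2, 10), "B": (2, 11),
    "c": (3, 10), "C": (3, 11),
}
LEVEL_END = 0  # terminator, lower than every real weight

def sort_key(text):
    """Flatten both levels into one byte string, each level ending with 0."""
    primaries = [WEIGHTS[ch][0] for ch in text]
    secondaries = [WEIGHTS[ch][1] for ch in text]
    return bytes(primaries + [LEVEL_END] + secondaries + [LEVEL_END])

print(list(sort_key("aa")))  # [1, 1, 0, 10, 10, 0]
print(list(sort_key("AB")))  # [1, 2, 0, 11, 11, 0]
print(sorted(["AB", "Ab", "aa"], key=sort_key))  # ['aa', 'Ab', 'AB']
```

Because the terminator is lower than any weight, a plain byte-wise comparison
never lets the second level of one string be compared against the first level
of another.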

    Understand here that these collation keys are coordinates in a 2-dimensional
    space, instead of just one unique code like a code point. Some items may still
    have the same coordinates (if considering only these two dimensions), for
    example:
    '' => [1; 10], '' => [1; 11]

    If you limit the collation level to 2, then there is no way to distinguish
    'a' from '', which may be a problem if you want a stable sort, because with
    only these keys they would be considered fully equal. So a Unicode collation
    will append a final key element that just consists of the code point value of
    each character in the source string (independently of the collation elements
    considered). This is arbitrary (from a linguistic point of view), but still
    respects the 2-level collation order by adding a pseudo third level, so that
    the resulting order of strings is the same whatever the order in which they
    are presented to the sort algorithm.
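
The effect of that final code point element can be sketched in Python. This is
only an illustration: the weights are invented, and the fullwidth letter U+FF41
is an assumed stand-in for any character that shares both weights with 'a'.

```python
# Hypothetical weights. "\uff41" (fullwidth a) is an assumed stand-in for a
# character that has the same two-level weights as "a".
WEIGHTS = {"a": (1, 10), "\uff41": (1, 10), "b": (2, 10)}

def two_level_key(text):
    """Return (first-level weights, second-level weights)."""
    return (
        tuple(WEIGHTS[ch][0] for ch in text),
        tuple(WEIGHTS[ch][1] for ch in text),
    )

def deterministic_key(text):
    """Append the code points as a pseudo third level to break full ties."""
    return two_level_key(text) + (tuple(ord(ch) for ch in text),)

words = ["\uff41b", "ab"]
# With two levels the two strings tie, so the result follows the input order.
print(sorted(words, key=two_level_key) == sorted(words[::-1], key=two_level_key))
# With the code point level the result no longer depends on the input order.
print(sorted(words, key=deterministic_key) == sorted(words[::-1], key=deterministic_key))
```

The first comparison prints False and the second prints True: only the key
with the extra level gives the same order for every input order.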

    These collation rules can be given with some basic syntax, without specifying
    the exact collation key values (count the number of "<" symbols to determine the
    collation level):
        a < b < c;
        a << A;
        b << B;
        c << C;
        a =
    which are easily combined into a single rule:
        a = << A < b << B < c << C
    Read it arithmetically, with implied grouping as if these were operators with
    priorities, where the lowest priority is for the primary collation level
    indicated by "<" and the highest priority is for the last collation level set by
    "=":
        ((a = ) << A) < (b << B) < (c << C)
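
To show how such a rule could be turned into numeric keys, here is a small
Python sketch. It assumes the rule is written with spaces around every
operator, handles only "<", "<<" and "=", and starts second-level weights at 10
to match the earlier example; weights_from_rule is a hypothetical helper, not
part of any Unicode tool.

```python
def weights_from_rule(rule, first_secondary=10):
    """Assign (first level, second level) weights from a rule string.

    "<"  starts a new first-level group,
    "<<" adds a second-level difference inside the current group,
    "="  gives the next character the same weights as the previous one.
    """
    tokens = rule.split()
    primary, secondary = 1, first_secondary
    weights = {tokens[0]: (primary, secondary)}
    # Tokens alternate: character, operator, character, operator, ...
    for op, char in zip(tokens[1::2], tokens[2::2]):
        if op == "<":
            primary += 1
            secondary = first_secondary
        elif op == "<<":
            secondary += 1
        elif op != "=":
            raise ValueError(f"unknown operator: {op!r}")
        weights[char] = (primary, secondary)
    return weights

table = weights_from_rule("a << A < b << B < c << C")
print(table["a"], table["A"])  # (1, 10) (1, 11)
print(table["c"], table["C"])  # (3, 10) (3, 11)
```

Counting the "<" symbols is all the parser does: one symbol moves to the next
first-level group, two symbols move one step at the second level.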

    -- Now, back to your initial question about the Geeky terms.

    What was said above is that the 22 letters of Phoenician would compare equally
    at first collation level with the corresponding 22 base letters of Hebrew,
    because these 22 letters in Hebrew are comparable at this level (the 5 final
    letter forms could be compared at this level too or at a secondary level,
    depending on tailored linguistic rules).

    So at first level, 'HEBREW ALEF' = 'PHOENICIAN ALEF' < 'HEBREW BET' =
    'PHOENICIAN BET'.
    This could be defined in the DUCET as the default collation order (and this
    would be enough to make Hebrew readers of Phoenician happy.) Greek readers of
    Phoenician could as well tailor their collation to match ALEF with ALPHA...

    It is possible to do that without affecting the relative collation order of ANY
    Hebrew-only string, by giving the Phoenician letters a secondary or tertiary
    difference from their Hebrew counterparts rather than a primary difference. A
    collation performed only at the first level would then group together the same
    Phoenician words written either in the Phoenician script or in the Hebrew
    script (provided that no additional Hebrew combining points or final forms are
    used in the Hebrew transliteration of Phoenician words).
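
The proposal can be sketched in Python with letters identified by name rather
than by code point. The weights are invented for illustration, and key is a
hypothetical helper; only the pattern matters: same first-level weight,
different second-level weight.

```python
# Hypothetical weights: each Phoenician letter shares the first-level weight
# of its Hebrew counterpart and differs only at the second level.
WEIGHTS = {
    "HEBREW ALEF": (1, 10), "PHOENICIAN ALEF": (1, 11),
    "HEBREW BET": (2, 10), "PHOENICIAN BET": (2, 11),
}

def key(letters, levels=2):
    """Return the collation key of a word, truncated to the given level."""
    primaries = tuple(WEIGHTS[name][0] for name in letters)
    secondaries = tuple(WEIGHTS[name][1] for name in letters)
    return (primaries, secondaries)[:levels]

hebrew_word = ["HEBREW ALEF", "HEBREW BET"]
phoenician_word = ["PHOENICIAN ALEF", "PHOENICIAN BET"]

# At the first level the two spellings match.
print(key(hebrew_word, levels=1) == key(phoenician_word, levels=1))  # True
# At the second level the Hebrew spelling sorts first.
print(key(hebrew_word) < key(phoenician_word))  # True
```

Since every Hebrew letter keeps the same second-level weight, Hebrew-only
strings still compare with each other exactly as they would by first-level
weights alone.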

    Hope this helps.

    -- Philippe.



    This archive was generated by hypermail 2.1.5 : Fri May 07 2004 - 19:10:48 CDT