Re: Interleaved collation of related scripts

From: Peter Kirk
Date: Fri May 14 2004 - 08:35:09 CDT

    On 13/05/2004 14:33, Kenneth Whistler wrote:

    >Peter Kirk noted:
    >>>PS Multi-language bibliographies are common in Russian books. They are
    >>>usually printed with the Latin script entries following the Cyrillic
    >>>script ones, but I have seen interleaved ones.
    >Chris Jacobs noted:
    >>has an index in which greek and latin script are interleaved.
    >>The greek words are sorted according to their transliteration:
    >> ̔ sorts as h
    >>φ sorts as ph
    >These illustrate the typical situation with cross-script,
    >cross-language interfiling: They are *custom* solutions for
    >particular indexing problems. And they may involve issues of
    >transliteration or other adaptation to make like match with
    >like for the purposes of the people using the interfiled list.
    >Such tasks should *not* be attributed to the default collation
    >element table for the Unicode Collation Algorithm. ...

    I agree that such situations are typical of cross-script interfiling,
    and so I do not support any suggestion of including a general mechanism
    for this in the default collation table. This table is not the place to
    define general purpose transliteration schemes.
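
    To illustrate the kind of custom solution being discussed, here is a
    minimal sketch in Python of a sort key that interfiles Greek and Latin
    index entries by transliteration, with rough breathing sorting as h and
    φ as ph. The table GREEK_TO_LATIN and the function index_key are
    hypothetical names invented for this example; the transliteration
    scheme is an assumption, not the one used in the index Chris Jacobs
    describes.

```python
import unicodedata

# Hypothetical transliteration table, assumed for illustration only.
GREEK_TO_LATIN = {
    "α": "a", "β": "b", "γ": "g", "δ": "d", "ε": "e", "ζ": "z",
    "η": "e", "θ": "th", "ι": "i", "κ": "k", "λ": "l", "μ": "m",
    "ν": "n", "ξ": "x", "ο": "o", "π": "p", "ρ": "r", "σ": "s",
    "ς": "s", "τ": "t", "υ": "u", "φ": "ph", "χ": "ch", "ψ": "ps",
    "ω": "o",
}
ROUGH_BREATHING = "\u0314"  # COMBINING REVERSED COMMA ABOVE


def index_key(word):
    """Return a Latin-letter sort key for a single Greek or Latin word."""
    # Decompose so that accents and breathings become separate marks.
    decomposed = unicodedata.normalize("NFD", word.lower())
    rough = ROUGH_BREATHING in decomposed
    # Drop all combining marks and transliterate the base letters.
    letters = "".join(
        GREEK_TO_LATIN.get(ch, ch)
        for ch in decomposed
        if not unicodedata.combining(ch)
    )
    if rough:
        # Initial rho with rough breathing becomes "rh"; otherwise the
        # breathing contributes a leading "h".
        if letters.startswith("r"):
            letters = "rh" + letters[1:]
        else:
            letters = "h" + letters
    return letters


entries = ["physics", "φύσις", "harmony", "ἁρμονία", "geometry"]
interfiled = sorted(entries, key=index_key)
```

    A key like this belongs to one particular index. A different
    readership might prefer a different transliteration, which is why such
    schemes sit more naturally in a tailoring than in the default table.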

    But there is an exceptional issue within the family of north-west
    Semitic scripts, which may also apply to others, e.g. Greek, Coptic and
    archaic Greek, and possibly also the Indic scripts. Within these sets of
    scripts there is NO ambiguity about which characters correspond to
    which, as they have identical repertoires, possibly with additional
    letters in some of the scripts for which no equivalent can be defined in
    the other scripts. These are marginal cases where some users prefer
    disunification and others prefer unification. Furthermore, they are
    cases where texts originally in the same language and script are encoded
    in Unicode in a variety of scripts, both because of changes in Unicode
    (e.g. Coptic disunification) and because of different scholarly
    preferences.

    For such cases, in my opinion, a good case can be made for interfiling
    the scripts in the default algorithm. The major advantage of doing this
    is to allow integrated searching of text corpora in which texts have
    been encoded in more than one script.
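
    As a rough sketch of what such integrated searching could look like,
    the Python below folds Phoenician letters onto their Hebrew
    counterparts before comparing. It assumes that the Phoenician letters
    occupy U+10900..U+10915 in the traditional 22-letter order, and it also
    folds Hebrew final forms onto the ordinary letters. The names FOLD,
    search_key and find are hypothetical, chosen only for this example.

```python
# The 22 Hebrew letters alef..tav, without the five final forms.
HEBREW = (
    "\u05D0\u05D1\u05D2\u05D3\u05D4\u05D5\u05D6\u05D7\u05D8\u05D9\u05DB"
    "\u05DC\u05DE\u05E0\u05E1\u05E2\u05E4\u05E6\u05E7\u05E8\u05E9\u05EA"
)

# Hebrew final forms mapped to their ordinary letters.
FINALS = {
    "\u05DA": "\u05DB",  # final kaf
    "\u05DD": "\u05DE",  # final mem
    "\u05DF": "\u05E0",  # final nun
    "\u05E3": "\u05E4",  # final pe
    "\u05E5": "\u05E6",  # final tsadi
}

# Assumption: Phoenician letters at U+10900..U+10915, one per Hebrew letter.
FOLD = {0x10900 + i: letter for i, letter in enumerate(HEBREW)}
FOLD.update({ord(final): plain for final, plain in FINALS.items()})


def search_key(text):
    """Fold Phoenician letters and Hebrew finals onto plain Hebrew letters."""
    return text.translate(FOLD)


def find(corpus, query):
    """Return the documents that contain the query in either script."""
    folded_query = search_key(query)
    return [doc for doc in corpus if folded_query in search_key(doc)]


hebrew_mlk = "\u05DE\u05DC\u05DA"                  # mem, lamed, final kaf
phoenician_mlk = "\U0001090C\U0001090B\U0001090A"  # mem, lamd, kaf
corpus = [hebrew_mlk, phoenician_mlk, "unrelated text"]
hits = find(corpus, phoenician_mlk)
```

    Here a query typed in either script finds both encodings of the same
    word, which is the effect that interfiling at the collation level
    would give without each user having to write the folding themselves.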

    >Mike Ayers is on the right track here, I believe. The scenarios
    >which people are adducing in arguing for interfiling should
    >be addressed instead by appropriately designed normalizations --
    >which can be implemented using fairly easy-to-program,
    >reusable scripts. Then sort on the *normalized* data using
    >a much, much simpler collation table to accomplish what you

    Mike Ayers suggested that users should write Perl scripts. This is
    something which computer geeks may be able to do, but it is simply
    impossible for the rest of humanity including scholars of ancient
    languages. Perl is not "God's gift to academic researchers" in general,
    although it may be God's gift to computer geeks.

    The other problem with this is that the large corpora to be searched are
    not necessarily directly available to the users for normalisation. I
    can't normalise the whole Internet before doing a Google search for a
    Coptic or Phoenician word. What I need is a search engine which can (at
    least as a tailoring) collate together Coptic and Greek, Phoenician and

    Ken wrote separately, to Dean Snyder:

    >Nobody plans to take away your rights and ability to continue
    >doing what you now do, if it works very well for you. Please,
    >sir, continue doing what you are doing with your current data.

    Understood, and I note the smiley. But if some people continue to do
    what they are doing and others follow a new script, that is a recipe for
    confusion. The whole point of Unicode is to bring some consistency into
    the previous mess of different character encodings and masquerades. If
    the Unicode staff are now saying that it is OK to write Phoenician
    either with Hebrew characters masquerading as Phoenician or with the
    proposed Phoenician block, that opens the way to perpetuation of the
    confusion which existed before Unicode. It really would be far better,
    in the long run, if you said openly that anyone who continues to write
    Phoenician with Hebrew characters after the new block is accepted is
    wrong and breaking the standard, and should change their practices

    But then, if you said that, you would of course add a lot more fuel to
    the fire, and you would be forced to consider properly whether such
    proposals as the separate Phoenician script have consensus support from
    the majority of regular professional users of the script.

    Peter Kirk