Re: Interleaved collation of related scripts

From: Peter Kirk
Date: Thu May 13 2004 - 10:06:22 CDT

    On 13/05/2004 05:01, John Cowan wrote:

    >Peter Kirk scripsit:
    >>>I would have just as many objections to doing that as I would with
    >>>unifying it with Hebrew. Users don't expect this kind of interfiling
    >>>when looking things up in ordered lists. Interfiling of scripts
    >>>impedes legibility.
    >>Well, I see the point. But presumably the only people who would collate
    >>a text containing a mixture of Hebrew and Phoenician, for example, are
    >>those who know and understand both scripts. For anyone else this is a
    >>matter of garbage in, garbage out. So it should be up to these users to
    >>decide whether the legibility concern, which is a real one, is more
    >>important than their otherwise expressed preference for interfiling.
    >In addition, it's important to always remember that "collation" is a
    >cover term for both sorting *and* searching. Collating Hebrew with
    >"Phoenician" at the first level means that a search using Hebrew
    >letters will find "Phoenician" text as well.
    >(I am using horror quotes to remind people that Unicode "Phoenician"
    >includes many non-Punic 22CWSAs, particularly Palaeo-Hebrew.)
    >If indeed Serbs prefer collation equivalence between Cyrillic and
    >Latin (which can only be a tailored preference, of course; in general
    >we don't want to do that), this means not only that they will see
    >the two interfiled in a sorted list, but also that searching for a
    >Serbian word in Cyrillic will find it in Latin and vice versa.
    Thank you, John, for making the point which most others have missed.
    This issue is not primarily one of sorting, because multi-script
    individual texts are rather rare. The far more significant issue is
    searching, of a text corpus or for that matter of the whole Internet.
    Suppose I am looking for a Hebrew or Phoenician, or Serbian or
    Azerbaijani, text on the Internet. I don't know, and (if I can read both
    scripts) I don't care, which script it is in; I want to match the text
    either way. For such applications interleaved collation would be very
    helpful.
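
A minimal sketch of the kind of script-insensitive search described above, not a real collation implementation. It assumes the Phoenician letters sit at U+10900..U+10915 (the code points later assigned in Unicode) in the same alphabetical order as the 22 Hebrew base letters, and it folds Hebrew final forms onto their base letters as well. The names `primary_key` and `matches` are hypothetical helpers invented for this illustration.

```python
# Hebrew base letters alef..tav, skipping the five final forms.
HEBREW_BASE = [
    0x05D0, 0x05D1, 0x05D2, 0x05D3, 0x05D4, 0x05D5, 0x05D6, 0x05D7,
    0x05D8, 0x05D9, 0x05DB, 0x05DC, 0x05DE, 0x05E0, 0x05E1, 0x05E2,
    0x05E4, 0x05E6, 0x05E7, 0x05E8, 0x05E9, 0x05EA,
]

# Hebrew final forms mapped to their base letters (assumption of this
# sketch: finals are not distinguished at the first level).
HEBREW_FINALS = {
    0x05DA: 0x05DB,  # final kaf   -> kaf
    0x05DD: 0x05DE,  # final mem   -> mem
    0x05DF: 0x05E0,  # final nun   -> nun
    0x05E3: 0x05E4,  # final pe    -> pe
    0x05E5: 0x05E6,  # final tsadi -> tsadi
}

PHOENICIAN_START = 0x10900  # first of the 22 Phoenician letters

# Fold table: every Phoenician letter maps to the Hebrew letter in the
# same alphabetical position, so both scripts share one "primary" key.
FOLD = {PHOENICIAN_START + i: cp for i, cp in enumerate(HEBREW_BASE)}
FOLD.update(HEBREW_FINALS)


def primary_key(text):
    """Return a first-level key in which the script difference is ignored."""
    return text.translate(FOLD)


def matches(query, text):
    """True if the query occurs in the text once both are folded."""
    return primary_key(query) in primary_key(text)
```

With this table, a query typed in Hebrew letters (mem, lamed, final kaf) would match the same word written in Phoenician letters, which is the searching behaviour that first-level equivalence implies. A real system would use collation keys rather than a plain character fold.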

    I am not proposing interleaved collation of Latin and Cyrillic as a
    default, simply because each of the several languages which can be
    written in both scripts has a different transliteration scheme. So
    tailoring will be required to do this kind of searching for Serbian or
    Azerbaijani. But we have the chance to start afresh with Phoenician, and
    the correspondence between the Hebrew and Phoenician alphabets is

    Perhaps someone, some day, will produce an Internet search engine which
    accepts Unicode tailored collation. But I won't hold my breath.
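
To illustrate why Latin and Cyrillic need per-language tailoring, here is a rough sketch of a Serbian-only search key. The table follows the standard Serbian Cyrillic to Latin correspondence; other languages written in both scripts would need their own tables, which are not shown. The names `search_key` and `same_word` are hypothetical, and folding the digraphs lj, nj and dž this way can produce occasional false matches where Latin "nj" stands for two separate letters.

```python
# Serbian Cyrillic alphabet and its Latin counterparts, in the same order.
CYRILLIC = "абвгдђежзијклљмнњопрстћуфхцчџш"
LATIN = [
    "a", "b", "v", "g", "d", "đ", "e", "ž", "z", "i",
    "j", "k", "l", "lj", "m", "n", "nj", "o", "p", "r",
    "s", "t", "ć", "u", "f", "h", "c", "č", "dž", "š",
]

# Tailoring table for Serbian only (an assumption of this sketch).
SERBIAN = dict(zip(CYRILLIC, LATIN))


def search_key(text, table=SERBIAN):
    """Lower-case the text and fold Cyrillic letters to Latin ones."""
    return "".join(table.get(ch, ch) for ch in text.lower())


def same_word(a, b, table=SERBIAN):
    """True if two spellings are equal once the script is ignored."""
    return search_key(a, table) == search_key(b, table)
```

Under this tailoring a search for a Serbian word in Cyrillic would also find it in Latin and vice versa. The same table would give wrong results for another language, which is why this cannot be a default.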

    PS Multi-language bibliographies are common in Russian books. They are
    usually printed with the Latin script entries following the Cyrillic
    script ones, but I have seen interleaved ones.

    Peter Kirk
