Re: Aramaic unification and information retrieval

From: Kenneth Whistler (kenw@sybase.com)
Date: Mon Dec 22 2003 - 20:41:36 EST

    Peter Kirk said:

    > Anyway, I don't see the main purpose of
    > collation as producing lists of legible words, but rather as matching in
    > text and database searches.

    Collation is used for both purposes, of course. And there is nothing
    which requires you to use the same rules for sorting lists as for
    matching for searches.

    Just as a search might choose to ignore case, a search can be defined
    which would ignore specific script differences via a tailored
    weighting. Thus for instance you could, right now, choose to
    implement a tailoring of the UCA default tables which would
    give Syriac letters the same weights as [square] Hebrew letters.
    You could then turn a search using that collation weighting loose
    on a corpus of Aramaic data in both Hebrew and Syriac script and
    get the kind of cross-script matching for identical Aramaic
    "underlying forms" that you are looking for, I presume.
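
    A minimal sketch of that idea, in Python. Note the assumptions: this
    is a plain key-folding function standing in for a real UCA tailoring
    (it does not use ICU or the UCA tables at all), the letter
    correspondence table and the names match_key and find_matches are
    illustrative only, and folding Hebrew final forms and dropping
    combining marks is my rough stand-in for a primary-strength match.

```python
import unicodedata

# Assumed correspondence (illustrative only): the 22 basic Syriac letters
# paired with their square Hebrew counterparts, alaph/alef through taw/tav.
SYRIAC_TO_HEBREW = dict(zip(
    "\u0710\u0712\u0713\u0715\u0717\u0718\u0719\u071A\u071B\u071D\u071F"
    "\u0720\u0721\u0722\u0723\u0725\u0726\u0728\u0729\u072A\u072B\u072C",
    "\u05D0\u05D1\u05D2\u05D3\u05D4\u05D5\u05D6\u05D7\u05D8\u05D9\u05DB"
    "\u05DC\u05DE\u05E0\u05E1\u05E2\u05E4\u05E6\u05E7\u05E8\u05E9\u05EA",
))

# Hebrew final forms folded to the ordinary letter, since Syriac encodes
# no separate final forms for these letters.
HEBREW_FINALS = {
    "\u05DA": "\u05DB",  # final kaf   -> kaf
    "\u05DD": "\u05DE",  # final mem   -> mem
    "\u05DF": "\u05E0",  # final nun   -> nun
    "\u05E3": "\u05E4",  # final pe    -> pe
    "\u05E5": "\u05E6",  # final tsadi -> tsadi
}


def match_key(text):
    """Return a script-neutral key for matching (not for sorting)."""
    out = []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.category(ch) == "Mn":
            continue  # skip vowel points and other nonspacing marks
        ch = SYRIAC_TO_HEBREW.get(ch, ch)
        out.append(HEBREW_FINALS.get(ch, ch))
    return "".join(out)


def find_matches(query, corpus):
    """Return the corpus entries whose key equals the query's key."""
    key = match_key(query)
    return [word for word in corpus if match_key(word) == key]


hebrew_malka = "\u05DE\u05DC\u05DB\u05D0"  # mem lamed kaf alef
syriac_malka = "\u0721\u0720\u071F\u0710"  # mim lamadh kaph alaph
hits = find_matches(hebrew_malka, [syriac_malka, "\u072B\u0720\u0721"])
```

    Here a query in Hebrew script finds the same word stored in Syriac
    script, because both reduce to one key. A real tailoring would do
    this inside the collation weights rather than by rewriting the text.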

    Of course, none of that would be free out of the box from any OS,
    but with advanced tools like ICU it is not that difficult to
    create specialized collations along these lines and then use
    them to implement custom searches. It is a little more
    difficult to integrate them into off-the-shelf databases, but
    most databases implement some kind of capability for stored
    procedures, and you can create indexes off stored key fields
    that are built using such stored procedures. That should enable
    arbitrarily defined searching into data stores.
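
    As a rough illustration of the stored-key approach, here is a sketch
    using Python's built-in sqlite3 module. Everything specific here is
    an assumption for the sake of the example: the table and column
    names, the aramaic_key function (a simple letter-for-letter
    translation, not an ICU collation key), and the use of a
    user-defined SQL function as a stand-in for a stored procedure.

```python
import sqlite3

# Assumed letter correspondence (illustrative only): 22 basic Syriac
# letters and their square Hebrew counterparts, in the same order.
SYRIAC = (
    "\u0710\u0712\u0713\u0715\u0717\u0718\u0719\u071A\u071B\u071D\u071F"
    "\u0720\u0721\u0722\u0723\u0725\u0726\u0728\u0729\u072A\u072B\u072C"
)
HEBREW = (
    "\u05D0\u05D1\u05D2\u05D3\u05D4\u05D5\u05D6\u05D7\u05D8\u05D9\u05DB"
    "\u05DC\u05DE\u05E0\u05E1\u05E2\u05E4\u05E6\u05E7\u05E8\u05E9\u05EA"
)
FOLD = str.maketrans(SYRIAC, HEBREW)


def aramaic_key(text):
    """Hypothetical key function: rewrite Syriac letters as Hebrew ones."""
    return text.translate(FOLD)


conn = sqlite3.connect(":memory:")
# Expose the key function to SQL, playing the role of a stored procedure.
conn.create_function("aramaic_key", 1, aramaic_key)

# The key is stored in its own column and indexed, so searches compare
# precomputed keys instead of recomputing them for every row.
conn.execute(
    "CREATE TABLE words (id INTEGER PRIMARY KEY, form TEXT, match_key TEXT)"
)
conn.execute("CREATE INDEX words_by_key ON words (match_key)")


def add_word(form):
    """Insert a word together with its stored key."""
    conn.execute(
        "INSERT INTO words (form, match_key) VALUES (?, aramaic_key(?))",
        (form, form),
    )


def search(query):
    """Return stored forms, in either script, that share the query's key."""
    rows = conn.execute(
        "SELECT form FROM words WHERE match_key = aramaic_key(?) ORDER BY id",
        (query,),
    )
    return [row[0] for row in rows]


add_word("\u05DE\u05DC\u05DB\u05D0")  # Hebrew script: mem lamed kaf alef
add_word("\u0721\u0720\u071F\u0710")  # Syriac script: mim lamadh kaph alaph
add_word("\u05D1\u05E8")              # Hebrew script: bet resh
results = search("\u0721\u0720\u071F\u0710")
```

    The search for the Syriac spelling returns both stored spellings of
    that word and not the unrelated entry. A production database would
    use its own stored-procedure and index facilities in the same
    pattern.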

    > I think that it just might be acceptable to encode
    > the various ancient Semitic scripts separately if they are unified for
    > collation.

    As Michael indicated, separate scripts defined and encoded in
    the Unicode Standard will, in the default collation table, get
    separate primary weighting. That is the basic pattern followed
    in the table, and is the most conservative approach, since it
    does not presume removal of distinctions for the default.
    In my opinion, the structure of the collation table should not,
    however, be the main consideration which goes into determining:

    A. Whether a particular historic variant of some writing system
       should be separately encoded. (Meaning does the graphological
       analysis in the context of character encoding suggest that
       separate encoding makes more sense than unification with
       something else already encoded?)
       
    B. Whether, given a technical determination in (A) that a
       separate script encoding is warranted, it should be
       encoded at all. (Meaning is there any actual scholarly need
       for an encoding of that particular form, or would encoding
       simply be an exercise in script coverage completeness,
       without any actual application?)
       
    For "Aramaic", it isn't clear to me that we have consensus
    yet about either of these "shoulds".

    > But if you are saying that it must be all or nothing, I will
    > continue to fight on behalf of the users of these scripts for all of
    > what they want, rather than what you have apparently unilaterally (on
    > the basis of a book which describes glyph shape differences rather than
    > the systematic differences which really distinguish scripts) decided
    > that they ought to want and have written into your Roadmap.

    Them's fightin' words. Howzabout, as Michael suggested, we
    simply cool it a little about Aramaic? Ancient forms of Aramaic
    aren't going to be taken up anytime soon for any consideration
    for encoding. And the Roadmap cannot be taken as a predetermination
    of the eventual decisions in this regard, in my opinion.

    If there is, however, some consensus that Samaritan and
    Manichaean *do* deserve separate encoding consideration, how
    about pursuing encoding proposals for those as distinct
    scripts and then coming back around later to review
    the ancient forms once again after some more of the
    pieces have fallen into place?

    In the meantime, rather than harumphalating that Aramaic
    scholars are being confused by the Unicode Roadmap, I think
    it would serve everyone much better if someone knowledgeable
    about Aramaic scholars' text encoding needs and practices
    (you and others contributing to this discussion on the Hebrew
    list in particular?) would write up a "Guide to Best Practices for
    Aramaic Text Representation Using Unicode" and publish
    it as a Unicode Technical Note. Then people could refer to and
    be referred to *that*, instead of puzzling over a bunch of
    sketchy, possible script encoding assignments on the Roadmap
    which may or may not represent anything that will ever actually be
    encoded in this area.

    --Ken


