Re: Interleaved collation of related scripts

From: Kenneth Whistler (kenw@sybase.com)
Date: Thu May 13 2004 - 16:33:57 CDT


    Peter Kirk noted:

    > > PS Multi-language bibliographies are common in Russian books. They are
    > > usually printed with the Latin script entries following the Cyrillic
    > > script ones, but I have seen interleaved ones.

    Chris Jacobs noted:

    > has an index in which greek and latin script are interleaved.
    >
    > The greek words are sorted according to their transliteration:
    >
    > ̔ sorts as h
    > φ sorts as ph

    These illustrate the typical situation with cross-script,
    cross-language interfiling: They are *custom* solutions for
    particular indexing problems. And they may involve issues of
    transliteration or other adaptation to make like match with
    like for the purposes of the people using the interfiled list.

    Such tasks should *not* be attributed to the default collation
    element table for the Unicode Collation Algorithm. It is
    just inappropriate design, failing to separate functions
    into appropriate layers. Throwing too many requirements
    at the default table has at least two bad results:
    A. It makes the table itself more complex, which means that
    *all* implementations that use it have to handle additional
    complexity -- complexity that is irrelevant to all but the
    barest minority of specialized users of sorting.
    B. It makes it more difficult to figure out how to tailor
    and customize the base tables and their behavior for those
    instances where something really specialized actually *is*
    needed (such as the Greek and Latin index cited above).

    It is the same kind of error, in my opinion, as designing
    a language parser, for example, and then requiring that it
    handle character input in any encoding. If that task is
    attributed to the *lexer* itself, you end up with an
    unholy mess. The correct design is to use a correctly
    architected character set conversion module, convert all the
    input into Unicode, and design the lexer to handle Unicode
    character input.
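
    To make the layering concrete, here is a minimal Python sketch of
    that design. The file name, the legacy encoding, and the toy token
    rule are all illustrative assumptions, not taken from any actual
    parser: the point is only that the conversion layer decodes bytes
    into Unicode up front, and the lexer never sees anything but
    Unicode text.

        # Layered design sketch: decode legacy bytes first, then lex Unicode.
        # File name, encoding, and token rule are illustrative assumptions.
        import re

        def read_as_unicode(path, encoding):
            """Conversion layer: turn legacy-encoded bytes into a Unicode string."""
            with open(path, "rb") as f:
                return f.read().decode(encoding)   # e.g. "koi8-r", "cp1252", ...

        def lex(text):
            """Lexer layer: operates purely on Unicode, knows nothing of encodings."""
            # Toy rule: runs of word characters, or single non-space symbols.
            return re.findall(r"\w+|[^\w\s]", text)

        # Usage: the lexer does not care which encoding the source arrived in.
        tokens = lex(read_as_unicode("input.koi8r.txt", "koi8-r"))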

    Mike Ayers is on the right track here, I believe. The scenarios
    which people are adducing in arguing for interfiling should
    be addressed instead by appropriately designed normalizations --
    which can be implemented using fairly easy-to-program,
    reusable scripts. Then sort on the *normalized* data using
    a much, much simpler collation table to accomplish what you
    need.
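
    As a sketch of that approach -- with a deliberately tiny,
    illustrative transliteration table, not a complete or
    authoritative scheme -- one can pre-process each entry into a
    normalized key (transliterating the Greek into Latin, as in the
    index Chris cited) and then sort on those keys with an ordinary,
    simple comparison:

        # "Normalize first, sort second": map Greek entries to their Latin
        # transliteration keys, then interfile everything on those keys.
        # The table below is a tiny illustrative fragment only.
        GREEK_TO_LATIN = {
            "\u0314": "h",   # rough breathing sorts as "h" (per the index cited)
            "α": "a",
            "ι": "i",
            "λ": "l",
            "ο": "o",
            "σ": "s",
            "φ": "ph",       # phi sorts as "ph" (per the index cited)
        }

        def normalize(entry):
            """Map an entry to the transliterated key it should sort under."""
            return "".join(GREEK_TO_LATIN.get(ch, ch) for ch in entry)

        def interfile(entries):
            """Sort Greek and Latin entries together on their normalized keys."""
            return sorted(entries, key=normalize)

        # e.g. interfile(["philosophy", "φιλοσοφια", "apple"])
        # sorts the Greek word next to "philosophy" rather than after all
        # the Latin-script entries.

    Real entries with precomposed breathing marks would also need an
    NFD decomposition step before the table lookup, but the layering,
    not the table, is the point.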

    People expecting to import their particular normalization
    needs *into* the default collation element table, and thereby
    to get the behavior they want "for free", right off the shelf
    in Windows sorting APIs, are, in effect, doing harm to all
    users of the UCA, without actually buying themselves the
    flexibility they will need to accomplish the task in the
    end anyway.

    --Ken


