Re: Looking for transcription or transliteration standards latin->arabic

From: Marcin 'Qrczak' Kowalczyk (qrczak@knm.org.pl)
Date: Sat Jul 10 2004 - 03:02:22 CDT


    On Fri, 09-07-2004, at 19:34 -0700, Asmus Freytag wrote:

    > o-slash, can be analyzed as o and slash, even though that's not done
    > canonically in Unicode. Allowing users outside Scandinavia to perform
    > fuzzy searches for words with this character is useful.
    >
    > In this view of folding, Language-specific fuzzy searches would be tailored
    > (usually by being based on collation information, rather than on generic
    > diacritic folding).

    In Polish, the letters with diacritics (ĄĆĘŁŃÓŚŹŻ) are sorted after the
    corresponding letters without them. Omitting diacritics is an error, even
    though text without them is generally readable. They are removed when
    the given protocol requires or encourages ASCII (e.g. filenames to be
    used in URLs, login names, variable names in programming languages,
    ancient computer systems). There is no alternate spelling scheme like
    German AE/OE/UE/SS.

    Polish letters are never folded when sorting lexicographically. This
    applies to Ł in the same way as to the other eight letters. Foreign
    diacritics are always folded though, at least I don't remember seeing
    any other case. I think Ó would be folded together with O in an
    encyclopaedia if this is a foreign O with some accent, unrelated to
    Polish Ó which is a separate letter (can you suggest some non-Polish
    word starting with Ó which could be found in an encyclopaedia?).
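
    The sorting rule above can be sketched roughly as follows. This is only
    an illustration, not the real Polish tailoring: the names POLISH_ALPHABET,
    fold_foreign and polish_sort_key are made up for the example, Q/V/X are
    placed at their usual Latin positions as an assumption, and the code
    cannot tell a foreign Ó from a Polish one (it always treats Ó as the
    Polish letter).

```python
import unicodedata

# Polish letters in alphabetical order; each letter with a diacritic
# directly follows its base letter. Q, V, X are included for foreign words.
POLISH_ALPHABET = "aąbcćdeęfghijklłmnńoópqrsśtuvwxyzźż"
RANK = {ch: i for i, ch in enumerate(POLISH_ALPHABET)}


def fold_foreign(ch):
    """Keep Polish letters as they are; strip marks from anything else."""
    if ch in RANK:
        return ch
    decomposed = unicodedata.normalize("NFD", ch)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def polish_sort_key(word):
    """Sort key: Polish letters are distinct, foreign diacritics are folded."""
    key = []
    # NFC first, so that a Polish letter typed as base + combining mark
    # is recognised as a single letter.
    for ch in unicodedata.normalize("NFC", word.lower()):
        for c in fold_foreign(ch):
            # Characters outside the alphabet sort after all letters.
            key.append(RANK.get(c, len(RANK) + ord(c)))
    return key


words = ["łza", "lud", "lato", "ćma", "cud", "żaba", "zebra", "źrebak"]
ordered = sorted(words, key=polish_sort_key)
```

    With this key, "cud" comes before "ćma" and "lud" before "łza", while
    a foreign "é" gets the same key as a plain "e".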

    But there are cases when I would prefer to fold Polish diacritics in
    searches.

    It's basically every case when you are not sure that all stored data is
    using diacritics, for example in generic WWW searching. There are still
    people who don't use diacritics in usenet and email, or in entries in
    guest books and other "unprofessional" web content. There are even
    sometimes people who insist that Polish letters *should not* be used in
    usenet and email because some computer systems can't handle them.
    Diacritics are rare on IRC (because the IRC protocol doesn't distinguish
    between CP-1250, ISO-8859-2 and UTF-8) and with instant messengers
    (because of laziness). This is why for searching archives of unknown
    data it's generally better to fold them.

    As far as I know, the default UCA folds these letters except Ł, and
    standard Polish tailoring doesn't fold any Polish letter. While not
    folding them in searching is technically correct and nobody would be
    surprised that they are not folded, it's often more useful to fold them
    and people would be pleasantly surprised if they don't have to repeat
    the search with omitted diacritics.
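
    A minimal sketch of such folding for search, using Python's unicodedata
    module. Note that this uses canonical decomposition (NFD), not UCA
    itself, so it only illustrates why Ł is the odd one out: it has no
    canonical decomposition and needs an explicit mapping. The names
    strip_marks, fold_polish and EXTRA_FOLDS are hypothetical.

```python
import unicodedata

# Ł/ł do not decompose into L/l plus a combining mark, so they are
# mapped by hand (an assumption of this sketch, not part of any standard).
EXTRA_FOLDS = str.maketrans({"Ł": "L", "ł": "l"})


def strip_marks(text):
    """Remove combining marks after canonical decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def fold_polish(text):
    """Fold all nine Polish letters to their ASCII base letters."""
    return strip_marks(text).translate(EXTRA_FOLDS)


without_extra = strip_marks("ĄĆĘŁŃÓŚŹŻ")   # Ł survives decomposition
with_extra = fold_polish("ĄĆĘŁŃÓŚŹŻ")      # Ł is folded explicitly
```

    An index built over fold_polish(text), queried with fold_polish(query),
    would find a word whether or not the stored data uses diacritics.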

    If one wants to find data containing a word, rather than collect
    statistics about usage of a word with and without diacritics, folding
    very rarely does any harm.

    Hmm, it's not that simple. When I'm searching for JĘZYK (existing word),
    I will be happy to find occurrences of JEZYK too (non-existing word,
    must have had diacritics stripped), but it makes no sense to return
    JEŻYK (another existing word). It's not just making the letters
    equivalent.
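
    One possible way to express this asymmetry, as a sketch only: a letter
    in the query matches either itself or its stripped form in the data,
    but a letter in the data never gains a diacritic the query did not have.
    The function names fold and word_matches are hypothetical, and the rule
    itself is my assumption about what "not just making the letters
    equivalent" could mean in practice.

```python
import unicodedata

# ł has no canonical decomposition, so it is folded by hand.
FOLD_EXTRA = str.maketrans({"ł": "l"})


def fold(text):
    """Strip diacritics from lowercase text (including ł)."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.translate(FOLD_EXTRA)


def word_matches(query, candidate):
    """True if candidate equals query, possibly with diacritics dropped."""
    q = unicodedata.normalize("NFC", query.casefold())
    c = unicodedata.normalize("NFC", candidate.casefold())
    if len(q) != len(c):
        return False
    # Each candidate letter must be the query letter or its stripped form.
    return all(cc == qc or cc == fold(qc) for qc, cc in zip(q, c))


hit_exact = word_matches("JĘZYK", "JĘZYK")      # same word
hit_stripped = word_matches("JĘZYK", "JEZYK")   # diacritics were dropped
hit_other = word_matches("JĘZYK", "JEŻYK")      # a different word
```

    Under this rule a query typed without diacritics ("JEZYK") does not
    match "JĘZYK"; whether it should is a separate policy choice, which
    this sketch leaves strict.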

    -- 
       __("<         Marcin Kowalczyk
       \__/       qrczak@knm.org.pl
        ^^     http://qrnik.knm.org.pl/~qrczak/
    

