User Expectations for collation (was Re: Looking for transcription or transliteration standards latin->arabic)

From: Mark Davis (mark.davis@jtcsv.com)
Date: Mon Jul 12 2004 - 11:21:27 CDT


    These provide good examples. It would be interesting to see, of the people
    on the unicode@unicode.org list, how many non-Poles would expect to find the
    following orders:

    Ab < Ąb < Ac
    Eb < Ęb < Ec
    Ob < Ób < Oc

    Ce < Će < Cy
    Ne < Ńe < Ny
    Sa < Śa < Sy
    Za < Źa < Zy
    Za < Ża < Zy

    and either (a) or (b):

    a) La < Ła < Ly // interleaved
    b) La < Ly < Ła // non-interleaved
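
The two kinds of order above can be sketched with a two-level sort key. This is only an illustration and not the UCA algorithm: the names `base_letter` and `sort_key` and the `separate` parameter are made up for this example, and ties on the base letters are broken by plain code point order, which is a simplification. Letters listed in `separate` are treated as distinct letters that sort after their base letter (non-interleaved); all other diacritics only break ties (interleaved).

```python
import unicodedata

def base_letter(ch):
    """Return the lowercase letter with its diacritic removed."""
    ch = ch.lower()
    if ch == "ł":
        # Ł has no canonical decomposition, so it is mapped by hand.
        return "l"
    # NFD puts the base letter first, followed by any combining marks.
    return unicodedata.normalize("NFD", ch)[0]

def sort_key(word, separate=""):
    """Two-level key: base letters first, then the original spelling.

    Letters in `separate` rank as their own letters, right after the
    base letter, so words containing them do not interleave.
    """
    word = unicodedata.normalize("NFC", word).lower()
    primary = [(base_letter(ch), 1 if ch in separate else 0) for ch in word]
    # Simplified tie-break: compare the lowercase spelling by code point.
    return (primary, word)

# Interleaved: diacritics only break ties.
print(sorted(["Ac", "Ąb", "Ab"], key=sort_key))
print(sorted(["Ly", "Ła", "La"], key=sort_key))

# Non-interleaved for Ł only, as in option (b).
print(sorted(["Ly", "Ła", "La"], key=lambda w: sort_key(w, "ł")))

# All nine Polish letters treated as separate letters.
polish = "ąćęłńóśźż"
print(sorted(["Ac", "Ąb", "Ab"], key=lambda w: sort_key(w, polish)))
```

Passing all nine Polish letters in `separate` gives Ab < Ac < Ąb, the order described in the quoted message below; passing an empty string gives the fully interleaved order listed above.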

    Mark

    ----- Original Message -----
    From: "Marcin 'Qrczak' Kowalczyk" <qrczak@knm.org.pl>
    To: <unicode@unicode.org>
    Sent: Saturday, July 10, 2004 01:02
    Subject: Re: Looking for transcription or transliteration standards
    latin->arabic

    > In a message of Fri, 09-07-2004, at 19:34 -0700, Asmus Freytag wrote:
    >
    > > o-slash, can be analyzed as o and slash, even though that's not done
    > > canonically in Unicode. Allowing users outside Scandinavia to perform
    > > fuzzy searches for words with this character is useful.
    > >
    > > In this view of folding, language-specific fuzzy searches would be
    > > tailored (usually by being based on collation information, rather than
    > > on generic diacritic folding).
    >
    > In Polish, letters with diacritics (ĄĆĘŁŃÓŚŹŻ) are sorted after the
    > corresponding letters without. Omitting diacritics is an error, even
    > though text without them is generally readable. They are removed when
    > the given protocol requires or encourages ASCII (e.g. filenames to be
    > used in URLs, login names, variable names in programming languages,
    > ancient computer systems). There is no alternate spelling scheme like
    > German AE/OE/UE/SS.
    >
    > Polish letters are never folded when sorting lexicographically. This
    > applies to Ł in the same way as to the other eight letters. Foreign
    > diacritics are always folded though, at least I don't remember seeing
    > any other case. I think Ó would be folded together with O in an
    > encyclopaedia if this is a foreign O with some accent, unrelated to
    > Polish Ó which is a separate letter (can you suggest some non-Polish
    > word starting with Ó which could be found in an encyclopaedia?).
    >
    > But there are cases when I would prefer to fold Polish diacritics in
    > searches.
    >
    > It's basically every case when you are not sure that all stored data is
    > using diacritics, for example in generic WWW searching. There are still
    > people who don't use diacritics in usenet and email, or in entries in
    > guest books and other "unprofessional" web content. There are even
    > sometimes people who insist that Polish letters *should not* be used in
    > usenet and email because some computer systems can't handle them.
    > Diacritics are rare on IRC (because the IRC protocol doesn't distinguish
    > between CP-1250, ISO-8859-2 and UTF-8) and with instant messengers
    > (because of laziness). This is why for searching archives of unknown
    > data it's generally better to fold them.
    >
    > As far as I know, the default UCA folds these letters except Ł, and
    > standard Polish tailoring doesn't fold any Polish letter. While not
    > folding them in searching is technically correct and nobody would be
    > surprised that they are not folded, it's often more useful to fold them
    > and people would be pleasantly surprised if they don't have to repeat
    > the search with omitted diacritics.
    >
    > If one wants to find data containing a word, rather than collect
    > statistics about usage of a word with and without diacritics, it's very
    > rare that folding does any harm.
    >
    > Hmm, it's not that simple. When I'm searching for JĘZYK (existing word),
    > I will be happy to find occurrences of JEZYK too (non-existing word,
    > must have had diacritics stripped), but it makes no sense to return
    > JEŻYK (another existing word). It's not just making the letters
    > equivalent.
    >
    > --
    > __("< Marcin Kowalczyk
    > \__/ qrczak@knm.org.pl
    > ^^ http://qrnik.knm.org.pl/~qrczak/
    >
    >
    >
    >
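
The JĘZYK example at the end of the quoted message suggests a one-directional match rather than symmetric folding. A minimal sketch, assuming a word-by-word comparison: a stored letter matches a query letter if it is the same letter or that letter with its diacritic stripped. The function names are hypothetical, and the handling of Ł is done by hand because it has no canonical decomposition.

```python
import unicodedata

def strip_diacritics(word):
    """Symmetric folding: remove every diacritic and lowercase."""
    out = []
    for ch in unicodedata.normalize("NFC", word).lower():
        if ch == "ł":
            out.append("l")  # no canonical decomposition for Ł
        else:
            out.append(unicodedata.normalize("NFD", ch)[0])
    return "".join(out)

def word_matches(query, stored):
    """One-directional match: the stored word may have lost diacritics
    that the query has, but may not carry different ones."""
    query = unicodedata.normalize("NFC", query).lower()
    stored = unicodedata.normalize("NFC", stored).lower()
    if len(query) != len(stored):
        return False
    return all(s == q or s == strip_diacritics(q)
               for q, s in zip(query, stored))

print(word_matches("JĘZYK", "JĘZYK"))  # exact spelling
print(word_matches("JĘZYK", "JEZYK"))  # diacritics stripped in the data
print(word_matches("JĘZYK", "JEŻYK"))  # a different existing word

# Symmetric folding cannot tell the two words apart.
print(strip_diacritics("JĘZYK") == strip_diacritics("JEŻYK"))
```

Under this rule a query typed without diacritics matches only text that also lacks them; whether such a query should also find the accented spellings is a separate design choice that the message does not settle.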



    This archive was generated by hypermail 2.1.5 : Mon Jul 12 2004 - 11:22:06 CDT