From: Asmus Freytag (email@example.com)
Date: Mon Jul 12 2004 - 13:21:34 CDT
I missed Mark's change of subject, so I replied to Marcin's message just
now under the old subject line:
>----- Original Message -----
>From: "Marcin 'Qrczak' Kowalczyk" <firstname.lastname@example.org>
>Sent: Saturday, July 10, 2004 01:02
>Subject: Re: Looking for transcription or transliteration standards
> > On Fri, 09-07-2004 at 19:34 -0700, Asmus Freytag wrote:
> > > o-slash, can be analyzed as o and slash, even though that's not done
> > > canonically in Unicode. Allowing users outside Scandinavia to perform
> > > fuzzy searches for words with this character is useful.
> > >
> > > In this view of folding, language-specific fuzzy searches would be
> > > handled separately (usually by being based on collation information,
> > > rather than on generic diacritic folding).
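
To make the point about o-slash concrete, here is a minimal Python sketch of generic diacritic folding. It assumes that stripping combining marks after NFD normalization is an acceptable approximation of "generic" folding; the names fold_diacritics and EXTRA_FOLDS are hypothetical, and the extra table is only an illustration, not a complete list of letters that lack a canonical decomposition.

```python
import unicodedata

# Hypothetical supplement: letters that Unicode does not decompose
# canonically, so NFD alone leaves them untouched.
EXTRA_FOLDS = str.maketrans({"ø": "o", "Ø": "O", "ł": "l", "Ł": "L"})


def fold_diacritics(text: str) -> str:
    """Fold letters to their base form for fuzzy, language-neutral search."""
    # Split precomposed letters into base letter + combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks (non-zero canonical combining class).
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Handle the letters that have no canonical decomposition.
    return stripped.translate(EXTRA_FOLDS)
```

With this sketch, fold_diacritics("Jørgen") gives "Jorgen", whereas NFD normalization on its own leaves "ø" unchanged.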
> > In Polish letters with diacritics ĄĆĘŁŃÓŚŹŻ are sorted after the
> > corresponding letters without. Omitting diacritics is an error, even
> > though text without them is generally readable. They are removed when
> > the given protocol requires or encourages ASCII (e.g. filenames to be
> > used in URLs, login names, variable names in programming languages,
> > ancient computer systems). There is no alternate spelling scheme like
> > German AE/OE/UE/SS.
> > Polish letters are never folded when sorting lexicographically. This
> > applies to Ł in the same way as to the other eight letters. Foreign
> > diacritics are always folded though, at least I don't remember seeing
> > any other case. I think Ó would be folded together with O in an
> > encyclopaedia if this is a foreign O with some accent, unrelated to
> > Polish Ó which is a separate letter (can you suggest some non-Polish
> > word starting with Ó which could be found in an encyclopaedia?).
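
The sorting behaviour described above can be sketched as a sort key in Python. This is only an illustration under stated assumptions: the alphabet string (with q, v and x added for foreign words), the helper names and the treatment of unknown characters are hypothetical, and a real system would more likely use a tailored collation than a hand-written key. Note that the sketch works per character, so it cannot tell a foreign accented O from the Polish letter Ó, which is exactly the ambiguity raised above.

```python
import unicodedata

# Polish alphabet order: each letter with a diacritic sorts right after
# its plain counterpart. q, v, x are included for foreign words.
POLISH_ALPHABET = "aąbcćdeęfghijklłmnńoópqrsśtuvwxyzźż"
RANK = {ch: i for i, ch in enumerate(POLISH_ALPHABET)}


def fold_foreign(ch: str) -> str:
    """Keep Polish letters distinct; fold any other accented letter to its base."""
    if ch in RANK:
        return ch
    # First code point of the NFD form is the base letter, if there is one.
    return unicodedata.normalize("NFD", ch)[0]


def polish_sort_key(word: str) -> list:
    """Sort key that never folds Polish letters but folds foreign diacritics."""
    letters = [fold_foreign(ch) for ch in word.lower()]
    # Characters outside the alphabet (digits, punctuation) all rank last.
    return [RANK.get(ch, len(RANK)) for ch in letters]


words = ["łuk", "lut", "ósmy", "oko", "las", "ozon"]
ordered = sorted(words, key=polish_sort_key)
```

Here "łuk" lands after every word starting with plain "l", and "ósmy" after every word starting with plain "o", while a foreign letter such as "é" gets the same key as "e".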
> > But there are cases when I would prefer to fold Polish diacritics in
> > searches.
> > It's basically every case when you are not sure that all stored data is
> > using diacritics, for example in generic WWW searching. There are still
> > people who don't use diacritics in usenet and email, or in entries in
> > guest books and other "unprofessional" web content. There are even
> > sometimes people who insist that Polish letters *should not* be used in
> > usenet and email because some computer systems can't handle them.
> > Diacritics are rare on IRC (because the IRC protocol doesn't distinguish
> > between CP-1250, ISO-8859-2 and UTF-8) and with instant messengers
> > (because of laziness). This is why for searching archives of unknown
> > data it's generally better to fold them.
> > As far as I know, the default UCA folds these letters except Ł, and
> > standard Polish tailoring doesn't fold any Polish letter. While not
> > folding them in searching is technically correct and nobody would be
> > surprised that they are not folded, it's often more useful to fold them
> > and people would be pleasantly surprised if they don't have to repeat
> > the search with omitted diacritics.
> > If one wants to find data containing a word, rather than collect
> > statistics about usage of a word with and without diacritics, it's very
> > rare that folding does any harm.
> > Hmm, it's not that simple. When I'm searching for JĘZYK (existing word),
> > I will be happy to find occurrences of JEZYK too (non-existing word,
> > must have had diacritics stripped), but it makes no sense to return
> > JEŻYK (another existing word). It's not just making the letters
> > equivalent.
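
The asymmetry in the JĘZYK example can be sketched as a one-directional match: a letter in the query matches itself or its stripped form in the data, but never a different letter with a diacritic. The function names and the EXTRA_BASES table below are hypothetical, and the sketch leaves open how a query typed without diacritics should behave; here such letters match only themselves.

```python
import unicodedata

# Hypothetical supplement for letters without a canonical decomposition.
EXTRA_BASES = {"ł": "l"}


def base_letter(ch: str) -> str:
    """Return the letter with its diacritic removed."""
    if ch in EXTRA_BASES:
        return EXTRA_BASES[ch]
    return unicodedata.normalize("NFD", ch)[0]


def letter_matches(query_ch: str, text_ch: str) -> bool:
    """A query letter matches itself, or its stripped form in the data."""
    return text_ch == query_ch or text_ch == base_letter(query_ch)


def word_matches(query: str, text: str) -> bool:
    """Compare letter by letter after NFC normalization and lowercasing."""
    q = unicodedata.normalize("NFC", query).lower()
    t = unicodedata.normalize("NFC", text).lower()
    return len(q) == len(t) and all(letter_matches(a, b) for a, b in zip(q, t))
```

Under these assumptions, word_matches("JĘZYK", "JEZYK") is true because Ę may appear stripped as E, while word_matches("JĘZYK", "JEŻYK") is false because a plain Z in the query does not match Ż.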