From: Asmus Freytag (firstname.lastname@example.org)
Date: Mon Jul 12 2004 - 13:11:35 CDT
At 01:02 AM 7/10/2004, Marcin 'Qrczak' Kowalczyk wrote:
>But there are cases when I would prefer to fold Polish diacritics in
>It's basically every case when you are not sure that all stored data is
Or when you are unsure how it is spelled, for example, looking up a
personal or geographic name you are not familiar with.
The discussion started around the case where searching is not localized
(tailored) to the language, which, by definition, means that users will not
be familiar with the spelling of the items they are trying to retrieve.
>If one wants to find data containing a word, rather than collect
>statistics about usage of a word with and without diacritics, it's very
>rare that folding does some harm.
>Hmm, it's not that simple. When I'm searching for JĘZYK (existing word),
>I will be happy to find occurrences of JEZYK too (non-existing word,
>must have had diacritics stripped), but it makes no sense to return
>JEŻYK (another existing word). It's not just making the letters
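The asymmetry in the quoted example (JĘZYK should find JEZYK but not JEŻYK) can be sketched in Python. This is only a minimal sketch under stated assumptions, not anyone's actual implementation: the function names are hypothetical, and "stripping" is approximated by removing combining marks after canonical decomposition. Letters without a canonical decomposition (Polish Ł, for instance) would need an explicit mapping table that is not shown here.

```python
import unicodedata


def base_letter(ch: str) -> str:
    """Return ch with combining marks removed (hypothetical helper).

    Assumption: stripping a diacritic equals dropping the combining marks
    of the NFD form. Letters with no canonical decomposition are returned
    unchanged.
    """
    decomposed = unicodedata.normalize("NFD", ch)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped or ch


def char_matches(query_ch: str, text_ch: str) -> bool:
    """A stored letter matches if it is the query letter itself, or the
    query letter with its diacritic stripped. It never matches a letter
    carrying a different diacritic."""
    return text_ch == query_ch or text_ch == base_letter(query_ch)


def word_matches(query: str, text: str) -> bool:
    """Asymmetric match: text may have lost diacritics that query has,
    but may not have gained any that query lacks."""
    q = unicodedata.normalize("NFC", query)
    t = unicodedata.normalize("NFC", text)
    return len(q) == len(t) and all(char_matches(a, b) for a, b in zip(q, t))


for candidate in ["JĘZYK", "JEZYK", "JEŻYK"]:
    print(candidate, word_matches("JĘZYK", candidate))
```

Note the design choice: a query typed without diacritics matches only unaccented text under this rule. A real system would probably fall back to full folding in that case, since the user may simply not know the spelling.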
There are other types of searches than 'google'. One example is searches
for station names on services such as http://www.bahn.de. Unlike on
air-travel sites, the number of destinations (all across Europe, by the
way) is huge, as the site also includes commuter train services.
They've changed their search algorithm a number of times over the years,
but at one time, you could enter a destination without diacritics and it
would attempt to match that to the list of known station names. In case of
multiple hits it would give you a list to pick from. They also supported
alternative non-native names (such as Cologne). I haven't used it in a
while, so I don't know what they support today, but when I did, I found it
very useful in looking up destinations.
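A lookup of the kind described above (diacritic-insensitive input, alternative names, and a pick list when several stations match) might look roughly like the following Python sketch. It is not how bahn.de works or worked; the station list, the alias table, the prefix-matching rule and all function names are assumptions made purely for illustration.

```python
import unicodedata
from collections import defaultdict


def fold(s: str) -> str:
    """Fold case and drop combining marks (a simple, assumed folding)."""
    decomposed = unicodedata.normalize("NFD", s)
    bare = "".join(c for c in decomposed if not unicodedata.combining(c))
    return bare.casefold()


# Illustrative sample data only: station name -> alternative names.
STATIONS = {
    "Köln Hbf": ["Cologne"],
    "München Hbf": ["Munich"],
    "Münster (Westf) Hbf": [],
    "Munster (Örtze)": [],
}


def build_index(stations):
    """Map every folded name or alias to the station(s) it refers to."""
    index = defaultdict(set)
    for name, aliases in stations.items():
        for label in [name, *aliases]:
            index[fold(label)].add(name)
    return index


def lookup(query, index):
    """Return all stations whose folded name or alias starts with the
    folded query. More than one result means the user has to pick."""
    key = fold(query)
    hits = set()
    for folded, names in index.items():
        if folded.startswith(key):
            hits |= names
    return sorted(hits)


index = build_index(STATIONS)
print(lookup("koln", index))     # one station
print(lookup("Cologne", index))  # same station via its alias
print(lookup("munster", index))  # two candidates to choose from
```

Folding both the stored names and the query with the same function is what lets a user who types no diacritics at all still reach the right entry.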
I have a certain sympathy for the idea of designing UCA so that the
untailored *default* works for this kind of multilingual usage. However,
the other use of the DUCET is to be the most convenient base for applying
all tailorings. I have a certain sympathy for the position that claims that
there are important, but perhaps specialized or not economically powerful
classes of users that will not likely have access to a tailored UCA for
their language or writing system.
If that is really the case, i.e. appreciable numbers of smaller languages
would be able to survive without tailoring, then the alternative to fixing
the DUCET could be a separate publication of a common base tailoring for
multilingual data access. (A base tailoring would be applied before further
tailoring for a specific language).
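The layering suggested here (default table, then a common base tailoring, then an optional language tailoring on top) can be illustrated with a toy model in Python. This is not the UCA algorithm and the weights are not DUCET values: every number, table and function name below is invented for the sketch, including the assumption that the default gives "ł" its own primary weight.

```python
from typing import Dict, Tuple

Weight = Tuple[int, int]   # (primary, secondary), toy values only
Table = Dict[str, Weight]

# Toy "default" table. Assumption: "ł" is a distinct primary letter here.
DEFAULT_TABLE: Table = {
    "a": (100, 0),
    "l": (120, 0),
    "ł": (125, 0),
    "n": (140, 0),
}

# Hypothetical base tailoring for multilingual access:
# "ł" becomes a secondary variant of "l".
BASE_TAILORING: Table = {"ł": (120, 1)}

# Hypothetical language tailoring applied afterwards:
# "ł" is a separate primary letter again.
LANGUAGE_TAILORING: Table = {"ł": (125, 0)}


def apply_tailoring(table: Table, overrides: Table) -> Table:
    """Return a new table with the overrides layered on top."""
    tailored = dict(table)
    tailored.update(overrides)
    return tailored


def primary_key(word: str, table: Table) -> Tuple[int, ...]:
    """Key for primary-strength comparison (what a loose search uses)."""
    return tuple(table[ch][0] for ch in word)


def sort_key(word: str, table: Table):
    """Primary weights first, secondary weights as the tie-breaker."""
    return (primary_key(word, table), tuple(table[ch][1] for ch in word))


multilingual = apply_tailoring(DEFAULT_TABLE, BASE_TAILORING)
language = apply_tailoring(multilingual, LANGUAGE_TAILORING)

# With only the base tailoring, "łan" and "lan" are equal at primary strength.
print(primary_key("łan", multilingual) == primary_key("lan", multilingual))
# With the language tailoring layered on top, they differ again.
print(primary_key("łan", language) == primary_key("lan", language))
```

In this model the default table stays untouched as the base for everything, while users without a language-specific tailoring would still get the folded behaviour from the shared base tailoring.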