Re: Looking for transcription or transliteration standards latin->arabic

From: Asmus Freytag (asmusf@ix.netcom.com)
Date: Mon Jul 12 2004 - 13:11:35 CDT

    At 01:02 AM 7/10/2004, Marcin 'Qrczak' Kowalczyk wrote:
    >But there are cases when I would prefer to fold Polish diacritics in
    >searches.
    >
    >It's basically every case when you are not sure that all stored data is
    >using diacritics,

    Or when you are unsure how it is spelled, for example, looking up a
    personal or geographic name you are not familiar with.

    The discussion started around the case where searching is not localized
    (tailored) to the language, which, by definition means that users will not
    be familiar with the spelling of the items they are trying to retrieve.

    >If one wants to find data containing a word, rather than collect
    >statistics about usage of a word with and without diacritics, it's very
    >rare that folding does some harm.
    >
    >Hmm, it's not that simple. When I'm searching for JĘZYK (existing word),
    >I will be happy to find occurrences of JEZYK too (non-existing word,
    >must have had diacritics stripped), but it makes no sense to return
    >JEŻYK (another existing word). It's not just making the letters
    >equivalent.
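
    The distinction drawn in the quoted example can be sketched in a few
    lines of Python. This is only an illustration under stated assumptions:
    the function names are made up for this sketch, and "folding" is
    approximated by removing combining marks after NFD decomposition, which
    covers Ę and Ż but not letters such as Ł that have no decomposition.

```python
import unicodedata

def strip_marks(text: str) -> str:
    """Approximate diacritic folding: decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def symmetric_match(query: str, stored: str) -> bool:
    """Treat accented and unaccented letters as fully equivalent."""
    return strip_marks(query) == strip_marks(stored)

def asymmetric_match(query: str, stored: str) -> bool:
    """Hypothetical alternative: accept the exact spelling, or the spelling
    with every diacritic stripped, but not a different accented word."""
    query_nfc = unicodedata.normalize("NFC", query)
    stored_nfc = unicodedata.normalize("NFC", stored)
    return stored_nfc == query_nfc or stored_nfc == strip_marks(query_nfc)

print(symmetric_match("JĘZYK", "JEŻYK"))   # the unwanted hit from the example
print(asymmetric_match("JĘZYK", "JEZYK"))  # stripped form is still found
print(asymmetric_match("JĘZYK", "JEŻYK"))  # different word is rejected
```

    With symmetric folding both words collapse to JEZYK and so match each
    other; the asymmetric variant only accepts the stored form that could
    have come from stripping the query's own diacritics.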

    There are other types of searches than 'google'. One example is searches
    for station names on services such as http://www.bahn.de. Unlike on
    air-travel sites, the number of destinations (all across Europe, by the
    way) is huge, as the site also includes commuter train services.

    They've changed their search algorithm a number of times over the years,
    but at one time, you could enter a destination without diacritics and it
    would attempt to match that to the list of known station names. In case of
    multiple hits it would give you a list to pick from. They also supported
    alternative non-native names (such as Cologne). I haven't used it in a
    while, so I don't know what they support today, but when I did, I found it
    very useful in looking up destinations.
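
    A lookup of the kind described above might be sketched roughly as
    follows. Nothing here reflects how bahn.de actually works: the station
    list, the alias table, and the function names are invented sample data,
    and prefix matching on a folded key is just one plausible choice.

```python
import unicodedata

# Illustrative sample data only, not a real station database.
STATIONS = ["Köln Hbf", "München Hbf", "München Ost", "Zürich HB"]
# Non-native names mapped to the native spelling (assumed alias table).
ALIASES = {"cologne": "Köln", "munich": "München"}

def fold(text: str) -> str:
    """Drop combining marks and case distinctions to build a search key."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()

def find_stations(query: str) -> list[str]:
    """Return every known station whose folded name starts with the query.

    More than one result means the user is offered a list to pick from.
    """
    key = fold(query.strip())
    key = fold(ALIASES.get(key, key))  # replace a known alias by its native name
    return [name for name in STATIONS if fold(name).startswith(key)]

print(find_stations("Koln"))     # entered without diacritics
print(find_stations("munchen"))  # several hits to choose from
print(find_stations("Cologne"))  # alternative non-native name
```

    The stored names keep their diacritics; only the comparison key is
    folded, so the list shown to the user is still correctly spelled.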

    I have a certain sympathy for the idea of designing UCA so that the
    untailored *default* works for this kind of multilingual usage. However,
    the other use of the DUCET is to be the most convenient base for applying
    all tailorings. I also have a certain sympathy for the position that
    there are important, but perhaps specialized or not economically powerful,
    classes of users who are unlikely to have access to a tailored UCA for
    their language or writing system.

    If that is really the case, i.e. appreciable numbers of smaller languages
    would be able to survive without tailoring, then the alternative to fixing
    the DUCET could be a separate publication of a common base tailoring for
    multilingual data access. (A base tailoring would be applied before any
    further tailoring for a specific language.)
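
    The layering idea can be made concrete with a toy model. This is a
    sketch only: the weights below are invented and are not taken from the
    DUCET, real UCA keys have more levels, and the table and function names
    are assumptions made for the illustration.

```python
from collections import ChainMap

# Toy default table: character -> (primary weight, secondary weight).
DEFAULT = {"o": (15, 0), "ø": (16, 0), "p": (17, 0), "z": (27, 0)}
# Hypothetical base tailoring for multilingual access: ø differs from o
# only at the secondary level, so a primary-strength search finds both.
BASE_MULTILINGUAL = {"ø": (15, 1)}
# Hypothetical language tailoring layered on top: ø sorts after z.
DANISH = {"ø": (28, 0)}

def make_table(*tailorings):
    """Apply tailorings in order; later ones override earlier ones."""
    # ChainMap consults its first mapping first, so reverse the order.
    return ChainMap(*reversed(tailorings), DEFAULT)

def sort_key(word, table):
    """Two-level key: all primary weights, then all secondary weights."""
    weights = [table[ch] for ch in word]
    return ([p for p, _ in weights], [s for _, s in weights])

multilingual = make_table(BASE_MULTILINGUAL)
danish = make_table(BASE_MULTILINGUAL, DANISH)

print(sorted(["z", "ø", "o", "p"], key=lambda w: sort_key(w, multilingual)))
print(sorted(["z", "ø", "o", "p"], key=lambda w: sort_key(w, danish)))
```

    In this model the default table stays untouched, the base tailoring
    serves users who have nothing more specific, and a language tailoring
    only has to state where it departs from the base.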

    A./


