Re: character groupings in various languages

From: Kenneth Whistler (kenw@sybase.com)
Date: Fri May 16 2003 - 14:58:49 EDT

    Ben Dougall asked:

    > anyone? : uca and collation to ascertain various possible character
    > groupings / catagorisations that are specific to various specified
    > languages? to get some other matches, more than just an absolute match
    > or not absolute match?

    Use of the collation algorithm to do this is probably overkill.

    If you are looking to make arbitrary sets of equivalences, such
    as "all the consonants of English", then you should probably just
    write your own ad hoc foldings.
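
    As a rough illustration of what such an ad hoc folding could look
    like, here is a minimal Python sketch. The category table and the
    function names (english_groups, shared_groups) are made up for this
    example, and treating "y" as a consonant is an arbitrary assumption;
    none of this comes from Unicode or the UCA.

```python
# Hypothetical, hand-written category table for English only.
# Counting "y" as a consonant is an assumption of this sketch.
ENGLISH_VOWELS = frozenset("aeiou")
ENGLISH_CONSONANTS = frozenset("bcdfghjklmnpqrstvwxyz")


def english_groups(ch):
    """Return the set of ad hoc group names a single character belongs to."""
    groups = set()
    lower = ch.lower()
    if lower in ENGLISH_VOWELS:
        groups.add("vowel")
    elif lower in ENGLISH_CONSONANTS:
        groups.add("consonant")
    if ch.isupper():
        groups.add("uppercase")
    elif ch.islower():
        groups.add("lowercase")
    return groups


def shared_groups(a, b):
    """Return the group names that both characters have in common."""
    return english_groups(a) & english_groups(b)


print(sorted(shared_groups("F", "W")))  # ['consonant', 'uppercase']
```

    Each additional language would need its own table, built from
    whatever linguistic description you decide to trust.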

    >
    > am i on the right track there? or is there a better direction maybe?
    > i'm looking for a reasonably even coverage of the main languages.

    Don't expect to find such information in a character encoding
    standard or in the Unicode Collation Algorithm.

    The point of the character encoding standard is to encode all
    the characters of each *script*, so that it can be used to
    represent text in all the languages that use that script. The
    character encoding standard, per se, doesn't do any language-specific
    categorizations.
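
    To make the distinction concrete: the properties that ship with the
    standard are attached to characters, not to languages. The sketch
    below uses Python's standard unicodedata module to look up the
    General_Category of a few characters; the helper name
    general_category is just for this example. Note that no language is
    passed in anywhere, so the answer cannot vary by language.

```python
import unicodedata


def general_category(ch):
    """Return the Unicode General_Category code for one character."""
    return unicodedata.category(ch)


# 'Lu' = uppercase letter, 'Ll' = lowercase letter, 'Nd' = decimal digit.
for ch in ("F", "W", "f", "5"):
    print(repr(ch), general_category(ch), unicodedata.name(ch))

# 'F' and 'W' share a case category, but nothing here says "consonant".
print(general_category("F") == general_category("W"))  # True
```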

    The point of the Unicode Collation Algorithm is to provide a
    default sorting for Unicode, along with a generic tailoring
    mechanism to allow people to customize it to produce
    language-specific sorting according to cultural conventions.
    Again, the algorithm, per se, doesn't do any language-specific
    categorizations.
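
    The idea of "a default order plus a tailoring" can be sketched with
    a toy example. This is not the UCA, which works with multi-level
    collation weights; both key functions and the Swedish alphabet table
    below are simplified stand-ins invented for illustration. The point
    is only that the tailoring changes the *order*, and still tells you
    nothing about categories such as vowels or consonants.

```python
import unicodedata

# Swedish alphabet, with a-ring, a-diaeresis and o-diaeresis after "z".
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyz\u00e5\u00e4\u00f6"
SWEDISH_RANK = {ch: i for i, ch in enumerate(SWEDISH_ALPHABET)}


def default_key(word):
    """Toy default order: compare base letters first, accents second."""
    decomposed = unicodedata.normalize("NFD", word.lower())
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    return (base, decomposed)


def swedish_key(word):
    """Toy tailoring: rank letters by their position in the Swedish alphabet."""
    composed = unicodedata.normalize("NFC", word.lower())
    # Characters outside the table sort after every Swedish letter.
    return [SWEDISH_RANK.get(c, len(SWEDISH_ALPHABET) + ord(c)) for c in composed]


words = ["zebra", "äpple", "apa"]
print(sorted(words, key=default_key))  # accented word sorts next to "a" words
print(sorted(words, key=swedish_key))  # accented word sorts after "zebra"
```

    For real work you would use a conformant implementation rather than
    anything like the above.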

    >
    > just checking. thanks.
    >
    >
    > On Thursday, May 15, 2003, at 11:03 pm, Ben Dougall wrote:
    >
    > > would it be the uca / collation
    > > <http://www.unicode.org/unicode/reports/tr10/> that will allow me to
    > > do this? :
    > >
    > > having specified which language is being used, compare one character
    > > to another and find out which various groupings they may or may not
    > > share. such as comparing in english, an 'F' and 'W' would match on
    > > case (and consonants even). case catagories i'm sure don't exist in
    > > some other languages, but then i'm sure there are many other types of
    > > catagorisations in other languages that english doesn't have.
    > >
    > > i'd like to have access to any kind of character catagories /
    > > groupings that maybe applicable to whichever language is initially
    > > specified.

    You need to start looking up *linguistic* sources for that kind
    of information.

    If, on the other hand, what you are really after is just the list
    of characters needed to represent text in each language, there
    are various online sources to get you started, as reported
    several times on this list. For example, see Indrek Hein's site:

    http://www.eki.ee/itstandard/ladina/

    > >
    > > is it the uca that's what i need to look into for that type of thing?

    No, I don't think so.

    > >
    > >
    > > also i notice icu <http://oss.software.ibm.com/icu/> has a lot of
    > > collation stuff. how does that compare to unicode's collation?, (if
    > > collation is even what i'm after, that is). how is icu different from
    > > unicode's collation?

    ICU provides an implementation of the Unicode Collation Algorithm.
    It conforms *to* UTS #10.

    --Ken


