Re: character groupings in various languages

From: Ben Dougall
Date: Fri May 16 2003 - 17:32:09 EDT

    On Friday, May 16, 2003, at 07:58 pm, Kenneth Whistler wrote:

    > Ben Dougall asked:
    >> anyone? : uca and collation to ascertain various possible character
    >> groupings / categorisations that are specific to various specified
    >> languages? to get some other matches, more than just an absolute match
    >> or not absolute match?
    > Use of the collation algorithm to do this is probably overkill.

    why? it's quite important to me that i get accurate, across the board,
    hopefully numerous, full categorisations / groupings in various
    languages. < the more the better (so long as they're fairly well
    established, and well used groupings)

    > If you are looking to make arbitrary sets of equivalences, such
    > as "all the consonants of English", then you should probably just
    > write your own ad hoc foldings.

    i was just using english as an example because it's the only one i
    know. the types of categorisations you get in english are obviously
    only applicable to english (well, maybe some other languages too, but
    certainly not all). i'm after a good handful of categorisations, for
    many languages that may be used / specified at the start of running
    this comparison.


    can't think of any for that - numerical punctuation maybe . ,
    punctuation/symbolic/white space...? currency, separators, brackets???
    (getting on dodgy ground now)

    if i put my mind to it i'm sure i could do an ad hoc one for english,
    although punctuation and symbols i'm not so sure on. BUT i'm after as
    many languages as possible. at least the main ones. i haven't got a
    *chance* of doing that myself, and i'm sure that this sort of thing has
    already been done before and i was thinking, that "been done before
    thing" might be uca / collation? and if it's not, what is it? what's
    the official / proper name that i'm missing?

    again just to make clear - i've given an example of some types of
    groupings for english (and there may easily be other useful groupings
    for english chars that i've missed). i'm after the established
    groupings for the established languages. other languages may have many
    more character categorisations, on the other hand they may have less -
    i simply don't know. whatever the case, i'd like to get the tables
    and/or algorithms, i guess the form of them would be, to be able to
    find which character groupings a particular character is in for a
    particular language.
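
    As a rough illustration of the kind of lookup described above, here is
    a minimal sketch of a per-language grouping table. Everything in it is
    an assumption made for illustration: the `GROUPINGS` table, the
    language code "en", and the helper names `groups_for` and
    `shared_groups` are hypothetical, and the English vowel/consonant
    split is a hand-written ad hoc folding, not data from Unicode or any
    standard.

```python
# Hypothetical, hand-written grouping tables. The contents are
# illustrative assumptions only (for instance, "y" is simply filed
# under consonants here).
GROUPINGS = {
    "en": {
        "vowel": set("aeiou"),
        "consonant": set("bcdfghjklmnpqrstvwxyz"),
    },
}


def groups_for(char, lang):
    """Return the names of the groupings of `lang` that contain `char`."""
    tables = GROUPINGS.get(lang, {})  # unknown language: no groupings
    folded = char.lower()             # ignore case for membership
    return {name for name, members in tables.items() if folded in members}


def shared_groups(a, b, lang):
    """Return the groupings that both characters belong to in `lang`."""
    return groups_for(a, lang) & groups_for(b, lang)


print(shared_groups("F", "W", "en"))  # {'consonant'}
```

    Tables for other languages would have to come from linguistic
    sources; nothing in this sketch supplies them.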

    >> am i on the right track there? or is there a better direction maybe?
    >> i'm looking for a reasonably even coverage of the main languages.
    > Don't expect to find such information out of a character encoding
    > standard or out of the Unicode Collation Algorithm.
    > The point of the character encoding standard is to encode all
    > the characters of each *script*, so that it can be used to
    > represent text in all the languages that use that script. The
    > character encoding standard, per se, doesn't do any language-specific
    > categorizations.

    categorisations must be language specific. case, for example, can not
    apply to all languages (does it? i'm sure it doesn't). the language
    must be specified first, and then the categorisations take place within
    that language's rules (i would have thought).
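
    The point about case can be checked against the per-character
    properties that Unicode does publish. The sketch below uses Python's
    standard `unicodedata` module; the helper name `case_class` and the
    label "uncased" are assumptions for illustration. Note that these
    properties belong to characters, not to languages, so this shows that
    some scripts have no case distinction rather than giving any
    language-specific categorisation.

```python
import unicodedata


def case_class(char):
    """Classify a character by the case-related general categories."""
    cat = unicodedata.category(char)
    # Lu = uppercase letter, Ll = lowercase letter, Lt = titlecase letter.
    # Anything else is reported as "uncased" in this simplified sketch.
    return {"Lu": "upper", "Ll": "lower", "Lt": "title"}.get(cat, "uncased")


# Latin F and w, Hebrew alef (U+05D0), and a CJK ideograph (U+4E2D).
for ch in ["F", "w", "\u05d0", "\u4e2d"]:
    print(repr(ch), unicodedata.category(ch), case_class(ch))
```

    The Hebrew and CJK characters both carry the general category Lo
    ("other letter"), so they match neither "upper" nor "lower".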

    > The point of the Unicode Collation Algorithm is to provide a
    > default sorting for Unicode, along with a generic tailoring
    > mechanism to allow people to customize it to produce
    > language-specific sorting according to cultural conventions.
    > Again, the algorithm, per se, doesn't do any language-specific
    > categorizations.

    well, ok, thanks for the info - but something must do, right?
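
    To make the "default sorting plus tailoring" idea concrete, here is a
    toy sketch. It is not the Unicode Collation Algorithm: it uses a
    single-level string key instead of UCA's multi-level weights, and the
    names `default_key`, `swedish_key` and `SWEDISH_TAILORING` are
    hypothetical. The default key folds accented letters onto their base
    letters; the tailored key assumes the Swedish convention that å, ä, ö
    are separate letters sorted after z.

```python
import unicodedata


def default_key(word):
    """Default order: compare base letters only, ignoring accents."""
    decomposed = unicodedata.normalize("NFD", word.lower())
    # Drop combining marks so that "ä" compares like "a".
    return "".join(c for c in decomposed if not unicodedata.combining(c))


# Tailoring: map each special letter to a code point that sorts after "z".
SWEDISH_TAILORING = {"å": "{", "ä": "|", "ö": "}"}


def swedish_key(word):
    """Tailored order: å, ä, ö sort after z; everything else is default."""
    composed = unicodedata.normalize("NFC", word.lower())
    return "".join(SWEDISH_TAILORING.get(c, default_key(c)) for c in composed)


words = ["zebra", "äpple", "apa", "öl"]
print(sorted(words, key=default_key))  # ['apa', 'äpple', 'öl', 'zebra']
print(sorted(words, key=swedish_key))  # ['apa', 'zebra', 'äpple', 'öl']
```

    The tailoring changes only the ordering of the same characters; it
    does not say which grouping a character belongs to, which is the
    distinction Ken draws above.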

    thanks very much for the reply.

    >> just checking. thanks.
    >> On Thursday, May 15, 2003, at 11:03 pm, Ben Dougall wrote:
    >>> would it be the uca / collation that will allow me to do this? :
    >>> having specified which language is being used, compare one character
    >>> to another and find out which various groupings they may or may not
    >>> share. such as comparing in english, an 'F' and 'W' would match on
    >>> case (and consonants even). case categories i'm sure don't exist in
    >>> some other languages, but then i'm sure there are many other types of
    >>> categorisations in other languages that english doesn't have.
    >>> i'd like to have access to any kind of character categories /
    >>> groupings that may be applicable to whichever language is initially
    >>> specified.
    > You need to start looking up *linguistic* sources for that kind
    > of information.
    > Unless what you are really after is just the list of characters
    > needed to represent text for each language. In that case, then
    > there are various online sources to get you started, as reported
    > several times on this list. For example, see Indrek Hein's site:
    >>> is it the uca that's what i need to look into for that type of thing?
    > No, I don't think so.
    >>> also i notice icu has a lot of
    >>> collation stuff. how does that compare to unicode's collation?, (if
    >>> collation is even what i'm after, that is). how is icu different from
    >>> unicode's collation?
    > ICU provides an implementation of the Unicode Collation Algorithm.
    > It conforms *to* UTS #10.
    > --Ken

    This archive was generated by hypermail 2.1.5 : Fri May 16 2003 - 18:57:26 EDT