Re: character groupings in various languages

From: Kenneth Whistler (kenw@sybase.com)
Date: Fri May 16 2003 - 20:18:18 EDT


    Ben Dougall followed up:

    > >> anyone? : uca and collation to ascertain various possible character
    > >> groupings / categorisations that are specific to various specified
    > >> languages? to get some other matches, more than just an absolute match
    > >> or not absolute match?
    > >
    > > Use of the collation algorithm to do this is probably overkill.
    >
    > why? it's quite important to me that i get accurate, across the board,
    > hopefully numerous, full categorisations / groupings in various
    > languages. < the more the better (so long as they're fairly well
    > established, and well used groupings)

    With the clarification that Ben provided below, it is now finally
    becoming clear what he is after. The answer is:

       Character Properties
       
    And the source you need to investigate for that is the
    Unicode Character Database. See:

    http://www.unicode.org/Public/UNIDATA/UCD.html

    And then start digging into the particular data files which will
    provide you with extensive and detailed information about
    all kinds of character properties.

    Properties implicitly define sets of characters: all characters
    with property X. And it is those sets of characters that Ben
    has been groping towards in talking about "character groupings/
    categorizations".
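
    As a rough illustration of "properties define sets", here is a
    minimal sketch in Python. It assumes the standard-library
    unicodedata module (which is built from UCD data); the helper
    name chars_with_category and the default Latin-1 range are
    illustrative choices, not anything defined by the UCD itself.

```python
import unicodedata

def chars_with_category(category, start=0x0000, end=0x00FF):
    """Return the set of characters in [start, end] whose
    General_Category value equals `category` (e.g. "Nd", "Lu")."""
    return {
        chr(cp)
        for cp in range(start, end + 1)
        if unicodedata.category(chr(cp)) == category
    }

# "All characters with General_Category = Nd", restricted to Latin-1.
decimal_digits = chars_with_category("Nd")
print(sorted(decimal_digits))

# "All characters with General_Category = Lu", restricted to ASCII.
ascii_uppercase = chars_with_category("Lu", 0x00, 0x7F)
print("".join(sorted(ascii_uppercase)))
```

    Widening the range to the full code space gives the complete set
    for a property value; no language is specified anywhere.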

    And that is the answer as to why the Unicode Collation Algorithm
    is inappropriate. The UCA is all about defining collation weights and
    ordering strings; it is not about the definition of properties
    for characters.
     
    > i was just using english as an example because it's the only one i
    > know. the types of categorisations you get in english are obviously
    > only applicable to english (well, maybe some other languages too, but
    > certainly not all). i'm after a good handful of categorisations, for
    > many languages that may be used / specified at the start of running
    > this comparison.
    >
    > english:
    >
    > numerical/alphabetical/other
    > alphabetical:
    > upper/lower-case
    > consonants/vowels
    > numerical:
    > can't think of any for that - numerical punctuation maybe . ,
    > other:
    > punctuation/symbolic/white space...? currency, separators,
    > brackets??
    > ? (getting on dodgy ground now)

    This is what makes it clear that you are after character properties.
    This kind of stuff is classic CTYPE character classification, and
    the Unicode Character Database has all that and much more, in
    great detail.
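
    The groupings in the quoted list map fairly directly onto the
    General_Category property. The following is only a sketch,
    again assuming Python's unicodedata module; the bucket names in
    MAJOR_CLASS and the classify function are hypothetical labels
    chosen to mirror the list above. Note that "alphabetic" here
    simply means General_Category L*, which is a simplification of
    the separate Alphabetic property.

```python
import unicodedata

# Hypothetical coarse buckets keyed by the first letter of General_Category.
MAJOR_CLASS = {
    "L": "alphabetic",
    "M": "mark",
    "N": "numeric",
    "P": "punctuation",
    "S": "symbol",
    "Z": "separator",
    "C": "other",
}

def classify(ch):
    """Return a list of coarse labels for a single character."""
    cat = unicodedata.category(ch)
    labels = [MAJOR_CLASS[cat[0]]]
    if cat == "Lu":
        labels.append("uppercase")
    elif cat == "Ll":
        labels.append("lowercase")
    elif cat == "Sc":
        labels.append("currency")
    elif cat in ("Ps", "Pe"):
        labels.append("bracket")
    return labels

for ch in "A7$( ;":
    print(repr(ch), unicodedata.category(ch), classify(ch))
```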

    The disconnect here is that you are assuming that character classification
    is language-specific. That is not at all the assumption that lies
    behind the Unicode model of character properties. Just as the
    Unicode Standard defines a *universal* character encoding, it
    also assumes that the characters so encoded have
    discoverable and essentially universal properties. And those
    properties are enumerated in the Unicode Character Database.

    Think of it this way: there is nothing language-specific (or
    culturally conventional, for that matter) about the fact that
    U+0031 DIGIT ONE is a numeric digit and has the value one.
    It might be the case that my particular language doesn't
    ordinarily use '1' for numbers; I might prefer some other
    set of digits, e.g., the Myanmar digits for writing Burmese.
    But that fact has no bearing on the classification of
    U+0031 DIGIT ONE per se. The fact that a particular language
    doesn't use a character doesn't change its classification for
    some other use.
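
    To make the digit example concrete, a small sketch (assuming
    Python's unicodedata module; the describe helper is just an
    illustrative name) that looks up the same properties for
    U+0031 DIGIT ONE and U+1041 MYANMAR DIGIT ONE:

```python
import unicodedata

def describe(ch):
    """Return (name, General_Category, decimal digit value) for a character."""
    return (unicodedata.name(ch), unicodedata.category(ch), unicodedata.digit(ch))

for ch in ("\u0031", "\u1041"):
    name, category, value = describe(ch)
    print(f"U+{ord(ch):04X} {name}: category={category}, value={value}")
```

    Both characters come back as decimal digits with the value one,
    and nothing in the lookup asks which language is in use.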

    Consonants and vowels, as Ben noted in a later message, are
    *not* character properties, but have to do with the phonological
    status of units of writing for various languages. Those are
    an entirely different issue, and classification of characters
    as consonants versus vowels may not even be possible for
    some writing systems -- English is a fine example, since its
    writing system is so irregular in the use of letters.

    And casing is an issue of mapping *between* characters. That
    is mostly language-independent, but there are some
    language-specific conventions which apply to a few case
    mappings. Those are detailed in:

    http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt
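
    A short sketch of the difference, in Python. The built-in
    str.upper() applies the language-independent mappings; the
    Turkish dotted/dotless i convention is one of the
    language-specific cases listed in SpecialCasing.txt, and Python
    does not apply it by default. The upper_tr function and its
    translation table are hypothetical, written only to illustrate
    the idea, and are not a complete Turkish tailoring.

```python
# Language-independent mappings, as applied by Python's built-in str.upper().
print("straße".upper())   # one character maps to two
print("i".upper())        # default mapping: plain LATIN CAPITAL LETTER I

# Hypothetical Turkish tailoring: i -> U+0130 (dotted capital I),
# U+0131 (dotless small i) -> I. Everything else uses the default mapping.
_TURKISH_UPPER = {ord("i"): "\u0130", ord("\u0131"): "I"}

def upper_tr(text):
    """Uppercase `text` using the Turkish dotted/dotless i convention."""
    return text.translate(_TURKISH_UPPER).upper()

print(upper_tr("izmir"))
```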

    > if i put my mind to it i'm sure i could do an ad hoc one for english,
    > although punctuation and symbols i'm not so sure on. BUT i'm after as
    > many languages as possible. at least the main ones. i haven't got a
    > *chance* of doing that myself, and i'm sure that this sort of thing has
    > already been done before and i was thinking, that "been done before
    > thing" might be uca / collation? and if it's not, what is it? what's
    > the official / proper name that i'm missing?

    UCD (Unicode Character Database) :-)

    >
    > again just to make clear - i've given an example of some types of
    > groupings for english (and there may easily be other useful groupings
    > for english chars that i've missed). i'm after the established
    > groupings for the established languages. other languages may have many
    > more character categorisations, on the other hand they may have less -
    > i simply don't know. whatever the case, i'd like to get the tables
    > and/or algorithms, i guess the form of them would be, to be able to
    > find which character groupings a particular character is in for a
    > particular language.
    ...

    > categorisations must be language specific. case, for example, can not
    > apply to all languages (does it? i'm sure it doesn't). the language
    > must be specified first, and then the categorisations take place within
    > that language's rules (i would have thought).
    >

    This misconceives the problem, since it assumes that language
    identity is the high-order bit, and that character classifications
    are going to be different for every language.

    You *will* find edge cases, of course. For example, punctuation
    characters have different conventions of usage in different
    places, so that a ";" symbol might not be used the same way
    in one country as another. But even such issues are not
    really *language* issues so much as issues of typographical
    convention. They correlate only rather poorly with language. A whole
    series of languages might, for example, use French punctuation
    conventions, not because they have anything to do with the
    French language itself, but simply because they are spoken in
    former French colonies where book publishing was done by
    typographers used to French conventions.

    String ordering *is* an issue for which language-specific
    rules need to be established.

    Character classification, with few exceptions, is not.
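
    The contrast can be sketched with a toy example in Python. This
    is not the UCA and uses no real collation library: german_key
    and swedish_key are hypothetical, deliberately simplified sort
    keys that only capture one well-known difference (German
    dictionary order files "ä" with "a"; the Swedish alphabet
    places å, ä, ö after z).

```python
import unicodedata

def german_key(word):
    """Simplified German dictionary order: ignore diacritics."""
    decomposed = unicodedata.normalize("NFD", word.lower())
    return "".join(c for c in decomposed if not unicodedata.combining(c))

# Simplified Swedish alphabet; characters outside it raise ValueError.
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"

def swedish_key(word):
    """Simplified Swedish order: å, ä, ö are letters sorted after z."""
    return [SWEDISH_ALPHABET.index(c) for c in word.lower()]

words = ["zon", "äpple", "apa"]
print(sorted(words, key=german_key))
print(sorted(words, key=swedish_key))

# The classification of the character itself does not change:
print(unicodedata.category("ä"))
```

    The same three strings come out in a different order under the
    two keys, while the General_Category of "ä" is the same
    lowercase letter in both cases.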

    --Ken


