Re: character groupings in various languages

From: Kenneth Whistler (kenw@sybase.com)
Date: Fri May 16 2003 - 20:18:18 EDT

Next message: Jungshik Shin: "Re: Proposed Update of UTS #10: Unicode Collation Algorithm"

Previous message: Mark Davis: "Re: Proposed Update of UTS #10: Unicode Collation Algorithm"
Maybe in reply to: Ben Dougall: "character groupings in various languages"
Next in thread: Ben Dougall: "Re: character groupings in various languages"
Reply: Ben Dougall: "Re: character groupings in various languages"

Ben Dougall followed up:

> >> anyone? : uca and collation to ascertain various possible character
> >> groupings / catagorisations that are specific to various specified
> >> languages? to get some other matches, more than just an absolute match
> >> or not absolute match?
> >
> > Use of the collation algorithm to do this is probably overkill.
>
> why? it's quite important to me that i get accurate, across the board,
> hopefully numerous, full categorisations / groupings in various
> languages. < the more the better (so long as they're fairly well
> established, and well used groupings)

With the clarification that Ben provided below, it is now finally
becoming clear what he is after. The answer is:

Character Properties

And the source you need to investigate for that is the
Unicode Character Database. See:

http://www.unicode.org/Public/UNIDATA/UCD.html

And then start digging into the particular data files which will
provide you with extensive and detailed information about
all kinds of character properties.

Properties implicitly define sets of characters: all characters
with property X. And it is those sets of characters that Ben
has been groping towards in talking about "character groupings/
categorizations".
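As a rough illustration of "all characters with property X", here is a
minimal Python sketch. It assumes the standard-library unicodedata module,
which exposes only a subset of the UCD (whatever Unicode version ships with
the interpreter). The helper name chars_with_category is hypothetical.

```python
import sys
import unicodedata


def chars_with_category(category):
    """Return the set of all characters whose General_Category equals `category`.

    Hypothetical helper: it scans every code point, so it is slow but simple.
    """
    return {
        chr(cp)
        for cp in range(sys.maxunicode + 1)
        if unicodedata.category(chr(cp)) == category
    }


# "Nd" is the General_Category value for decimal digits.
decimal_digits = chars_with_category("Nd")

print("1" in decimal_digits)       # ASCII digit
print("\u1041" in decimal_digits)  # MYANMAR DIGIT ONE
print("a" in decimal_digits)       # a letter, so not in the set
```

The set is derived purely from the property value; no language is specified
anywhere.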

And that is the answer as to why the Unicode Collation Algorithm
is inappropriate. The UCA is all about defining collation weights and
ordering strings; it is not about the definition of properties
for characters.

> i was just using english as an example because it's the only one i
> know. the types of categorisations you get in english are obviously
> only applicable to english (well, maybe some other languages too, but
> certainly not all). i'm after a good handful of categorisations, for
> many languages that may be used / specified at the start of running
> this comparison.
>
> english:
>
> numerical/alphabetical/other
> alphabetical:
> upper/lower-case
> consonants/vowels
> numerical:
> can't think of any for that - numerical punctuation maybe . ,
> other:
> punctuation/symbolic/white space...? currency, seperators,
> brackets??
> ? (getting on dodgy ground now)

This is what makes it clear that you are after character properties.
This kind of stuff is classic CTYPE character classification, and
the Unicode Character Database has all that and much more, in
great detail.
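For a feel of what that classification looks like in practice, the sketch
below groups characters by the first letter of their General_Category value,
again assuming Python's unicodedata module. The MAJOR_CLASSES labels and the
classify function are illustrative names, not anything defined by the UCD.

```python
import unicodedata

# First letter of the General_Category value -> coarse class.
# The English labels are illustrative choices, not official names.
MAJOR_CLASSES = {
    "L": "letter",
    "M": "mark",
    "N": "number",
    "P": "punctuation",
    "S": "symbol",
    "Z": "separator",
    "C": "other",
}


def classify(ch):
    """Return a coarse class for a single character (hypothetical helper)."""
    return MAJOR_CLASSES[unicodedata.category(ch)[0]]


for ch in ("A", "a", "7", ";", "(", "$", " "):
    # The two-letter category carries the finer distinction, e.g.
    # Lu/Ll for upper/lower case, Ps for opening brackets, Sc for currency.
    print(repr(ch), unicodedata.category(ch), classify(ch))
```

The finer two-letter values cover most of the groupings in the list above
(case, currency, brackets, white space) without any per-language table.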

The disconnect here is that you are assuming that character classification
is language-specific. That is not at all the assumption that lies
behind the Unicode model of character properties. Just as the
Unicode Standard defines a *universal* character encoding, it
also assumes that the universal set of characters so encoded has
discoverable and essentially universal properties. And those
properties are enumerated in the Unicode Character Database.

Think of it this way: there is nothing language-specific (or
culturally conventional, for that matter) about the fact that
U+0031 DIGIT ONE is a numeric digit and has the value one.
It might be the case that my particular language doesn't
ordinarily use '1' for numbers -- I might prefer some other
set of digits, e.g., the Myanmar digits for writing Burmese.
But that fact has no bearing on the classification of
U+0031 DIGIT ONE per se. The fact that a particular language
doesn't use a character doesn't change its classification for
some other use.
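A small Python check of that point, assuming the standard-library
unicodedata module: the digit "one" from three different scripts reports the
same category and the same numeric value, with no language specified. The
variable name ones is just for illustration.

```python
import unicodedata

# DIGIT ONE, MYANMAR DIGIT ONE, ARABIC-INDIC DIGIT ONE
ones = ["\u0031", "\u1041", "\u0661"]

for ch in ones:
    # name(), category() and digit() are all lookups into UCD-derived data.
    print(
        "U+%04X" % ord(ch),
        unicodedata.name(ch),
        unicodedata.category(ch),
        unicodedata.digit(ch),
    )
```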

Consonants and vowels, as Ben noted in a later message, are
*not* character properties, but have to do with phonological
status of units of writing for various languages. Those are
an entirely different issue, and classification of characters
as consonants versus vowels may not even be possible for
some writing systems -- English is a fine example, since its
writing system is so irregular in the use of letters.

And casing is an issue of mapping *between* characters. That
is mostly language-independent, but there are some
language-specific conventions that apply to a few case
mappings. Those are detailed in:

http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt
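To make the distinction concrete, here is a hedged Python sketch. Python's
str.upper() applies the language-independent mappings (including
unconditional special cases such as sharp s), but it takes no language
argument. The function upper_for_language is a hypothetical helper that
handles only the Turkish and Azeri dotted-i rule for uppercasing; it is not
a full implementation of SpecialCasing.txt.

```python
def upper_for_language(text, lang=None):
    """Uppercase `text`, applying one language-specific rule (hypothetical helper).

    Only the Turkish/Azeri rule is sketched here: lowercase "i" maps to
    U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE instead of plain "I".
    """
    if lang in ("tr", "az"):
        text = text.replace("i", "\u0130")
    return text.upper()


# Language-independent special casing: one character becomes two.
print("stra\u00dfe".upper())

# Same input string, different result once a language is specified.
print(upper_for_language("istanbul"))
print(upper_for_language("istanbul", "tr"))
```

Everything outside that one branch is the same for every language, which is
the sense in which casing is "mostly language-independent".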

> if i put my mind to it i'm sure i could do an ad hoc one for english,
> although punctuation and symbols i'm not so sure on. BUT i'm after as
> many languages as possible. at least the main ones. i haven't got a
> *chance* of doing that myself, and i'm sure that this sort of thing has
> already been done before and i was thinking, that "been done before
> thing" might be uca / collation? and if it's not, what is it? what's
> the official / proper name that i'm missing?

UCD (Unicode Character Database) :-)

>
> again just to make clear - i've given an example of some types of
> groupings for english (and there may easily be other useful groupings
> for english chars that i've missed). i'm after the established
> groupings for the established languages. other languages may have many
> more character categorisations, on the other hand they may have less -
> i simply don't know. whatever the case, i'd like to get the tables
> and/or algorithms, i guess the form of them would be, to be able to
> find which character groupings a particular character is in for a
> particular language.
...

> categorisations must be language specific. case, for example, can not
> apply to all languages (does it? i'm sure it doesn't). the language
> must be specified first, and then the categorisations take place within
> that languages rules (i would have thought).
>

This misconceives the problem, since it assumes that language
identity is the high-order bit, and that character classifications
are going to be different for every language.

You *will* find edge cases, of course. For example, punctuation
characters have different conventions of usage in different
places, so that a ";" symbol might not be used the same way
in one country as another. But even such issues are not
really *language* issues so much as issues of typographical
convention. They correlate only rather poorly with language. A whole
series of languages might, for example, use French punctuation
conventions, not because they have anything to do with the
French language itself, but simply because they are spoken in
former French colonies where book publishing was done by
typographers used to French conventions.

String ordering *is* an issue for which language-specific
rules need to be established.

Character classification, with few exceptions, is not.
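The contrast can be sketched in Python. This is a toy, not the UCA: the
Swedish alphabet string and the "ignore diacritics" default are hand-written
assumptions for illustration, and sort_key is a hypothetical helper. Real
tailorings come from collation data, not from code like this.

```python
import unicodedata

# Hand-written for illustration: in Swedish, a-ring, a-umlaut and o-umlaut
# are separate letters that sort after z.
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyz\u00e5\u00e4\u00f6"


def sort_key(word, lang):
    """Return a toy sort key for `word` under the conventions of `lang`."""
    word = unicodedata.normalize("NFC", word.lower())
    if lang == "sv":
        # Characters outside the alphabet get -1 and sort first.
        return [SWEDISH_ALPHABET.find(c) for c in word]
    # Default used here for German-style dictionary order: drop diacritics.
    decomposed = unicodedata.normalize("NFD", word)
    return [ord(c) for c in decomposed if not unicodedata.combining(c)]


words = ["zebra", "\u00e4pple", "apa"]

swedish = sorted(words, key=lambda w: sort_key(w, "sv"))
german_style = sorted(words, key=lambda w: sort_key(w, "de"))

print(swedish)
print(german_style)

# The classification of the character itself is the same in both cases.
print(unicodedata.category("\u00e4"))
```

The ordering changes with the language; the property lookup on the last line
does not.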

--Ken

This archive was generated by hypermail 2.1.5 : Fri May 16 2003 - 21:02:02 EDT