Re: "ch" as in yecch

From: Mark E. Davis
Date: Mon Oct 25 1999 - 12:28:24 EDT

I think this is a bit hasty. There are two related concepts.

- Locale-independent grapheme. Common across scripts. Characterized in the standard.
- Locale-dependent grapheme. May be one or more locale-independent graphemes. Specific to locale. (The
Unicode standard does not typically define locale-specific data.)

As always, one has to look at the particular process to see whether a concept is relevant. I've found that
many times people think that if you have a concept like grapheme, you *always* have to use APIs and
processing in terms of them. We've found quite the opposite; most of the time you don't have to worry about
them. I have already mentioned a few cases where you do, such as making sure that the results of a search
fall on grapheme boundaries. For that, you can use a boundary-detection API (like BreakIterator).
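To illustrate the search case, here is a minimal sketch in Java, assuming the BreakIterator meant is
java.text.BreakIterator and that its character instance is an acceptable stand-in for locale-independent
grapheme boundaries. The class and method names are made up for the example.

```java
import java.text.BreakIterator;
import java.util.Locale;

// Hypothetical helper, not part of any library.
final class GraphemeBoundaryCheck {

    // Returns true if the match [start, end) begins and ends on
    // character-break boundaries of the given text.
    static boolean isOnGraphemeBoundaries(String text, int start, int end) {
        BreakIterator it = BreakIterator.getCharacterInstance(Locale.ROOT);
        it.setText(text);
        return it.isBoundary(start) && it.isBoundary(end);
    }

    public static void main(String[] args) {
        // "resume" with each accented e stored as 'e' + U+0301 COMBINING ACUTE ACCENT.
        String text = "re\u0301sume\u0301";

        // A naive code-unit search for "e" matches [1, 2), which cuts the
        // base letter off from its combining accent.
        int start = text.indexOf('e');
        int end = start + 1;
        System.out.println(isOnGraphemeBoundaries(text, start, end));

        // Extending the match over the accent gives [1, 3).
        System.out.println(isOnGraphemeBoundaries(text, start, end + 1));
    }
}
```

The search itself stays a plain code-unit search; the boundary check is only a filter applied to its
results, which is the sense in which most processing does not need to know about graphemes.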

Locale-dependent graphemes are even more limited in scope; they don't necessarily affect anything but
collation (and processes that depend on collation order). Unlike the locale-independent graphemes, they
typically don't change rendering, they generally don't change case mappings, etc. Look at vs. dz. If
justification is turned on, we might letter-space dz, but never . If written in a vertical, non-rotated
environment (the Hotel-sign), we may or may not stack d and z, but never . When we titlecase dz, we do it
as if it were two letters: Dz, etc. Now of course, there are gray areas--some locale-dependent graphemes
like IJ that act more like locale-independent graphemes--but in general locale-dependent graphemes are only
relevant to collation, and perhaps keyboards (although keyboards can have sequences that are not graphemes,
also!). Except in special circumstances, you typically process locale-dependent graphemes just like
separate letters.
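As a rough sketch of how a locale-dependent grapheme can be confined to collation, the Java example below
appends a tailoring rule to an existing collator so that "ch" sorts as a unit after "h". It assumes that
Collator.getInstance returns a java.text.RuleBasedCollator (true of the usual JDK implementations, but not
guaranteed by the API); the class name, method name, and sample words are made up for the example.

```java
import java.text.Collator;
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.Locale;

// Hypothetical helper, not part of any library.
final class ChTailoring {

    // Builds a collator that treats "ch" as a single collation element
    // sorting after "h". Assumes the base collator is rule-based.
    static RuleBasedCollator withChAfterH() {
        RuleBasedCollator base =
                (RuleBasedCollator) Collator.getInstance(Locale.ENGLISH);
        try {
            // "& h < ch" resets at h and places ch after it at the primary
            // level; the comma-separated case variants differ only at the
            // tertiary level.
            return new RuleBasedCollator(base.getRules() + "& h < ch, cH, Ch, CH");
        } catch (ParseException e) {
            throw new IllegalStateException("tailoring rule did not parse", e);
        }
    }

    public static void main(String[] args) {
        Collator plain = Collator.getInstance(Locale.ENGLISH);
        Collator tailored = withChAfterH();

        // Sign of the comparison without and with the tailoring.
        System.out.println(Integer.signum(plain.compare("chata", "hrad")));
        System.out.println(Integer.signum(tailored.compare("chata", "hrad")));
    }
}
```

Only the comparison changes. The string "chata" is still five separate letters for every other process,
such as case mapping or cursor movement.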


Tex Texin wrote:

> John,
> As we do in other areas of internationalization, there can be an
> algorithm that anticipates the variety and makes customization for
> locales very easy. As there is no concept of GetNextGrapheme
> that incorporates "ch", and no set of properties defined for
> such an object, I feel that is lacking in the standard.
> There are allowances throughout the standard certainly, but the tactics
> I think could be more clearly defined.
> tex
> John Cowan wrote:
> >
> > Tex Texin scripsit:
> >
> > > I suddenly feel thrown back into the multi-codepage
> > > quagmire of researching, researching, researching, and probably
> > > continually revamping and re-generalizing my software to accommodate
> > > new character types as I uncover them.
> >
> > There's no getting away from this. Some users of Latin treat "ch" as
> > unitary, and others don't. Some users of Cyrillic treat various
> > 2, 3, or 4-letter strings as unitary, and others don't. Localization
> > has to take this behavior into account.
> >
> > --
> > John Cowan
> > I am a member of a civilization. --David Brin
> --
> Progress Software: The #1 Embedded Database
> -------------------------------------------------------------------------------------------------------
> Tex Texin Director, International Products
> Progress Software Corp. Voice: +1-781-280-4271
> 14 Oak Park Fax: +1-781-280-4949
> Bedford, MA 01730 USA
> -------------------------------------------------------------------------------------------------------
