(long) Making orthographies computer-ready (was *not* Telephoning Tamil)

From: Addison Phillips [wM] (aphillips@webmethods.com)
Date: Mon Jul 29 2002 - 16:56:36 EDT

There are always consequences...

... but I am saying that you could build a locale that would work. Generally speaking, most programming environments do not look at the Unicode character database for the operations in question, or at least don't look directly at those tables. They use custom-generated tables or code. For example, from what I know of Java's internal structure, it would be relatively easy to construct the necessary classes.

For example, you can create a rule string for RuleBasedCollator that does collation of @, since the collator doesn't look at the character properties when performing sorting (normalization is another matter, though). A BreakIterator can be fashioned that doesn't break on the @ character. Localized strings (as in DateFormat's list of month names, for example) are just strings. And so on.
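
To make the collation point concrete, here is a minimal sketch of such a rule string. It assumes a purely hypothetical ordering in which "@" is a letter sorting between "a" and "b"; the class name, the rule string and the sample words are illustrative assumptions, not taken from any real orthography. Note that "@" is itself a syntax character in java.text rule strings, so it has to be quoted.

```java
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.Arrays;

public class AtSignCollation {
    // Hypothetical tailoring: '@' is treated as a letter between 'a' and 'b'.
    // The single quotes are required because '@' is a rule-syntax character.
    static final String RULES = "< a < '@' < b < c";

    // Returns a sorted copy of the input using the tailored collator.
    static String[] sort(String... words) {
        try {
            RuleBasedCollator collator = new RuleBasedCollator(RULES);
            String[] sorted = words.clone();
            Arrays.sort(sorted, collator);
            return sorted;
        } catch (ParseException e) {
            // Only reachable if RULES is malformed.
            throw new IllegalStateException("bad collation rules", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(sort("b", "@", "a", "c")));
    }
}
```

The collator never consults the General_Category of "@"; the ordering comes entirely from the rule string, which is why a locale can carry it.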

The consequences would generally come into play when you encounter code that DOES look at Unicode properties (or looks at a table that is not locale-driven). You'll get transient failures in that case.
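
As an illustration of that kind of failure, the sketch below contrasts a generic tokenizer that relies only on the Unicode Letter property with a tailored one that also accepts "@". The sample text "ta@a mo@" is invented for the example and is not real Tongva; the class and pattern names are likewise assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PropertyDrivenTokenizer {
    // Generic code: a "word" is any run of characters with the Letter property.
    static final Pattern GENERIC = Pattern.compile("\\p{L}+");

    // Hypothetical locale-aware variant: '@' is also allowed inside a word.
    static final Pattern TAILORED = Pattern.compile("[\\p{L}@]+");

    // Collects every match of the given word pattern, in order.
    static List<String> tokens(Pattern wordPattern, String text) {
        List<String> out = new ArrayList<>();
        Matcher m = wordPattern.matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "ta@a mo@"; // invented sample text
        System.out.println(tokens(GENERIC, text));  // words are cut at '@'
        System.out.println(tokens(TAILORED, text)); // words stay whole
    }
}
```

Any library that hard-codes something like the first pattern will split the words no matter how carefully the locale data was built.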

IOW, the Unicode properties are not just guides. Building "complete Unicode support" means taking all the special cases and special pleading into account. Creating a new orthography for a minority language should probably take this into account, since what one is doing in a small, insular community may be ignored or resisted by Unicode implementers, especially if the result cannot easily be fitted into existing support mechanisms.

The best course of action, if you have the freedom to pursue it, is to choose characters that have properties similar to those of the orthographic unit you are mapping. "@" has lots of problems: it isn't legal as a "word-part" in a URL, for example; it is identified as punctuation (so code that doesn't know about your locale may word- or line-break on it); and it has no case mapping (so you're at the mercy of SpecialCasing, etc.). It is likely that any special cases that you create for ASCII characters will be more of an annoyance for Unicode implementers and thus tend not to be supported. Avoiding the creation of special cases is a Good Idea.
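
A quick way to check a candidate character against these criteria is to ask the runtime what it believes about it. The sketch below uses java.lang.Character; the class name and the describe helper are made up for illustration, and the results reflect whatever Unicode data the running JVM ships with.

```java
public class AtSignProperties {
    // Summarizes the properties that generic code is likely to consult.
    static String describe(char c) {
        boolean letter = Character.isLetter(c);
        boolean punctuation = Character.getType(c) == Character.OTHER_PUNCTUATION;
        // "cased" here simply means a simple case mapping changes the character.
        boolean cased = Character.toUpperCase(c) != c || Character.toLowerCase(c) != c;
        return String.format("U+%04X letter=%b punctuation=%b cased=%b",
                (int) c, letter, punctuation, cased);
    }

    public static void main(String[] args) {
        // '@', '!' and the apostrophe are the candidates mentioned in this thread;
        // 'a' is included for comparison.
        for (char c : new char[] {'@', '!', '\'', 'a'}) {
            System.out.println(describe(c));
        }
    }
}
```

A character that reports letter=false and punctuation=true is one that locale-unaware code will tend to treat as a word boundary.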

There are, of course, several orthographies, some with quite large speaker populations, that have this potential issue. One that occurs to me might be the Khoisan languages of Africa, which I believe commonly use "!" (U+0021) for a click sound. This is almost exactly the same problem you are describing for Tongva.

Nonetheless, if you glance at the "SpecialCasing" file in Unicode, you will note that almost without exception the entries are locale driven. The first stop in creating a new orthography (or computerizing an existing one, perhaps from the days of the typewriter), for my money would probably be to get ISO-639 to issue the language a 2-letter code so you can have locale (and Unicode character database) data tagged with it ;-).
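
The Turkish dotted/dotless i is the familiar example of a locale-driven SpecialCasing entry, and it shows why having a language code matters. A minimal sketch, assuming a Java runtime that implements the Turkish tailoring in String.toUpperCase(Locale); the class and method names are invented for the example.

```java
import java.util.Locale;

public class LocaleDrivenCasing {
    // Uppercases s under the locale identified by a language tag such as "en" or "tr".
    static String upper(String s, String languageTag) {
        return s.toUpperCase(Locale.forLanguageTag(languageTag));
    }

    // Formats the first character as U+XXXX so the output does not depend on console encoding.
    static String firstCodePoint(String s) {
        return String.format("U+%04X", s.codePointAt(0));
    }

    public static void main(String[] args) {
        System.out.println("en: " + firstCodePoint(upper("i", "en")));
        System.out.println("tr: " + firstCodePoint(upper("i", "tr")));
    }
}
```

The same input yields different results because the case mapping is keyed on the language code; an orthography without a code has nothing to hang such a tailoring on.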

Best Regards,


Addison P. Phillips
Director, Globalization Architecture
webMethods, Inc.
432 Lakeside Drive
Sunnyvale, California, USA
+1 408.962.5487 (phone)
+1 408.210.3569 (mobile)
Internationalization is an architecture.
It is not a feature.

> -----Original Message-----
> From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org]On
> Behalf Of Curtis Clark
> Sent: Friday, July 26, 2002 11:23 PM
> To: unicode@unicode.org
> Subject: Re: REALLY *not* Tamil - changing scripts (long)
> Addison Phillips [wM] wrote:
> > Obviously I'm not an expert in these linguistic areas (and hence
> > rarely comment on them), but it seems to me that the lack of other
> > mechanisms makes Unicode an attractive target for criticism in this
> > area.
> Certainly no Unicode-bashing was intended (I'm more of a Unicode
> evangelist). I guess I'm confused about the use of Unicode character
> properties. Are you saying that, even though Unicode defines U+0027 as
> punctuation, other, I could use it as a glottal stop and create a locale
> that would treat it as a letter (and still be "Unicode compliant",
> whatever that is?). And if that's the case, are the Unicode properties
> just guides? Could I develop an orthography where Yßяبձ⁋ would be a
> word, and there would be no consequences?
> --
> Curtis Clark http://www.csupomona.edu/~jcclark/
> Mockingbird Font Works http://www.mockfont.com/
