RE: unicode and malayalam

From: John Hudson (tiro@tiro.com)
Date: Fri Dec 03 1999 - 16:15:14 EST


At 04:13 AM 03-12-99 -0800, Marco.Cimarosti@icl.com wrote:

>RajKumar wrote (about Malayalam):
>>reformed the script in 1973 and that script is being taught in
>>the schools but some of the old script is being written in the
>>newspapers magazines etc since it is better looking (according
>>to some - mostly old people and writers) and it saves space when
>>printed.

>This is the case for many languages and, maybe, is another thing to be
>considered in the "language tagging" discussion: languages change also over
>time, not only over space.

This is something we've considered in specifying language tags for
OpenType, and Malayalam is one of the languages for which we have two
language tags associated with the same script tag:

        Language                  ScriptTag  LanguageTag

        Malayalam (Traditional)   mlym       MAL
        Malayalam (Reformed)      mlym       MLR

There are similar arrangements for Irish and for Georgian.

This tag system is internal to OpenType, and it differs from the language
tagging used elsewhere because it serves a different purpose. The function
of the OpenType language tags is to enable typographic variants and layout
features, either as exceptions to default script behaviour (e.g. Turkish
exceptions to Latin ligature forming) or by accessing language specific
stylistic variants using the /locl/ feature tag (e.g. traditional Serbian
forms of Cyrillic letters).
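
To make the "exception to default script behaviour" idea concrete, here is a
minimal Python sketch of such a lookup. It is not how any OpenType engine is
actually implemented. The table contents, the function name features_for, and
the tags TRK, SRB, cyrl and liga are assumptions chosen for illustration; only
latn and locl appear in this message.

```python
# Hypothetical table: (script tag, language tag) -> layout features to apply.
# A language tag of None stands for the script's default behaviour.
FEATURES = {
    ("latn", None): ["liga"],    # assumed default: Latin ligatures are formed
    ("latn", "TRK"): [],         # assumed Turkish exception: no ligature forming
    ("cyrl", None): [],          # assumed default: no language-specific forms
    ("cyrl", "SRB"): ["locl"],   # assumed Serbian entry: localised letter forms
}


def features_for(script, language=None):
    """Return the features for a script/language pair.

    Falls back to the script's default entry when the pair has no entry of
    its own, and to an empty list when the script is unknown.
    """
    if (script, language) in FEATURES:
        return FEATURES[(script, language)]
    return FEATURES.get((script, None), [])


if __name__ == "__main__":
    for pair in [("latn", "ENG"), ("latn", "TRK"), ("cyrl", "SRB")]:
        print(pair, features_for(*pair))
```

A language with no entry of its own (ENG here) simply inherits the script
default, which is why only the exceptions need to be listed.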

>So, tags like en-US, en-UK, en-IE, etc. do not exhaust the possibilities.
>Someone could also need things like en-ANGLOSAXON, en-ELISABETHAN, etc...
>Has anyone experienced the performance of en-UK spellcheckers or hyphenizers
>applied to Shakespeare's plays? A disaster!

Anglo-Saxon gets treated as a separate language in most tagging systems
which acknowledge it. A distinction might be made between East and West
Anglo-Saxon, although only the latter is supported in Unicode without
reliance on combining diacritics (there is no precomposed codepoint for oe
with macron).

The trouble with trying to tag historical variants of English is that
grammar and spelling rules were only lately standardised. A spellchecker
which works for Shakespeare wouldn't necessarily work for Marlowe or Johnson.

>Someone already raised a similar issue about Chinese: zh-CH (for "simplified
>hanzi") and zh-TW (for "traditional hanzi") are poor ways to express script
>variations that are not really "national variants" (both the PRC and Taiwan
>claim to be the same country, btw!) but rather "different spelling
>traditions".

I agree.

The OpenType tags are neutral in this regard:

        Chinese (Simplified)      hani       CHS
        Chinese (Traditional)     hani       CHT

By the way, 'language' in this context really means 'script used in a
particular way'. For a language like English, the tag /ENG/ associated with
the script /latn/ happens to mean 'The Latin script as used to write the
English language'. But the tags /IRI/ and /IRT/ for Irish are both
associated with the /latn/ tag and mean, respectively, 'The Latin script as
used to write the Irish language in the modern orthography' and 'The Latin
script as used to write the Irish language in the traditional orthography'.
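
The reading of a tag pair as 'script used in a particular way' can be
sketched in Python as follows. The dictionary holds only the three pairs named
in the paragraph above. The names ORTHOGRAPHIES and describe are hypothetical
and not part of OpenType.

```python
# (script tag, language tag) -> (script name, language name, orthography).
# An orthography of None means the pair names no particular orthography.
ORTHOGRAPHIES = {
    ("latn", "ENG"): ("Latin", "English", None),
    ("latn", "IRI"): ("Latin", "Irish", "modern"),
    ("latn", "IRT"): ("Latin", "Irish", "traditional"),
}


def describe(script, language):
    """Spell out what a script/language tag pair means."""
    script_name, language_name, orthography = ORTHOGRAPHIES[(script, language)]
    text = f"The {script_name} script as used to write the {language_name} language"
    if orthography is not None:
        text += f" in the {orthography} orthography"
    return text


if __name__ == "__main__":
    for pair in ORTHOGRAPHIES:
        print(pair, "=", describe(*pair))
```

Keying on the pair rather than on the language tag alone reflects the point
that a language tag has meaning only in combination with a script tag.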

John Hudson

Tiro Typeworks
Vancouver, BC
www.tiro.com
tiro@tiro.com


