Re: Indic Syllabic Categories from Richard Wordingham on 2015-02-25 (Unicode Mail List Archive)

From: Richard Wordingham <richard.wordingham_at_ntlworld.com>
Date: Wed, 25 Feb 2015 23:08:59 +0000

On Sat, 17 May 2014 11:56:35 +0100
Richard Wordingham <richard.wordingham_at_ntlworld.com> wrote:

> I've reviewed the application of the revised categories as set forth
> in L2/14-126
> (http://www.unicode.org/L2/L2014/14126r-indic-properties.pdf) as
> applied to the Thai, Lao and Tai Tham scripts, and noted a few other
> characters, and come up with the following proposed changes of
> syllabic category.

I've just submitted a slightly different set of changes via the Unicode
report function. They were updated to take into account other proposed
changes and also Microsoft's new 'Universal Shaping Engine'. The
submitted comment follows.

Richard.

I've reviewed the application of the revised categories as set forth in
L2/14-126 (http://www.unicode.org/L2/L2014/14126r-indic-properties.pdf)
as applied to the Thai, Lao and Tai Tham scripts, and noted a few other
characters, and come up with the following proposed changes of syllabic
category. I have also taken into account the proposals of Roozbeh
Pournader of 24 February 2015 related to work on the Universal Shaping
Engine.

I've come up with 3 new characters of category Bindu:
0303 ; Bindu # Mn COMBINING TILDE
0310 ; Bindu # Mn COMBINING CANDRABINDU
1A74 ; Bindu # Mn TAI THAM SIGN MAI KANG (currently Vowel_Dependent)

Note that U+0ECD LAO NIGGAHITA and U+1A74 each function both as Bindu
and as Vowel_Dependent. U+0303 is used in Patani Malay in the Thai
script - see UTC document L2/10-451. U+0310 is used for Sanskrit in
Tamil script, according to Indic list email 'Re: Tamil Punctuation',
27/7/12 9:24 +0530 from Shriramana Sharma.
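
For anyone wanting to process the proposed assignments mechanically, here is a minimal sketch of a parser for lines laid out as 'code point ; category # comment', which is the layout used in this message. The names LINE, parse_property_lines and SAMPLE are made up for illustration, and the sketch deliberately ignores code point ranges and tentative values marked with '?'.

```python
import re

# One proposal line: hex code point, ';', category name, optional '# comment'.
# Assumption: single code points only; ranges such as 0900..0902 are not handled.
LINE = re.compile(r"^\s*([0-9A-Fa-f]{4,6})\s*;\s*(\w+)\s*(?:#\s*(.*))?$")

def parse_property_lines(text):
    """Return {code point: (category, comment)} for every line that matches."""
    result = {}
    for raw in text.splitlines():
        match = LINE.match(raw)
        if match is None:
            continue  # skip prose, blank lines and anything not in the layout
        code_point = int(match.group(1), 16)
        comment = (match.group(3) or "").strip()
        result[code_point] = (match.group(2), comment)
    return result

# The three Bindu lines proposed above.
SAMPLE = """\
0303 ; Bindu # Mn COMBINING TILDE
0310 ; Bindu # Mn COMBINING CANDRABINDU
1A74 ; Bindu # Mn TAI THAM SIGN MAI KANG (currently Vowel_Dependent)
"""

if __name__ == "__main__":
    for cp, (category, comment) in sorted(parse_property_lines(SAMPLE).items()):
        print(f"U+{cp:04X} {category:<8} {comment}")
```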

I've found 4 new characters of category Visarga:
0E30 ; Visarga # Lo THAI CHARACTER SARA A
0EB0 ; Visarga # Lo LAO VOWEL SIGN A
1A61 ; Visarga # Mc TAI THAM VOWEL SIGN A
19B0 ; Visarga # Mc (to be Lo) NEW TAI LUE VOWEL SIGN VOWEL SHORTENER

Note that the tone (or voice modulation) character U+1038 MYANMAR SIGN
VISARGA is currently classified as Visarga. U+0E30 is used as visarga
in Sanskrit, e.g. in the Royal Institute Dictionary. The typical sound
of the four visargas above is /ʔ/ rather than /h/, and, through a
feature of Tai (SW Tai?) phonology, they all have the additional
function of shortening a vowel. As a vowel shortener, U+1A61 and U+19B0
may follow a final consonant.

These 4 characters are currently classified as Vowel_Dependent. Except
for the Lao script, that usage can easily be interpreted as a
modification of the implicit vowel. Modern Lao does not acknowledge the
existence of an implicit vowel, so that interpretation may be harder to
accept. (Vowel_Dependent U+0EB1 LAO VOWEL SIGN MAI KAN is also a vowel
shortener; in the 19th century it was denied that Vowel_Dependent
U+0E31 THAI CHARACTER MAI HAN-AKAT was a vowel in Thai.)

U+1A61 occasionally has the sound /k/, especially when used in
conjunction with U+1A62 TAI THAM VOWEL SIGN MAI SAT. I think we should
regard this as just one of the uses of visarga.

I've found 3 new nuktas, at least, so long as the application of nukta
is not restricted to *foreign* consonants.

0331 ; Nukta # Mn COMBINING MACRON BELOW
0359 ; Nukta # Mn COMBINING ASTERISK BELOW
1A7F ; Nukta # Mn TAI THAM COMBINING CRYPTOGRAMMIC DOT

U+0331 is used in Patani Malay in the Thai script - see L2/10-451 and
the consonant chart on p16 of
http://mlenetwork.org/sites/default/files/Patani%20Malay%20Presentation%20-%20Part%202.pdf.
U+0331 and U+0359 have been used in English-Thai dictionaries to
represent English sounds, very much a nukta role. They were previously
classified as 'Other', though there is a proposal to make U+1A7F
'Syllable_Modifier'. U+0EC8 LAO TONE MAI EK functions as Nukta in Khmu
as well as performing its principal rôle of Tone_Mark in Lao. U+0E3A
THAI CHARACTER PHINTHU is used both as Nukta and as Pure_Killer; the
latter is its traditional rôle.

I've found 4 new pure killers, all currently classified as 'Other',
though there is a proposal to classify U+0E4C (along with U+17CD) as
'Consonant_Killer'. They are:

0E4C ; Pure_Killer # Mn THAI CHARACTER THANTHAKHAT
0ECC ; Pure_Killer # Mn LAO CANCELLATION MARK
1A7C ; Pure_Killer # Mn TAI THAM SIGN KHUEN-LUE KARAN
1A7A ; Pure_Killer # Mn TAI THAM SIGN RA HAAM

U+0E4C THAI CHARACTER THANTHAKHAT and U+0E4E THAI CHARACTER YAMAKKAN
once divided the role of vowel killing - U+0E4E formed clusters and
U+0E4C removed final vowels. The use of U+0E4C came to be largely
restricted to vowels associated with clusters of consonants. Removing
the vowel made the final consonant of the cluster silent (spoken Thai
does not permit final consonant clusters), and from this effect it has
been reinterpreted as a consonant-killer. U+0ECC probably had the same
behaviour as U+0E4C. I don't know if it is still used in Laos - foreign
loanwords often don't follow the rules.

The Tai Tham marks are still at the transitional stage - they are
sometimes found on final unsubscripted consonants to indicate that they
have no vowel. There is an unfortunate overlap with the final consonant
mark for <r> (pronunciation necessarily /n/). The Khuen and Lue form of
the final consonant symbol has the same shape as the Thai and Lao form
of the pure killer. Consequently U+1A7A serves as Consonant_Final in
Tai Khuen and Tai Lue. In Tai Khuen, at least, the use as a final
consonant seems to have recently fallen into disfavour, so it seems
most appropriate to classify U+1A7A as 'Pure_Killer'. I noted above
that the 'Pure_Killer' U+0E3A THAI CHARACTER PHINTHU also serves as a
nukta. I have a vague recollection that U+0E4C THAI CHARACTER
THANTHAKHAT serves as a register mark in an orthography for the Chong
language, so that would count as an auxiliary rôle as Tone_Mark.

If 'Consonant_Killer' is to be separated from 'Pure_Killer', then we
need a separate category 'Dual_Mode_Killer' for U+1A7A and U+1A7C.

It should be noted that U+1A62 TAI THAM VOWEL SIGN MAI SAT serves not
only as Vowel_Dependent but also as Consonant_Final. This seems to be
chiefly relevant to anyone attempting to deduce the pronunciation from
the spelling.
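
Several characters discussed so far serve in more than one category. A minimal sketch of how a tool might record that is a set of roles per code point rather than a single value. The table below lists only the dual roles stated in this message; the names ROLES and chars_with_role are made up for illustration, and no ranking of principal against auxiliary role is implied.

```python
# Dual roles as described in this message; sets carry no primary/secondary order.
ROLES = {
    0x0ECD: {"Bindu", "Vowel_Dependent"},            # LAO NIGGAHITA
    0x1A74: {"Bindu", "Vowel_Dependent"},            # TAI THAM SIGN MAI KANG
    0x0EC8: {"Tone_Mark", "Nukta"},                  # LAO TONE MAI EK (Nukta in Khmu)
    0x0E3A: {"Pure_Killer", "Nukta"},                # THAI CHARACTER PHINTHU
    0x1A7A: {"Pure_Killer", "Consonant_Final"},      # TAI THAM SIGN RA HAAM
    0x1A62: {"Vowel_Dependent", "Consonant_Final"},  # TAI THAM VOWEL SIGN MAI SAT
}

def chars_with_role(role):
    """Return the code points, in ascending order, that can play the given role."""
    return sorted(cp for cp, roles in ROLES.items() if role in roles)

if __name__ == "__main__":
    for role in ("Nukta", "Consonant_Final", "Bindu"):
        listed = ", ".join(f"U+{cp:04X}" for cp in chars_with_role(role))
        print(f"{role}: {listed}")
```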

There are 4 characters currently categorised as 'Consonant' which I
think are better categorised as 'Vowel':

0E24 ; Vowel # Lo THAI CHARACTER RU
0E26 ; Vowel # Lo THAI CHARACTER LU
1A42 ; Vowel # Lo TAI THAM LETTER RUE
1A44 ; Vowel # Lo TAI THAM LETTER LUE

They serve both as independent and dependent vowels. Note that U+0E24
and U+0E26 may be followed by the length mark U+0E45 THAI CHARACTER
LAKKHANGYAO, which is categorised as 'Vowel_Dependent'. I am not aware
of any usage of U+0E45 as a true vowel.

The sequence <U+1AAD TAI THAM SIGN CAANG, U+1A63 TAI THAM VOWEL SIGN
AA> occurs with the same meaning, 'elephant', as U+1AAD. I don't know
whether this justifies changing U+1AAD from 'Other' to
'Consonant_Placeholder'.

I've found two new Consonants:

0EBD ; Consonant # Lo LAO SEMIVOWEL SIGN NYO (was Consonant_Medial)
0EDE ; Consonant # Lo LAO LETTER KHMU GO (was Other)

U+0EBD is used as an initial consonant in Khmu, so U+0EBD has been used
in all rôles in the Lao script, like U+0EA7 LAO LETTER WO, which is of
category Consonant. For information on Khmu usage, see UTC document
L2/10-335 (http://www.unicode.org/L2/L2010/10335r-n3893r-lao-hosken.pdf). The
Khmu alphabet chart included backs up the text. (It also shows U+0EC8
LAO TONE MAI EK acting as a Nukta!)

If 'repha' can be used as a general category, including for example
Myanmar script kinzi, then there are two arguable new examples,
currently categorised as Consonant_Final:

1A58 ; Consonant_Preceding_Repha? # Mn TAI THAM SIGN MAI KANG LAI
1A5A ; Consonant_Succeeding_Repha? # Mn TAI THAM CONSONANT SIGN LOW PA

There are significant issues with U+1A58; while traditionally it
behaves as repha/kinzi, some modern styles are better served by
treating it as Consonant_Final. It takes some juggling for a single
OTL-style rendering engine to be able to render either style depending
on the lookups while oblivious to the difference, but it can be done.

I've found 5 new instances of Consonant_Subjoined:
1A57 ; Consonant_Subjoined # Mc TAI THAM CONSONANT SIGN LA TANG LAI
1A5B ; Consonant_Subjoined # Mn TAI THAM CONSONANT SIGN HIGH RATHA OR LOW PA
1A5C ; Consonant_Subjoined # Mn TAI THAM CONSONANT SIGN MA
1A5D ; Consonant_Subjoined # Mn TAI THAM CONSONANT SIGN BA
1A5E ; Consonant_Subjoined # Mn TAI THAM CONSONANT SIGN SA

They were all previously categorised as Consonant_Final.

Note that U+1A57 is an abbreviation. It is derived by the addition of a
stroke to the subscript form <U+1A60 TAI THAM SIGN SAKOT, U+1A43 TAI
THAM LETTER LA>. Abbreviations of the word _tanglaai_ 'all' using U+1A57
normally include at least <U+1A57, U+1A63 TAI THAM VOWEL SIGN AA>, so
U+1A57 is not Consonant_Final. An example, apparently spelt <U+1A26
TAI THAM LETTER NGA, U+1A57, U+1A76 TAI THAM SIGN TONE-2, U+1A63 TAI
THAM VOWEL SIGN AA>, is given in Table 16 at
http://www.seasite.niu.edu/tai/TaiLue/graphic%20blends.htm.

The word ᨶᩥᨻᩛᩤᨶ <U+1A36 TAI THAM LETTER NA, U+1A65 TAI THAM VOWEL SIGN I,
U+1A3B TAI THAM LETTER LOW PA, U+1A5B, U+1A64 TAI THAM VOWEL SIGN TALL
AA, U+1A36>
_nippa:na_ 'nirvana' immediately demonstrates that U+1A5B is not a
final consonant. U+1A5C occurs in Pali proper names ending -mmo <U+1A3E
TAI THAM LETTER MA, U+1A5C, U+1A6E TAI THAM VOWEL SIGN E, U+1A63 TAI
THAM VOWEL SIGN AA>, so is clearly not a final consonant.
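
To make the position of U+1A5B in the 'nirvana' example easy to inspect, the following minimal sketch builds the word from the code points listed above and prints each character's General_Category and name using Python's standard unicodedata module. The names NIPPANA and describe are made up for illustration. The categories and names printed come from whatever Unicode version the Python build ships, so they may differ from the values current when this was written.

```python
import unicodedata

# Code points of the 'nirvana' example, in the order given in the text.
NIPPANA = [0x1A36, 0x1A65, 0x1A3B, 0x1A5B, 0x1A64, 0x1A36]

def describe(code_points):
    """Return (U+XXXX, General_Category, character name) for each code point."""
    rows = []
    for cp in code_points:
        ch = chr(cp)
        rows.append((f"U+{cp:04X}",
                     unicodedata.category(ch),
                     unicodedata.name(ch, "<unnamed>")))
    return rows

word = "".join(chr(cp) for cp in NIPPANA)

if __name__ == "__main__":
    print(word)
    for row in describe(NIPPANA):
        print(*row)
    # U+1A5B sits in the middle of the word, followed by a vowel sign and a consonant.
    print("index of U+1A5B:", NIPPANA.index(0x1A5B), "of", len(NIPPANA) - 1)
```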

U+1A5D occurs in Northern Thai principally in one word, whose
pronunciation is roughly /kɔbɔː/. U+1A5D is not Consonant_Final in its
phonetic effect. The word is a compound word (or perhaps just a visual
compound), formed by chaining two syllables and striking out
the duplicated characters. I have a text in which the constituents are
to be encoded <U+1A20 TAI THAM LETTER HIGH KA, U+1A74 TAI THAM SIGN MAI
KANG> and <U+1A37 TAI THAM LETTER BA, U+1A74, U+1A75 TAI THAM SIGN
TONE-1>, so the chained word may reasonably be encoded <U+1A20,
U+1A74, U+1A5D, U+1A75> or <U+1A20, U+1A5D, U+1A74, U+1A75>.

While all my examples of U+1A5E are word final, it seems to differ from
<U+1A60, U+1A48 TAI THAM LETTER HIGH SA> on the basis of the room
available for it. Both forms are used as a word final consonant. The
only Pali consonant cluster ending in /s/ is /ss/, and that is written
using U+1A54 TAI THAM LETTER GREAT SA, so a non-final <s> will be rare.
(I'm finding /ks/ written with U+1A47 TAI THAM LETTER HIGH SSA due to
the application of RUKI.) However, I feel it would be rash to presume
that every example of U+1A5E will be a final consonant.

I have one new Consonant_Final:

0EDF ; Consonant_Final # Lo LAO LETTER KHMU NYO (was Consonant)

See UTC document L2/10-335 for evidence.

I have one possible new Consonant_Subjoined:

1A7B ; Consonant_Subjoined # Mn TAI THAM SIGN MAI SAM

The value of its Indic_Matra_Category, if relevant, should be recorded
as Top. U+1A7B is principally a repetition mark, indicating the
repetition of a word. As extensions of this role, it can also do at
least the following:

(1) Indicate a repeated (not geminate) consonant
(2) Indicate an omitted implicit vowel (one omits an implicit vowel by
    replacing it with U+1A60)
(3) Indicate an epenthetic vowel (extension of Role 2).

In rôle (1), it serves as a subjoined consonant. In rôles (2) and (3),
it serves as a dependent vowel. For a shaper that does not constrain
appearance, such as the Universal Shaping Engine, the best
categorisation is probably 'Consonant_Subjoined'.

Although U+1A55 TAI THAM CONSONANT SIGN MEDIAL RA and U+1A56 TAI THAM
CONSONANT SIGN MEDIAL LA are named as medial consonants, too much
should not be read into such a description. Both are, very
occasionally, immediately preceded by vowels, and both may be followed
by <U+1A60 TAI THAM SIGN SAKOT, U+1A40 TAI THAM LETTER HIGH YA> and
<U+1A60, U+1A45 TAI THAM LETTER WA>. While the latter two sequences
most commonly represent vowels, the strictly consonantal cluster
<U+1A49 TAI THAM LETTER HIGH HA, U+1A56, U+1A60, U+1A45> starts a few
words beginning with the cluster /lw/. This is a behaviour the
Universal Shaping Engine of Microsoft currently disallows for medial
consonants.

We should therefore have:
1A55 ; Consonant_Subjoined # Mc TAI THAM CONSONANT SIGN MEDIAL RA
1A56 ; Consonant_Subjoined # Mn TAI THAM CONSONANT SIGN MEDIAL LA
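
The effect of the recategorisation on the /lw/ cluster can be illustrated with a toy model. The rule below (a base, then any number of subjoined units, then any number of medials) is an assumption made up for this sketch and is NOT the actual cluster grammar of the Universal Shaping Engine. It only mimics the restriction described above, namely that nothing subjoined may follow a medial consonant. The names CLUSTER, classify and is_single_cluster are likewise made up.

```python
import re

# Toy rule, an assumption for illustration only (not the real USE grammar):
#   B = base consonant, H = SAKOT (stacker), S = subjoined sign, M = medial sign.
# A cluster is a base, then subjoined units (S or H+B), then medials.
CLUSTER = re.compile(r"B(S|HB)*M*")

# Current-style classes for the characters in <U+1A49, U+1A56, U+1A60, U+1A45>.
DEFAULT_CLASSES = {0x1A49: "B", 0x1A45: "B", 0x1A60: "H", 0x1A56: "M"}

def classify(seq, overrides=None):
    """Map a code point sequence to a string of one-letter toy classes."""
    table = dict(DEFAULT_CLASSES)
    table.update(overrides or {})
    return "".join(table[cp] for cp in seq)

def is_single_cluster(seq, overrides=None):
    """True if the whole sequence fits the toy cluster rule."""
    return CLUSTER.fullmatch(classify(seq, overrides)) is not None

# HIGH HA, MEDIAL LA, SAKOT, WA: the strictly consonantal /lw/ cluster.
LW = [0x1A49, 0x1A56, 0x1A60, 0x1A45]

if __name__ == "__main__":
    print("U+1A56 as medial:   ", is_single_cluster(LW))
    print("U+1A56 as subjoined:", is_single_cluster(LW, {0x1A56: "S"}))
```

Under this toy rule the sequence is rejected while U+1A56 is treated as a medial and accepted once it is treated as subjoined, which is the point of the proposed change.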

I actually see no benefits for rendering engines in distinguishing
Consonant_Medial and Consonant_Subjoined, though the contrast may help
in locating phonetic syllable boundaries.

Received on Wed Feb 25 2015 - 17:10:21 CST
