Re: Indic Syllabic Categories from Richard Wordingham on 2014-05-17 (Unicode Mail List Archive)

From: Richard Wordingham <richard.wordingham_at_ntlworld.com>
Date: Sat, 17 May 2014 11:56:35 +0100

On Mon, 12 May 2014 18:43:04 +0000
"Whistler, Ken" <ken.whistler_at_sap.com> wrote:

> My suggestion, for those who are interested in this topic, would be
> to review the relevant data files, implied script behaviors, and
> documents and proposals in the UTC document register -- and over the
> course of the next year participate in providing feedback on this
> topic and the data files, so that if/when the files and related
> properties become informative for Unicode 8.0 next year sometime,
> these questions and any concerns about the various edge cases as
> applied to Southeast Asian scripts, can be addressed before the
> properties become more difficult to update.

I've reviewed the application of the revised categories as set forth in
L2/14-126 (http://www.unicode.org/L2/L2014/14126r-indic-properties.pdf)
as applied to the Thai, Lao and Tai Tham scripts, and noted a few
other characters, and come up with the following proposed changes of
syllabic category. I present them here rather than submit them as
feedback *immediately*. Some of these changes are tentative and would
benefit from discussion.

I've come up with 3 new characters of category Bindu:

0303 ; Bindu # Mn COMBINING TILDE
0310 ; Bindu # Mn COMBINING CANDRABINDU
1A74 ; Bindu # Mn TAI THAM SIGN MAI KANG (was Vowel_Dependent)

Note that U+0ECD LAO NIGGAHITA and U+1A74 each function both as Bindu
and as Vowel_Dependent.

U+0303 is used in Patani Malay in the Thai script - see UTC
document L2/10-451. U+0310 is used for Sanskrit in Tamil script,
according to Indic list email 'Re: Tamil Punctuation', 27/7/12 9:24
+0530 from Shriramana Sharma.
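
The proposal lines in this message follow the layout of the UCD
property files (code point ; value # General_Category NAME). As a rough
sketch only, not part of any official tooling, such lines can be parsed
and the quoted General_Category and name cross-checked against Python's
built-in unicodedata module. The names parse_line and check_line are
hypothetical, the '(was ...)' annotation is left off the sample data,
and the outcome depends on the Unicode version bundled with the Python
in use.

```python
import unicodedata

# Sample data copied from the Bindu list above, without the "(was ...)" note.
PROPOSED_BINDU = """\
0303 ; Bindu # Mn COMBINING TILDE
0310 ; Bindu # Mn COMBINING CANDRABINDU
1A74 ; Bindu # Mn TAI THAM SIGN MAI KANG
"""

def parse_line(line):
    """Split 'XXXX ; Value # Gc NAME' into (char, value, gc, name)."""
    data, _, comment = line.partition("#")
    code, _, value = data.partition(";")
    gc, _, name = comment.strip().partition(" ")
    return chr(int(code.strip(), 16)), value.strip(), gc, name.strip()

def check_line(line):
    """True if the quoted General_Category and name agree with unicodedata."""
    ch, _value, gc, name = parse_line(line)
    return unicodedata.category(ch) == gc and unicodedata.name(ch, "") == name

results = [check_line(line) for line in PROPOSED_BINDU.splitlines()]
print(results)
```

A False entry would point at a mistyped code point, name or
General_Category in the list, or at a value that changed in a later
Unicode version.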

I've found 4 new characters of category Visarga:

0E30 ; Visarga # Lo THAI CHARACTER SARA A
0EB0 ; Visarga # Lo LAO VOWEL SIGN A
1A61 ; Visarga # Mc TAI THAM VOWEL SIGN A
19B0 ; Visarga # Mc NEW TAI LUE VOWEL SIGN VOWEL SHORTENER

Note that the tone (or voice modulation) character U+1038 MYANMAR SIGN
VISARGA is currently classified as Visarga. U+0E30 is used as
visarga in Sanskrit, e.g. in the Royal Institute Dictionary.
The typical sound of the four visargas above is /ʔ/ rather
than /h/, and, through a feature of Tai (SW Tai?) phonology, they all
have the additional function of shortening a vowel. As a vowel
shortener, U+1A61 and U+19B0 may follow a final consonant.

These 4 characters are currently classified as Vowel_Dependent. Except
for the Lao script, that usage can easily be interpreted as a
modification of the implicit vowel. Modern Lao does not acknowledge
the existence of an implicit vowel, so that interpretation may be harder
to accept. (Vowel_Dependent U+0EB1 LAO VOWEL SIGN MAI KAN is also a
vowel shortener; in the 19th century it was denied that Vowel_Dependent
U+0E31 THAI CHARACTER MAI HAN-AKAT was a vowel in Thai.)

U+1A61 occasionally has the sound /k/, especially when used in
conjunction with U+1A62 TAI THAM VOWEL SIGN MAI SAT. I think we should
regard this as just one of the uses of visarga.

I've found 3 new nuktas, at least, so long as the application of nukta
is not restricted to *foreign* consonants.

0331 ; Nukta # Mn COMBINING MACRON BELOW
0359 ; Nukta # Mn COMBINING ASTERISK BELOW
1A7F ; Nukta # Mn TAI THAM COMBINING CRYPTOGRAMMIC DOT

U+0331 is used in Patani Malay in the Thai script - see L2/10-451 and
the consonant chart on p16 of
http://mlenetwork.org/sites/default/files/Patani%20Malay%20Presentation%20-%20Part%202.pdf.
U+0331 and U+0359 have been used in English-Thai dictionaries to
represent English sounds, very much a nukta role.

They were previously classified as 'Other'.

U+0EC8 LAO TONE MAI EK functions as Nukta in Khmu as well as performing
its principal rôle of Tone_Mark in Lao.

U+0E3A THAI CHARACTER PHINTHU is used both as Nukta and as Pure_Killer;
the latter is its traditional rôle, and its current classification.

I've found 4 new pure killers, all currently classified as 'Other'.
They are:

0E4C ; Pure_Killer # Mn THAI CHARACTER THANTHAKHAT
0ECC ; Pure_Killer # Mn LAO CANCELLATION MARK
1A7C ; Pure_Killer # Mn TAI THAM SIGN KHUEN-LUE KARAN
1A7A ; Pure_Killer # Mn TAI THAM SIGN RA HAAM

U+0E4C THAI CHARACTER THANTHAKHAT and U+0E4E THAI CHARACTER YAMAKKAN
once divided the role of vowel killing - U+0E4E formed clusters and
U+0E4C removed final vowels. The use of U+0E4C came to be largely
restricted to vowels associated with clusters of consonants. Removing
the vowel made the final consonant of the cluster silent (spoken Thai
does not permit final consonant clusters), and from this effect it has
been reinterpreted as a consonant-killer.

U+0ECC probably had the same behaviour as U+0E4C. I don't know if it is
still used in Laos - foreign loanwords often don't follow the rules.

The Tai Tham marks are still at the transitional stage - they are
sometimes found on final unsubscripted consonants to indicate that they
have no vowel. There is an unfortunate overlap with the final
consonant mark for <r> (pronunciation necessarily /n/). The Khuen and
Lue form of the final consonant symbol has the same shape as the Thai
and Lao form of the pure killer. Consequently U+1A7A serves as
Consonant_Final in Tai Khuen and Tai Lue. In Tai Khuen, at least,
the use as a final consonant seems to have recently fallen into
disfavour, so it seems most appropriate to classify U+1A7A as
'Pure_Killer'.

I noted above that the 'Pure_Killer' U+0E3A THAI CHARACTER PHINTHU
also serves as a nukta. I have a vague recollection that
U+0E4C THAI CHARACTER THANTHAKHAT serves as a register mark in an
orthography for the Chong language, so that would count as an auxiliary
rôle as Tone_Mark.
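
Several of the characters discussed so far have a principal category
plus a secondary rôle, which a single-valued property cannot record. A
minimal sketch of how those pairs might be tabulated for discussion
follows. The table name ROLES and the helper chars_with_role are
hypothetical, and the entries only restate what this message says
(including the tentative Chong usage); they are not data from any
Unicode file.

```python
# Principal category first, then secondary roles, as described in this message.
ROLES = {
    0x1A74: ("Bindu", ["Vowel_Dependent"]),        # TAI THAM SIGN MAI KANG (proposed)
    0x0EC8: ("Tone_Mark", ["Nukta"]),              # LAO TONE MAI EK, Nukta in Khmu
    0x0E3A: ("Pure_Killer", ["Nukta"]),            # THAI CHARACTER PHINTHU
    0x0E4C: ("Pure_Killer", ["Tone_Mark"]),        # THANTHAKHAT; Chong use is a vague recollection
    0x1A7A: ("Pure_Killer", ["Consonant_Final"]),  # TAI THAM SIGN RA HAAM (proposed)
}

def chars_with_role(role):
    """Code points whose principal or secondary roles include `role`."""
    return sorted(
        f"U+{cp:04X}"
        for cp, (primary, extra) in ROLES.items()
        if role == primary or role in extra
    )

print(chars_with_role("Nukta"))
print(chars_with_role("Pure_Killer"))
```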

I think I have found one new 'Vowel_Independent', U+1A53 TAI THAM
LETTER LAE, currently classified as 'Consonant'. However, it does not
freely combine with true dependent vowels. It does pleonastically
combine with U+1A6F TAI THAM VOWEL SIGN AE; U+1A53 arises as an
abbreviation for <U+1A43 TAI THAM LETTER LA, U+1A6F...>.

1A53 ; Vowel_Independent # Lo TAI THAM LETTER LAE - or is it?

It should be noted that U+1A62 TAI THAM VOWEL SIGN MAI SAT serves not
only as Vowel_Dependent but also as Consonant_Final. This seems to
be chiefly relevant to anyone attempting to deduce the pronunciation
from the spelling.

There are 4 characters currently categorised as 'Consonant' which I
think are better categorised as 'Vowel':

0E24 ; Vowel # Lo THAI CHARACTER RU
0E26 ; Vowel # Lo THAI CHARACTER LU
1A42 ; Vowel # Lo TAI THAM LETTER RUE
1A44 ; Vowel # Lo TAI THAM LETTER LUE

They serve both as independent and dependent vowels. Note that U+0E24
and U+0E26 may be followed by the length mark U+0E45 THAI CHARACTER
LAKKHANGYAO, which is categorised as 'Vowel_Dependent'. I am not aware
of any usage of U+0E45 as a true vowel.

The sequence <U+1AAD TAI THAM SIGN CAANG, U+1A63 TAI THAM VOWEL SIGN
AA> occurs with the same meaning, 'elephant', as U+1AAD. I don't know
whether this justifies changing U+1AAD from 'Other' to
'Consonant_Placeholder'.

I've found 2 new Consonants:

0EBD ; Consonant # Lo LAO SEMIVOWEL SIGN NYO (was Consonant_Medial)
0EDE ; Consonant # Lo LAO LETTER KHMU GO (was Other)

U+0EBD is used as an initial consonant in Khmu, so U+0EBD has been used
in all rôles in the Lao script, like U+0EA7 LAO LETTER WO, which is of
category Consonant. For information on Khmu usage, see UTC document
L2/10-335
(http://www.unicode.org/L2/L2010/10335r-n3893r-lao-hosken.pdf). The
omission of U+0EDE and U+0EDF is such a shock that I submitted an error
report as I was drafting the email. The Khmu alphabet chart
included in that document backs up the text. (It also shows U+0EC8 LAO
TONE MAI EK acting as a Nukta!)

If 'repha' can be used as a general category, including for example
Myanmar script kinzi, then there are two arguable new examples,
currently categorised as Consonant_Final:

1A58 ; Consonant_Preceding_Repha? # Mn TAI THAM SIGN MAI KANG LAI
1A5A ; Consonant_Succeeding_Repha? # Mn TAI THAM CONSONANT SIGN LOW PA

There are significant issues with U+1A58; while traditionally it
behaves as repha/kinzi, some modern styles are better served by
treating it as Consonant_Final. It takes some juggling for a single
OTL-style rendering engine to be able to render either style depending
on the lookups while oblivious to the difference, but it can be done.

I've found 5 new instances of Consonant_Subjoined:

1A57 ; Consonant_Subjoined # Mc TAI THAM CONSONANT SIGN LA TANG LAI
1A5B ; Consonant_Subjoined # Mn TAI THAM CONSONANT SIGN HIGH RATHA OR LOW PA
1A5C ; Consonant_Subjoined # Mn TAI THAM CONSONANT SIGN MA
1A5D ; Consonant_Subjoined # Mn TAI THAM CONSONANT SIGN BA
1A5E ; Consonant_Subjoined # Mn TAI THAM CONSONANT SIGN SA

They were all previously categorised as Consonant_Final.

Note that U+1A57 is an abbreviation. It is derived by the
addition of a stroke to the subscript form <U+1A60 TAI THAM SIGN
SAKOT, U+1A43 TAI THAM LETTER LA>. Abbreviations of the word _tanglaai_
using U+1A57 normally include at least <U+1A57, U+1A63 TAI THAM VOWEL
SIGN AA>, so U+1A57 is not Consonant_Final.

The word ᨶᩥᨻᩛᩤᨶ <U+1A36 TAI THAM LETTER NA, 1A65 TAI THAM VOWEL SIGN I,
1A3B TAI THAM LETTER LOW PA, 1A5B, 1A64 TAI THAM VOWEL SIGN TALL AA,
1A36> _nippa:na_ 'nirvana' immediately demonstrates that U+1A5B is not a
final consonant.
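
For readers without a Tai Tham font, the spelling can be inspected
code point by code point. The following is a small sketch using
Python's unicodedata module; the helper names describe and followers
are hypothetical. It lists the characters of the word and shows what
comes after U+1A5B.

```python
import unicodedata

# <NA, VOWEL SIGN I, LOW PA, U+1A5B, VOWEL SIGN TALL AA, NA>, as listed above.
NIPPANA = "\u1A36\u1A65\u1A3B\u1A5B\u1A64\u1A36"

def describe(text):
    """One 'U+XXXX NAME' string per character."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}" for ch in text]

def followers(text, mark):
    """Everything after the first occurrence of `mark` in `text`."""
    return text[text.index(mark) + 1:]

for line in describe(NIPPANA):
    print(line)

# A vowel sign and a further consonant still follow U+1A5B.
print(describe(followers(NIPPANA, "\u1A5B")))
```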

U+1A5C occurs in the Pali proper names ending -mmo <U+1A3E TAI THAM
LETTER MA, U+1A5C, U+1A6E TAI THAM VOWEL SIGN E, U+1A63 TAI THAM VOWEL
SIGN AA>, so is clearly not a final consonant.

U+1A5D occurs in Northern Thai principally in one word, whose
pronunciation is roughly /kɔbɔː/. U+1A5D is not Consonant_Final in its
phonetic effect. The word is a compound word (or perhaps just a visual
compound), formed by chaining two syllables and striking out the
duplicated characters. I have a text in which the constituents are to
be encoded <U+1A20 TAI THAM LETTER HIGH KA, U+1A74 TAI THAM SIGN MAI
KANG> and <U+1A37 TAI THAM LETTER BA, U+1A74, U+1A75 TAI THAM SIGN
TONE-1>, so the chained word may reasonably be encoded <U+1A20,
U+1A74, U+1A5D, U+1A75> or <U+1A20, U+1A5D, U+1A74, U+1A75>.
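
Whether the two candidate encodings amount to the same text for
processing purposes depends on the canonical combining classes of the
marks involved. The sketch below builds both sequences and compares
their NFD forms, so that the question can be checked against whatever
Unicode data a given Python ships with. The variable names and the
helper canonically_equivalent are hypothetical.

```python
import unicodedata

HIGH_KA, BA = "\u1A20", "\u1A37"
MAI_KANG, SIGN_BA, TONE_1 = "\u1A74", "\u1A5D", "\u1A75"

# The two constituent syllables as encoded in the text described above.
first = HIGH_KA + MAI_KANG
second = BA + MAI_KANG + TONE_1

# The two candidate encodings of the chained word.
option_a = HIGH_KA + MAI_KANG + SIGN_BA + TONE_1
option_b = HIGH_KA + SIGN_BA + MAI_KANG + TONE_1

def canonically_equivalent(s, t):
    """Compare two strings after canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", s) == unicodedata.normalize("NFD", t)

# Canonical combining class of each character in the word.
ccc = {f"U+{ord(c):04X}": unicodedata.combining(c) for c in option_a}
print(ccc)
print(canonically_equivalent(option_a, option_b))
```

If the comparison reports that the sequences are not equivalent, a
search or collation process would treat the two spellings as different
strings, so a choice between them would need to be agreed.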

While all my examples of U+1A5E are word final, it seems to differ from
<U+1A60, U+1A48 TAI THAM LETTER HIGH SA> on the basis of the room
available for it. Both forms are used as a word final consonant. The
only Pali consonant cluster ending in /s/ is /ss/, and that is written
using U+1A54 TAI THAM LETTER GREAT SA, so a non-final <s> will be
rare. (I'm finding /ks/ written with U+1A47 TAI THAM LETTER HIGH SSA
due to the application of RUKI.) However, I feel it would be rash to
presume that every example of U+1A5E will be a final consonant.

I have one new Consonant_Final:

0EDF ; Consonant_Final # Lo LAO LETTER KHMU NYO (was Other)

See UTC document L2/10-335 for evidence. I have already submitted this
omission as formal feedback.

I have one possible new Vowel_Dependent:

1A7B ; Vowel_Dependent # Mn TAI THAM SIGN MAI SAM

The value of its Indic_Matra_Category should be recorded as Top. I
suspect renderers need to apply rearrangement rules to this mark, but I
haven't experimented with other techniques yet.

U+1A7B is principally a repetition mark, indicating the repetition of
a word. As extensions of this role, it can also do at least the
following:
(1) Indicate a repeated (not geminate) consonant
(2) Indicate an omitted implicit vowel (one omits an implicit vowel
by replacing it with U+1A60)
(3) Indicate an epenthetic vowel (extension of Role 2).

In rôles (2) and (3), it serves as a dependent vowel.

Richard.

Received on Sat May 17 2014 - 05:58:59 CDT
