Re: Specification of Encoding of Plain Text

From: Asmus Freytag <asmusf_at_ix.netcom.com>
Date: Tue, 10 Jan 2017 00:06:05 -0800
On 1/9/2017 2:24 PM, Richard Wordingham wrote:
Where, if anywhere, is the encoding of plain text specified?  I am
particularly concerned with the arrangement of the code sequences for
non-spacing abstract characters once one has determined an encoding for
the abstract characters.

For example, a naive reading of TUS 9.0 Section 16.4 Subsection
"Ordering of Syllable Components" would lead one to believe that the
word _khnyom_ 'I' shall be encoded as <U+1781 KHMER LETTER KHA,
U+17D2 KHMER SIGN COENG, U+1789 KHMER LETTER NYO, U+17BB KHMER VOWEL
SIGN U, U+17C6 KHMER SIGN NIKAHIT>. 
Richard,

the group of Khmer experts that developed the recent label generation rules for root zone domain names considers that ordering the only one supported. You can find their specification here: https://www.icann.org/en/system/files/files/proposal-khmer-lgr-15aug16-en.pdf

That document states:

7.4 Context of COENG Sign (U+17D2)
The sign ្ KHMER SIGN COENG (U+17D2) used for subscripting consonants must occur between two consonants. If it occurs between any other categories, it is not in a valid context so the label is not well formed. Further, the consonant following it must not include ឡ KHMER LETTER LA (U+17A1), ...

So, you are not alone in thinking that the COENG goes between consonants. 
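As a rough illustration of rule 7.4 as quoted above, the check can be sketched in Python. This is not the LGR's own implementation. The consonant range U+1780..U+17A2 is an assumption standing in for the LGR's consonant category, and only the one exclusion visible in the quoted text (U+17A1) is modelled, because the quoted rule is truncated. The function name `coeng_context_ok` is made up for this sketch.

```python
# Sketch of the quoted rule 7.4: COENG must sit directly between two consonants.
COENG = "\u17D2"
LA = "\u17A1"

def is_consonant(ch: str) -> bool:
    # Assumption: U+1780..U+17A2 approximates the LGR's consonant category.
    return "\u1780" <= ch <= "\u17A2"

def coeng_context_ok(label: str) -> bool:
    """Return True if every COENG in label has a consonant on both sides."""
    for i, ch in enumerate(label):
        if ch != COENG:
            continue
        if i == 0 or i + 1 >= len(label):
            return False  # COENG at the very start or end has no valid context
        before, after = label[i - 1], label[i + 1]
        if not (is_consonant(before) and is_consonant(after)):
            return False
        if after == LA:
            # The quoted rule lists further exclusions after LA; they are
            # elided in the quote and therefore not modelled here.
            return False
    return True

# khnyom in the order read from TUS 9.0 Section 16.4
khnyom = "\u1781\u17D2\u1789\u17BB\u17C6"
print(coeng_context_ok(khnyom))
```

A validator for general text would need to be looser than this, since the LGR rules are deliberately restrictive.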

Did they just make this up? No, they followed what is laid out in the standard:

On page 621 of Unicode 9.0.0 (http://www.unicode.org/versions/Unicode9.0.0/ch16.pdf), you find:

Subscript Consonants. Subscript consonant signs differ from independent consonant
characters and are called coeng (literally, “foot, leg”) after their subscript position. While a
consonant character can constitute an orthographic syllable by itself, a subscript consonant
sign cannot. Note that U+17A1 ឡ khmer letter la does not have a corresponding subscript
consonant sign in standard Khmer.... Subscript consonant signs are used to represent any
consonant following the first consonant in an orthographic syllable.

and on page 624:

.... each of these [subscript consonant] signs is represented by the sequence of two characters: a
special control character (U+17D2 khmer sign coeng) and a corresponding consonant
character.

That text fixes the order MAIN CONSONANT + COENG OPERATOR + SUBSCRIPT CONSONANT
with sufficient clarity (as do all the examples and tables).
 

 However, on further investigation,
I cannot find any text that says that <U+1781, U+17C6, U+17D2, U+1789,
U+17BB> would not be compliant with the Unicode standard.  Have I
missed anything?

In this example, your coeng operator U+17D2 is out of order: while it is followed by
a consonant, it does not in turn immediately follow the main consonant, because a
NIKAHIT sign is inserted between them.
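To make the difference between the two orderings concrete, a small Python sketch can list the character names and report what immediately precedes the coeng in each. The variable names `documented` and `alternative` and both helper functions are labels invented for this illustration; the code points are the ones given in this thread.

```python
import unicodedata

COENG = "\u17D2"

# The two orderings discussed in this thread.
documented = "\u1781\u17D2\u1789\u17BB\u17C6"
alternative = "\u1781\u17C6\u17D2\u1789\u17BB"

def describe(seq: str) -> list[str]:
    """Return 'U+XXXX NAME' for each character in seq."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in seq]

def before_coeng(seq: str) -> list[str]:
    """Return the name of the character directly preceding each COENG."""
    return [
        unicodedata.name(seq[i - 1])
        for i, ch in enumerate(seq)
        if ch == COENG and i > 0
    ]

for label, seq in (("documented", documented), ("alternative", alternative)):
    print(label)
    for line in describe(seq):
        print("  " + line)
    print("  before COENG:", before_coeng(seq))
```

In the first sequence the coeng follows KHMER LETTER KHA; in the second it follows KHMER SIGN NIKAHIT.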

Again, from the Root Zone LGR document we find an explicit rule:

7.10 Context of NIKAHIT SIGN (U+17C6)
The sign ំ KHMER SIGN NIKAHIT (U+17C6) can only be preceded by a consonant or a shifter or one of the subset of dependent vowels tagged “dependent-vowel-1” in the repertoire table (ា ុ), i.e. vowel signs AA and U.

That would allow the NIKAHIT to be placed where you suggest, if it were not for the
rule on the coeng operator (7.4).
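A minimal Python sketch of rule 7.10 taken in isolation shows why it does not settle the question on its own: both orderings from this thread satisfy it. The consonant range U+1780..U+17A2 and the shifter set (U+17C9, U+17CA) are assumptions about how the LGR's categories map to code points. The vowel signs AA (U+17B6) and U (U+17BB) come from the quoted rule. The function name is hypothetical.

```python
# Sketch of the quoted rule 7.10: what may directly precede NIKAHIT.
NIKAHIT = "\u17C6"
SHIFTERS = {"\u17C9", "\u17CA"}            # assumption: MUUSIKATOAN, TRIISAP
DEPENDENT_VOWEL_1 = {"\u17B6", "\u17BB"}   # vowel signs AA and U (from the quote)

def is_consonant(ch: str) -> bool:
    # Assumption: U+1780..U+17A2 approximates the LGR's consonant category.
    return "\u1780" <= ch <= "\u17A2"

def nikahit_context_ok(label: str) -> bool:
    """Return True if every NIKAHIT is preceded by an allowed character."""
    for i, ch in enumerate(label):
        if ch != NIKAHIT:
            continue
        if i == 0:
            return False  # nothing precedes it
        prev = label[i - 1]
        if not (is_consonant(prev) or prev in SHIFTERS or prev in DEPENDENT_VOWEL_1):
            return False
    return True

documented = "\u1781\u17D2\u1789\u17BB\u17C6"   # NIKAHIT after VOWEL SIGN U
alternative = "\u1781\u17C6\u17D2\u1789\u17BB"  # NIKAHIT after LETTER KHA

print(nikahit_context_ok(documented), nikahit_context_ok(alternative))
```

Since both sequences pass this check, it is the coeng rule (7.4) that excludes the second ordering under the LGR.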

Now, it is a known fact that the label generation rules are slightly more restrictive than the rules for general text. (See also section 5 in that document).

See the text on p. 622 in TUS 9.0.0 where the following exception is noted:

"The subscript consonant signs in the Khmer script can be used to denote a final consonant,
although this practice is uncommon."

The associated example shows MAIN CONSONANT + VOWEL + NIKAHIT + COENG + FINAL CONSONANT.

Another exception that is noted on p. 623 is the following:

"While these subscript consonant signs are usually attached to a consonant character, they
can also be attached to an independent vowel character. Although this practice is relatively
rare, it is used in one very common word, meaning “to give.”"

Taken together, it would appear that, unless your example fits the first of these two exceptions,
the NIKAHIT in it is out of order.

(The label generation rules disallow both of these exceptions
in an attempt to streamline the rules, sacrificing a number of potential domain names. Equivalent
rule sets for validating text would have to be more complete.)

One might hope that the subsection about 'logical order' in TUS 9.0
Section 2.2 Unicode Design Principles would help, but:

1) Section 3 'Conformance' says nothing about logical order; and
2) The subsection about 'logical order' seems to assume that there
exists a common practice; it does not actually place any requirement
on this common practice. 

Richard.


I don't think either of these general sections is intended to provide the correct
or expected ordering of characters for complex scripts. Any preferred ordering that
doesn't result by happenstance from normalization would presumably be described
in the text of the script section, such as Section 16.4 Khmer, in TUS 9.0.0.

http://www.unicode.org/versions/Unicode9.0.0/ch16.pdf
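The point about normalization can be checked with a short Python sketch, assuming the Unicode data bundled with the interpreter's `unicodedata` module is representative. It normalizes both orderings from this thread under each normalization form and prints the canonical combining class of each character. The helper name `same_after_normalization` is invented for this illustration.

```python
import unicodedata

documented = "\u1781\u17D2\u1789\u17BB\u17C6"
alternative = "\u1781\u17C6\u17D2\u1789\u17BB"

def same_after_normalization(a: str, b: str, form: str = "NFC") -> bool:
    """Return True if a and b become identical under the given form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

for form in ("NFC", "NFD", "NFKC", "NFKD"):
    for name, seq in (("documented", documented), ("alternative", alternative)):
        unchanged = unicodedata.normalize(form, seq) == seq
        print(f"{form} {name}: unchanged={unchanged}")
    print(f"{form} equal after normalization:",
          same_after_normalization(documented, alternative, form))

# Canonical combining classes, which drive canonical reordering.
for ch in documented:
    print(f"U+{ord(ch):04X} ccc={unicodedata.combining(ch)}")
```

If normalization leaves both sequences as they are and does not make them equal, then any preferred order between them has to come from the script description rather than from normalization.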

A./


Received on Tue Jan 10 2017 - 02:06:49 CST
