Re: Specification of Encoding of Plain Text

From: Asmus Freytag <>
Date: Tue, 10 Jan 2017 08:59:37 -0800
These kinds of regexes are being developed in various contexts.

For example, there's a group developing regexes for Indic scripts for use with CSS. That effort focuses on the syllable, not least because concepts like "first-letter" used in CSS are not relevant to those scripts.

Then there is ICANN's root zone label generation rules project. That effort focuses on labels, not words. The difference is that it is acceptable for rules on labels to be slightly more restrictive (that is, to underproduce), while on the other hand labels are not limited to actual words, so the rules also overproduce. For Unicode's purposes, such overproduction is not necessarily harmful, because ordinary text can contain words as well as domain names. However, any underproduction would have to be remedied, no matter how complex it would make the rule system.

As a matter of practical experience, being involved in the ICANN project mentioned, we've discovered that breaking up these rules makes them easier to understand and handle.

We are finding that the majority of rules can be expressed as left-context for a given character (example: X must be preceded by any of ...).
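
To make the shape of such a rule concrete, here is a minimal sketch in Python. The rule table, the placeholder characters and the function name are all hypothetical and are not taken from any actual label generation rule set; the point is only that a left-context rule needs nothing more than the immediately preceding character.

```python
from typing import Optional

# Hypothetical rule table: each constrained character maps to the set of
# characters allowed immediately before it. The symbols are placeholders.
LEFT_CONTEXT = {
    "x": {"a", "b"},  # "x must be preceded by any of a, b"
    "y": {"x"},       # "y must be preceded by x"
}

def first_violation(text: str) -> Optional[int]:
    """Return the index of the first character whose left context is
    invalid, or None if every constrained character satisfies its rule."""
    for i, ch in enumerate(text):
        allowed = LEFT_CONTEXT.get(ch)
        if allowed is None:
            continue  # unconstrained character, nothing to check
        if i == 0 or text[i - 1] not in allowed:
            return i
    return None
```

Each rule is checked independently of the others, which is what makes a rule set broken up this way easy to read and to extend.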

We find that right contexts or dual contexts are less often needed; however, there are some constructs that occur in syllable-final or word-final position only, and modeling those contexts is more complex.

However, the main motivation for having context rules as part of label generation rules is to prevent characters from occurring in contexts where rendering engines may not be able to deal with them, or, alternatively, to eliminate potential alternate orderings that are intended to mean the same thing.

We find that for those two purposes, trying to model the full syllable is practically never required.


On 1/10/2017 1:11 AM, Mark Davis ☕️ wrote:
What I really wish we had would be a machine readable set of regexes for each complex script (and for each language-script combination that is different than the default for that script).

Such a regex R could be used for determining the well-formed ordering of code points within words. The regex need not be for syllables, or grapheme clusters, or any other formal construct. The only requirement it would need to fulfill is that you could determine well-formed words with:

word := (R)+

That is, if R were (C V C? | V C?) then any of CVC CVCVC VC V CV would pass the test, but CCV would fail. Ideally R would be as simple as possible (but no simpler).
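
As a rough sketch of that test in Python: here the abstract symbols C and V are treated as the literal letters "C" and "V", which is an assumption made purely for illustration. A real R would use character classes for the script in question, and the names `R`, `WORD` and `is_well_formed` are hypothetical.

```python
import re

# R from the message above: (C V C? | V C?), over the abstract symbols C, V.
# "C" and "V" are literal letters here, for illustration only.
R = r"(?:CVC?|VC?)"
WORD = re.compile(rf"(?:{R})+")  # word := (R)+

def is_well_formed(word: str) -> bool:
    """Return True if the entire string matches word := (R)+."""
    return WORD.fullmatch(word) is not None

for sample in ["CVC", "CVCVC", "VC", "V", "CV", "CCV"]:
    print(sample, is_well_formed(sample))
```

Using `fullmatch` rather than `match` matters: the whole word has to be consumed by repetitions of R, so CCV is rejected instead of being accepted on a partial match.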


On Tue, Jan 10, 2017 at 9:06 AM, Asmus Freytag <> wrote:
On 1/9/2017 2:24 PM, Richard Wordingham wrote:
Where, if anywhere, is the encoding of plain text specified?  I am
particularly concerned with the arrangement of the code sequences for
non-spacing abstract characters once one has determined an encoding for
the abstract characters.

For example, a naive reading of TUS 9.0 Section 16.4 Subsection
"Ordering of Syllable Components" would lead one to believe that the
word _khnyom_ 'I' shall be encoded as <U+1781 KHMER LETTER KHA,

The group of Khmer experts that developed the recent label generation rules for root zone domain names considers that ordering the only one supported; you can find their specification here:

That document states:

7.4 Context of COENG Sign (U+17D2)
The sign ្ KHMER SIGN COENG (U+17D2) used for subscripting consonants must occur between two consonants. If it occurs between any other categories, it is not in a valid context so the label is not well formed. Further, the consonant following it must not include ឡ KHMER LETTER LA (U+17A1), ...

So, you are not alone in thinking that the COENG goes between consonants. 

Did they just make this up? No, they followed what is laid out in the standard:

On page 621 of Unicode 9.0.0, you find:

Subscript Consonants. Subscript consonant signs differ from independent consonant
characters and are called coeng (literally, “foot, leg”) after their subscript position. While a
consonant character can constitute an orthographic syllable by itself, a subscript consonant
sign cannot. Note that U+17A1 ឡ khmer letter la does not have a corresponding subscript
consonant sign in standard Khmer.... Subscript consonant signs are used to represent any
consonant following the first consonant in an orthographic syllable.

and on page 624:

.... each of these [subscript consonant] signs is represented by the sequence of two characters: a
special control character (U+17D2 khmer sign coeng) and a corresponding consonant

with sufficient clarity (as do all the examples and tables).

 However, on further investigation,
I cannot find any text that says that <U+1781, U+17C6, U+17D2, U+1789,
U+17BB> would not be compliant with the Unicode standard.  Have I
missed anything?

In this example, your coeng operator U+17D2 is out of order: while it is followed by
a consonant, it does not in turn immediately follow the main consonant, because a
sign NIKAHIT is inserted between them.

Again, from the Root Zone LGR document we find an explicit rule:

7.10 Context of NIKAHIT SIGN (U+17C6)
The sign ំ KHMER SIGN NIKAHIT (U+17C6) can only be preceded by a consonant or a shifter or one of the subset of dependent vowels tagged “dependent-vowel-1” in the repertoire table (ា ុ), i.e. vowel signs AA and U.

That would allow the NIKAHIT to be placed where you suggest, if it were not for the
rule on the coeng operator (7.4).
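
To check that reading mechanically, here is a small sketch in Python that applies the two quoted rules to your sequence. It makes assumptions that are not taken from the LGR repertoire table itself: consonants are taken to be the range U+1780..U+17A2, the shifters are taken to be U+17C9 and U+17CA, and the further exclusions elided ("...") in the quoted text of 7.4 are not modeled. The function name `violations` is hypothetical.

```python
COENG = 0x17D2
NIKAHIT = 0x17C6
LA = 0x17A1

# Assumptions (not from the LGR repertoire table): consonants are
# U+1780..U+17A2 and the shifters are U+17C9 and U+17CA.
CONSONANTS = set(range(0x1780, 0x17A3))
SHIFTERS = {0x17C9, 0x17CA}
DEPENDENT_VOWEL_1 = {0x17B6, 0x17BB}  # vowel signs AA and U, per rule 7.10

def violations(cps):
    """Return (index, rule) pairs for breaches of rules 7.4 and 7.10 as quoted."""
    found = []
    for i, cp in enumerate(cps):
        prev = cps[i - 1] if i > 0 else None
        nxt = cps[i + 1] if i + 1 < len(cps) else None
        if cp == COENG:
            # 7.4: COENG must sit between two consonants, and the following
            # consonant must not be LA. Other exclusions are elided in the
            # quoted rule and therefore not checked here.
            if prev not in CONSONANTS or nxt not in CONSONANTS or nxt == LA:
                found.append((i, "7.4"))
        elif cp == NIKAHIT:
            # 7.10: NIKAHIT must follow a consonant, a shifter, or AA / U.
            if prev not in (CONSONANTS | SHIFTERS | DEPENDENT_VOWEL_1):
                found.append((i, "7.10"))
    return found

example = [0x1781, 0x17C6, 0x17D2, 0x1789, 0x17BB]
print(violations(example))  # [(2, '7.4')]
```

Under these assumptions the NIKAHIT itself passes 7.10, since it follows the consonant KHA; the only breach reported is the COENG at index 2, whose left neighbour is no longer a consonant.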

Now, it is a known fact that the label generation rules are slightly more restrictive than the rules for general text. (See also section 5 in that document).

See the text on p. 622 in TUS 9.0.0 where the following exception is noted:

"The subscript consonant signs in the Khmer script can be used to denote a final consonant,
although this practice is uncommon."


Another exception that is noted on p. 623 is the following:

"While these subscript consonant signs are usually attached to a consonant character, they
can also be attached to an independent vowel character. Although this practice is relatively
rare, it is used in one very common word, meaning “to give.”"

Taken together, it would appear that, unless your example fits the first of these two exceptions,
the NIKAHIT in it is out of order.

(The label generation rules disallow both of these exceptions,
in an attempt to streamline the rules, sacrificing a number of potential domain names. Equivalent
rule sets for validating text would have to be more complete.)

One might hope that the subsection about 'logical order' in TUS 9.0
Section 2.2 Unicode Design Principles would help, but:

1) Section 3 'Conformance' says nothing about logical order; and
2) The subsection about 'logical order' seems to assume that there
exists a common practice; it does not actually place any requirement
on this common practice. 


I don't think either of these general sections is intended to provide the correct
or expected ordering of characters for complex scripts. Any preferred ordering that
doesn't result by happenstance from normalization would presumably be described
in the text of the script section, such as Section 16.4 Khmer, in TUS 9.0.0.


Received on Tue Jan 10 2017 - 11:00:19 CST
