Re: Specification of Encoding of Plain Text

From: Richard Wordingham
Date: Fri, 13 Jan 2017 17:47:24 +0000

On Fri, 13 Jan 2017 01:34:48 -0800
Asmus Freytag wrote:

> I believe that any attempt to define a "regex" that describes *all
> legal text* in a given script is a-priori doomed to failure.
> Part of the problem is that writing systems work not unlike human
> grammars in a curious mixture of pretty firm rules coupled to lists
> of exceptions. (Many texts by competent authors will contain
> "ungrammatical" sentences that somehow work despite or because of not
> following the standard rules). The Khmer issue that started the
> discussion showed that there can be a single word that needs to be
> handled exceptionally.

It's a single word in the *current* orthography for the Khmer language
in Cambodia. According to Michel Antelme, on pp. 20-21 of "Inventaire
provisoire des caractères et divers signes des écritures khmères
pré-modernes et modernes employés pour la notation du khmer, du
siamois, des dialectes thaïs méridionaux, du sanskrit et du pāli",
this manner
of writing was much commoner until it was largely eliminated by a
spelling reform in the first half of the 20th century. The Thai
Wikipedia page on the use of the script for Thai
gives examples for final
consonants with COENG VO (លែ្វ = แล้ว), COENG NO (បេ្ន = เป็น) and
COENG NGO (ទ័្ង​ = ทั้ง).

> If you try to capture all the exceptions in the general rules, the
> set of rules gets complicated, but is also likely to be way too
> permissive to be useful.

If it is checking for proper use of code points, overgeneration is far
preferable to undergeneration.
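
As a rough illustration of that trade-off, the sketch below compares two
toy patterns in Python. Both are hypothetical and purely illustrative;
neither is taken from the Unicode Standard or from any LGR, and the
character classes are deliberately simplified. STRICT allows COENG +
consonant only directly after a consonant. PERMISSIVE is one assumed way
of loosening it, additionally allowing a single COENG + consonant pair
after the vowel or sign.

```python
import re

# Simplified Khmer character classes (an illustrative assumption,
# not a complete model of the script).
CONS = "[\u1780-\u17A2]"    # consonants KA..QA
VOWEL = "[\u17B6-\u17C5]"   # dependent vowel signs
SIGN = "[\u17C6-\u17D1]"    # other signs, including SAMYOK SANNYA
COENG = "\u17D2"

# One orthographic cluster: base, subscripts, optional vowel, optional sign.
CLUSTER = f"{CONS}(?:{COENG}{CONS})*{VOWEL}?{SIGN}?"

# Hypothetical strict rule: subscripts only before the vowel or sign.
STRICT = re.compile(f"(?:{CLUSTER})+")

# Hypothetical permissive rule: also one subscript after the vowel or sign.
PERMISSIVE = re.compile(f"(?:{CLUSTER}(?:{COENG}{CONS})?)+")

# The three final-consonant spellings cited above, as code points.
EXAMPLES = [
    "\u179B\u17C2\u17D2\u179C",  # LO, AE, COENG VO
    "\u1794\u17C1\u17D2\u1793",  # BA, E, COENG NO
    "\u1791\u17D0\u17D2\u1784",  # TO, SAMYOK SANNYA, COENG NGO
]

def accepted(pattern, text):
    """Return True if the whole text matches the pattern."""
    return pattern.fullmatch(text) is not None

for word in EXAMPLES:
    print(accepted(STRICT, word), accepted(PERMISSIVE, word))
```

The strict pattern undergenerates: it rejects all three cited spellings.
The permissive pattern accepts them, at the cost of also accepting
sequences that nobody has cited, such as KA, AA, COENG KA.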

> The Khmer LGR for the Root Zone, for example, deliberately disallows
> the exception (in the word for "give") so that it can be stated (a)
> more compactly and (b) does not allow the exceptional sequencing of
> certain characters to become applicable outside the single exception.
> An LGR is concerned with *single* instances of each word. Even the
> most common word in a language can only be registered once in each
> zone.

A label does not have to be a single word. For example, there are
several, if not many, domain names matching give*.com, where the first
element is clearly the word 'give'.

> Even if the BNFs did nothing more than capture succinctly the
> information presented in text and tables, they would be useful.

> For scripts where things like ZWJ and CGJ are optional, it doesn't
> make sense to run them into the standard BNF - that just messes
> things up. It is much more useful to provide generic context
> information of how to add them to existing text.

> For example, the CGJ is really intended to go between letters. So,
> describe that context.

It can be quite useful next to combining marks. For example, it may be
used to distinguish a diaeresis from an umlaut mark in Fraktur.
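
A minimal Python sketch of why such a distinction can survive in plain
text: CGJ (U+034F) has combining class 0, so it blocks canonical
composition and the marked sequence stays distinct under NFC. Which of
the two readings carries the CGJ is an assumption made here only for
illustration; the variable names are hypothetical.

```python
import unicodedata

CGJ = "\u034F"        # COMBINING GRAPHEME JOINER
DIAERESIS = "\u0308"  # COMBINING DIAERESIS

# Assumed convention for this sketch: the unmarked sequence is one
# reading of the two dots, the CGJ-marked sequence is the other.
plain = "a" + DIAERESIS
marked = "a" + CGJ + DIAERESIS

def nfc(text):
    """Normalize text to NFC."""
    return unicodedata.normalize("NFC", text)

# The plain sequence composes to a single precomposed letter;
# the marked sequence is left unchanged because CGJ intervenes.
print([hex(ord(c)) for c in nfc(plain)])
print([hex(ord(c)) for c in nfc(marked)])
```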

Received on Fri Jan 13 2017 - 11:47:56 CST
