Re: Specification of Encoding of Plain Text

From: Richard Wordingham <richard.wordingham_at_ntlworld.com>
Date: Fri, 13 Jan 2017 17:47:24 +0000

On Fri, 13 Jan 2017 01:34:48 -0800
Asmus Freytag <asmusf_at_ix.netcom.com> wrote:

> I believe that any attempt to define a "regex" that describes *all
> legal text* in a given script is a-priori doomed to failure.
>
> Part of the problem is that writing systems work not unlike human
> grammars in a curious mixture of pretty firm rules coupled to lists
> of exceptions. (Many texts by competent authors will contain
> "ungrammatical" sentences that somehow work despite or because of not
> following the standard rules). The Khmer issue that started the
> discussion showed that there can be a single word that needs to be
> handled exceptionally.

It's a single word in the *current* orthography for the Khmer language
in Cambodia. According to Michel Antelme, on pp. 20-21 of "Inventaire
provisoire des caractères et divers signes des écritures khmères
pré-modernes et modernes employés pour la notation du khmer, du
siamois, des dialectes thaïs méridionaux, du sanskrit et du pāli"
(http://aefek.free.fr/iso_album/antelme_bis.pdf), this manner
of writing was much commoner until it was largely eliminated by a
spelling reform in the first half of the 20th century. The Thai
Wikipedia page on the use of the script for Thai
(https://th.wikipedia.org/wiki/%e0%b8%ad%e0%b8%b1%e0%b8%81%e0%b8%a9%e0%b8%a3%e0%b8%82%e0%b8%ad%e0%b8%a1%e0%b9%84%e0%b8%97%e0) gives examples for final
consonants with COENG VO (លែ្វ = แล้ว), COENG NO (បេ្ន = เป็น) and
COENG NGO (ទ័្ង​ = ทั้ง).

> If you try to capture all the exceptions in the general rules, the
> set of rules gets complicated, but is also likely to be way too
> permissive to be useful.

If the rule set is checking for proper use of code points, overgeneration
is far preferable to undergeneration.
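
As a rough sketch of the difference, in Python: the two patterns below are
hypothetical and deliberately oversimplified (the character ranges lump all
dependent vowels and signs together, and this is not any published Khmer
syllable grammar). The "strict" pattern only allows COENG + consonant before
the vowels and signs; the "permissive" one lets them interleave freely.

```python
import re

# Assumed, simplified character classes (illustration only).
CONS = "[\u1780-\u17A2]"   # Khmer consonant letters KA..QA
COENG = "\u17D2"           # KHMER SIGN COENG
SIGN = "[\u17B6-\u17D1]"   # dependent vowels and other signs, lumped together

# Strict: subscript consonants must precede every vowel or sign.
STRICT = re.compile(f"{CONS}(?:{COENG}{CONS})*{SIGN}*")
# Permissive: subscript consonants and signs may come in any order.
PERMISSIVE = re.compile(f"{CONS}(?:{COENG}{CONS}|{SIGN})*")

samples = {
    "modern cluster (KHA, COENG MO, AE)": "\u1781\u17D2\u1798\u17C2",
    "final COENG VO (LO, AE, COENG VO)": "\u179B\u17C2\u17D2\u179C",
    "final COENG NO (BA, E, COENG NO)": "\u1794\u17C1\u17D2\u1793",
    "final COENG NGO (TO, SAMYOK SANNYA, COENG NGO)": "\u1791\u17D0\u17D2\u1784",
}

for label, text in samples.items():
    strict_ok = STRICT.fullmatch(text) is not None
    permissive_ok = PERMISSIVE.fullmatch(text) is not None
    print(f"{label}: strict={strict_ok} permissive={permissive_ok}")
```

The strict pattern rejects all three final-consonant spellings cited above,
although each is a proper use of the code points; the permissive pattern
accepts them at the cost of also accepting sequences nobody would write.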

> The Khmer LGR for the Root Zone, for example, deliberately disallows
> the exception (in the word for "give") so that it can be stated (a)
> more compactly and (b) does not allow the exceptional sequencing of
> certain characters to become applicable outside the single exception.
>
> An LGR is concerned with *single* instances of each word. Even the
> most common word in a language can only be registered once in each
> zone.

A label does not have to be a single word. For example, there are
several, if not many, domain names matching give*.com, where the first
element is clearly the word 'give'.

> Even if the BNFs did nothing more than capture succinctly the
> information presented in text and tables, they would be useful.

> For scripts where things like ZWJ and CGJ are optional, it doesn't
> make sense to run them into the standard BNF - that just messes
> things up. It is much more useful to provide generic context
> information of how to add them to existing text.

> For example, the CGJ is really intended to go between letters. So,
> describe that context.

It can be quite useful next to combining marks. For example, it may be
used to distinguish a diaeresis from an umlaut mark in Fraktur.
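
A minimal sketch of why that works, using only Python's standard unicodedata
module. Which of the two sequences stands for the umlaut and which for the
diaeresis is an assumption made here for illustration; the point is only that
the two sequences stay distinct under normalisation.

```python
import unicodedata

CGJ = "\u034F"        # COMBINING GRAPHEME JOINER
DIAERESIS = "\u0308"  # COMBINING DIAERESIS, the mark used in both cases

# Assumed convention, for illustration only:
umlaut_a = "a" + DIAERESIS        # plain sequence taken to mean umlaut
trema_a = "a" + CGJ + DIAERESIS   # CGJ-marked sequence taken to mean diaeresis


def nfc(text):
    """Return the NFC normalisation of text."""
    return unicodedata.normalize("NFC", text)


def code_points(text):
    """List the code points of text in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]


# CGJ has canonical combining class 0, so it is a starter and the
# following mark cannot compose with the base letter across it.
print("ccc of CGJ:", unicodedata.combining(CGJ))
print("NFC without CGJ:", code_points(nfc(umlaut_a)))
print("NFC with CGJ:   ", code_points(nfc(trema_a)))
```

Without the CGJ, NFC composes the pair to the single code point U+00E4; with
it, the three-code-point sequence is left unchanged, so the distinction
survives normalisation while the rendering can remain the same.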

Richard.
Received on Fri Jan 13 2017 - 11:47:56 CST
