Re: Standardised Encoding of Text

From: Mark Davis ☕️ <mark_at_macchiato.com>
Date: Sun, 9 Aug 2015 17:10:01 +0200

While it would be good to document more scripts, and more language options
per script, that is always subject to getting experts signed up to develop
them.

What I'd really like to see instead of documentation is a data-based
approach.

For example, perhaps the addition of real data to CLDR for a
"basic-validity-check" on a language-by-language basis. It might be
possible to use a BNF grammar for the components, for which we are already
set up. For example, something like (this was a quick and dirty
transcription):

$word := $syllable+;
$syllable := $B [$R $C] ($S $R?)* ($Z? $V)? $O? $S?;
# UnicodeSets
$R := [\u17CC];
$C := [<consonant shifter>];
$S := [<subscript consonant><independent vowel sign>];
$V := [<dependent vowel sign>];
$Z := [:joiner:];
$O := [...];
$B := [[:sc=khmer:]&[:L:]-$R-$C-$S-$V-$Z-$O];

The more these could use existing properties, like
Indic_Positional_Category or Indic_Syllabic_Category, the better.

Doing this would have far more of an impact than just a textual
description, in that it could be executed by code, at least for a
reference implementation.
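
As a rough illustration of what "executed by code" could look like, here
is a minimal Python sketch that turns the grammar above into a regular
expression. Everything concrete in it is an assumption made for the
example: the code point ranges standing in for the placeholder sets, the
reading of $S as the coeng sign followed by a letter, the reading of
[R C] as an optional choice, and the function name is_basic_valid. It is
not CLDR data and has not been checked by a Khmer expert.

```python
import re

# Hypothetical stand-ins for the sets left open in the grammar.
# These ranges are assumptions for this sketch, not vetted data.
R = "\u17CC"                                    # robat, as in the original $R
C = "[\u17C9\u17CA]"                            # assumed consonant shifters
S = "\u17D2[\u1780-\u17B3]"                     # assumed: coeng + consonant or independent vowel
V = "[\u17B6-\u17C5]"                           # assumed dependent vowel signs
Z = "[\u200C\u200D]"                            # assumed joiners (ZWNJ, ZWJ)
O = "[\u17C6-\u17C8\u17CB\u17CD-\u17D1\u17DD]"  # assumed "other" signs
B = "[\u1780-\u17B3]"                           # assumed bases: consonants, independent vowels

# $syllable := $B [$R $C] ($S $R?)* ($Z? $V)? $O? $S?
# [R C] is treated here as an optional choice of R or C (an assumption).
SYLLABLE = f"{B}(?:{R}|{C})?(?:{S}{R}?)*(?:{Z}?{V})?{O}?(?:{S})?"

# $word := $syllable+
WORD = re.compile(f"(?:{SYLLABLE})+")


def is_basic_valid(word: str) -> bool:
    """Return True if the whole string parses as one or more syllables."""
    return WORD.fullmatch(word) is not None


if __name__ == "__main__":
    samples = [
        "\u1781\u17D2\u1798\u17C2\u179A",  # base, coeng + consonant, vowel, base
        "\u17C2\u1781",                    # dependent vowel with no preceding base
        "\u1781\u17C2\u17C2",              # two dependent vowels on one base
    ]
    for sample in samples:
        print([f"U+{ord(ch):04X}" for ch in sample], is_basic_valid(sample))
```

A real version would presumably take the sets from Unicode properties
rather than hard-coded ranges; Python's built-in re module has no
property syntax, which is the only reason explicit ranges are used here.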

Mark <https://google.com/+MarkDavis>

*— Il meglio è l’inimico del bene —*

On Sun, Aug 9, 2015 at 3:58 PM, Richard Wordingham <
richard.wordingham_at_ntlworld.com> wrote:

> On Sun, 9 Aug 2015 14:46:31 +0300
> "Erkki I Kolehmainen" <eik_at_iki.fi> wrote:
>
> > Sorry, but I find myself having a serious problem in understanding
> > what this is about.
>
> In some cases the TUS lays down in detail the order of characters and
> their interpretation. While Europeans have canonical combining classes
> to standardise the order of combining marks, lesser breeds tend not to
> receive them. It gets even worse when combining marks are defined by
> the combination of control character(s) and what appears to be a base
> character. For example, the order for the Khmer script was laid
> down in great detail. Similarly, the order for Burmese was laid out in
> great detail. However, as support for other languages was added to
> the 'Myanmar' script, the ordering rules to cover the new characters
> were not promptly laid down.
>
> So the question is, how does one rectify the situation where the text
> in the Unicode Standard for a script is woefully inadequate?
>
> Richard.
>
Received on Sun Aug 09 2015 - 10:11:42 CDT
