Re: Standardised Encoding of Text

From: Richard Wordingham <richard.wordingham_at_ntlworld.com>
Date: Sun, 9 Aug 2015 22:03:37 +0100

On Sun, 9 Aug 2015 21:14:38 +0200
Mark Davis ☕️ <mark_at_macchiato.com> wrote:

> Mark <https://google.com/+MarkDavis>
>
> *— The best is the enemy of the good —*
>
> On Sun, Aug 9, 2015 at 7:10 PM, Richard Wordingham <
> richard.wordingham_at_ntlworld.com> wrote:
>
> > On Sun, 9 Aug 2015 17:10:01 +0200
> > Mark Davis ☕️ <mark_at_macchiato.com> wrote:

> > > For example, perhaps the addition of real data to CLDR for a
> > > "basic-validity-check" on a language-by-language basis.

> > CLDR is currently not useful. Are you really going to get Mayan
> > time formats when the script is encoded? Without them, there will
> > be no CLDR data.
 
> That is a misunderstanding. CLDR provides not only locale (language)
> specific data for formatting, collation, etc., but also data about
> languages. It is not limited to the first.

I'm basing my statement on the 'minimal data commitment' listed in
http://cldr.unicode.org/index/cldr-spec/minimaldata .

If there is a sustained failure to provide the four main date/time
formats, the locale may be removed.

> > > It might be
> > > possible to use a BNF grammar for the components, for which we are
> > > already set up.

> > Are you sure?

> I said "might be possible". That normally indicates that a degree of
> uncertainty. That is, "no, I'm not sure".

> There is no reason to be unnecessarily argumentative; it doesn't
> exactly encourage people to explore solutions to a problem.

I was responding to 'for which we are already set up'. The problem
is that canonical equivalence can make it very difficult to specify a
syntax. The text segmentation appendices suggest that you have already
hit trouble with canonical equivalence; I suspect you have tools set up
to prevent such problems from recurring.
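
To spell that out, here is a minimal Python sketch using only the
standard unicodedata module (dot below, U+0323, has combining class
220; dot above, U+0307, has 230, so the two marks commute):

    import unicodedata

    s1 = 'q\u0323\u0307'   # q, dot below, dot above
    s2 = 'q\u0307\u0323'   # q, dot above, dot below

    # Different non-zero combining classes commute under canonical
    # equivalence: both orders encode the same text.
    assert unicodedata.normalize('NFD', s1) == unicodedata.normalize('NFD', s2)

A syntax defined over raw code point sequences must accept both
orders, or it will reject strings canonically equivalent to strings
it accepts.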

With a view to analysing the requirements of the USE (the Universal
Shaping Engine), I investigated the effects of canonical equivalence
on regular expressions. I eventually discovered the relevant
mathematical theory: it replaces strings by 'traces', which for our
purposes are fully decomposed character strings modulo canonical
equivalence. I found very little interest in the matter on this list.
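
For what it's worth, the trace of a string has an easy concrete
representative; in Python (the function names here are mine, purely
for illustration):

    import unicodedata

    def trace(s):
        # The NFD form is a canonical representative of the trace,
        # i.e. of the class of strings canonically equivalent to s.
        return unicodedata.normalize('NFD', s)

    def canonically_equivalent(s, t):
        # Two strings lie in the same trace iff their NFD forms match.
        return trace(s) == trace(t)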

I gave the example of the regular expression

[:InPC=Top:]*[:InPC=Bottom:]*

Converting that expression so that it also specifies the NFD
equivalents, in accordance with UTS #18 Version 17, Section 2.1, is
non-trivial, though it is doable. I have a feeling that some have
claimed that an expression like that is already in NFD.
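
A concrete instance, as a Python sketch: U+0E48 THAI CHARACTER MAI EK
has InPC=Top with combining class 107, while U+0E38 SARA U has
InPC=Bottom with combining class 103, so canonical reordering puts the
Bottom mark first:

    import unicodedata

    top = '\u0E48'      # THAI MAI EK: InPC=Top, ccc=107
    bottom = '\u0E38'   # THAI SARA U: InPC=Bottom, ccc=103

    s = top + bottom    # matches [:InPC=Top:]*[:InPC=Bottom:]*
    nfd = unicodedata.normalize('NFD', s)

    # Canonical reordering sorts ccc=103 before ccc=107, so the NFD
    # form has the Bottom mark first and no longer matches the
    # expression, although it is canonically equivalent to s.
    assert nfd == bottom + top

So the set of matching strings is not closed under canonical
equivalence, and in particular not every matching string is in NFD.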

> I don't think any algorithmic description would get all and only those
> strings that would be acceptable to writers of the language. What
> you'd end up with is a mechanism that had three values: clearly ok
> (eg, cat), clearly bogus (eg, a\u0308\u0308\u0308\u0308), and
> somewhere in between.

What have you got against 8th derivatives? -:)

You are looking at a different issue to me. One of the issues is
rather that, for a word of one syllable, a pair of non-commuting
combining marks should have only one order per meaning, appearance
and pronunciation. For non-Indic scripts, that is generally handled by
ensuring that different orders of non-commuting combining marks render
differently.
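
To make 'non-commuting' concrete, a minimal Python sketch (grave and
acute both carry combining class 230, so canonical reordering never
swaps them):

    import unicodedata

    s1 = 'a\u0300\u0301'   # a, grave, acute
    s2 = 'a\u0301\u0300'   # a, acute, grave

    # Marks with equal non-zero combining classes are never
    # reordered, so the two orders are distinct strings, not
    # canonically equivalent.
    assert unicodedata.normalize('NFD', s1) != unicodedata.normalize('NFD', s2)

In Latin script the two orders stack differently and so look
different; where they would look identical, a validity check would
want to admit only one of the orders.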

> If the goal for the script rules is to cover all languages customarily
> written with that script, one way to do that is to develop the
> language rules as they come, and make sure that the script rules are
> broadened if necessary for each language. But there is also utility
> to having the language rules, especially for high-frequency languages.

The language rules serve a different function. The sequence
"xxxxlttttuuupppp" is clearly not English, but it is a perfectly
acceptable string for sorting, searching and rendering.

Richard.