Re: Specification of Encoding of Plain Text

From: Mark Davis ☕️ <mark_at_macchiato.com>
Date: Thu, 12 Jan 2017 21:03:29 +0100

That was just an example, off the top of my head, of the format for use
with regex; I don't pretend that it is vetted. Latin is not a complex
script, so it was only an illustration. However, it was just brain freeze
on my part to not also include Inherited or ZWJ. A more serious effort
would look at some of the issues from http://unicode.org/reports/tr29/, for
example. On the other hand, CGJ is not a problem: it is Mn
<http://unicode.org/cldr/utility/character.jsp?a=034F>. And (say) U+064B
ARABIC FATHATAN has scx=Arabic,Syriac, so wouldn't be included.

Mark
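
[Editorial sketch, not part of the original message.] The General_Category
claims above can be checked with Python's standard unicodedata module. The
module does not expose Script or Script_Extensions, so this only looks at gc;
the labels and the helper name general_categories are illustrative, not from
the thread.

```python
import unicodedata

# Characters discussed in the message, keyed by an informal label.
CHARS = {
    "CGJ": "\u034F",       # COMBINING GRAPHEME JOINER
    "ZWJ": "\u200D",       # ZERO WIDTH JOINER
    "FATHATAN": "\u064B",  # ARABIC FATHATAN
}

def general_categories(chars):
    """Map each label to the General_Category of its character."""
    return {label: unicodedata.category(ch) for label, ch in chars.items()}

if __name__ == "__main__":
    for label, gc in general_categories(CHARS).items():
        print(label, gc)
```

CGJ and FATHATAN come out as Mn, while ZWJ is Cf, which is why a class
restricted to [:Mn:] can never pick up ZWJ and it has to be listed separately.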

On Thu, Jan 12, 2017 at 7:42 PM, Richard Wordingham <
richard.wordingham_at_ntlworld.com> wrote:

> On Thu, 12 Jan 2017 14:12:09 +0100
> Mark Davis ☕️ <mark_at_macchiato.com> wrote:
>
> > I agree that comprehension is a goal. I'd imagine using a BNF regex,
> > like the following. This is simple, since I'm just doing Latin, but
> > you can see what I mean.
>
> > word = base* ;
> > base = (latinLetter latinMn*) ;
> > latinLetter = [[:scx=Latn:]&[:L:]] ;
> > latinMn = [[:scx=Latn:][:scx=Common:]&[:Mn:]] ;
> >
> > which turns into the single regex expression:
> >
> > ([[:scx=Latn:]&[:L:]][[:scx=Latn:][:scx=Common:]&[:Mn:]]*)*
>
> Ouch! That's alarmingly wrong. You've excluded the likes of
> English 'Ca‍esar' with ZWJ, Welsh 'Llan͏gollen' with CGJ (the word
> doesn't contain the letter 'ng') and the ISO-sanctioned transliteration
> of Thai SO SUEA as 's̄'. Fixing it isn't easy. At least, I assume
> Arabic harakat don't attach to Latin letters in your conception of
> Latin script text, so replacing 'scx=Common' by 'sc=Inherited' doesn't
> work well.
>
> The problem may be conflicting requirements on the Script_Extensions
> property.
>
> Richard.
>
>
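
[Editorial sketch, not part of the original messages.] The step from the
quoted BNF-style rules to the single regex expression is plain textual
substitution. A minimal illustration, assuming the rules are non-recursive
and that each rule has the form "name = body ;"; the helper names parse_rules
and expand are hypothetical. It only rewrites text and does not evaluate the
resulting UnicodeSet-style expression.

```python
import re

# The four rules exactly as quoted in the thread.
GRAMMAR = """
word = base* ;
base = (latinLetter latinMn*) ;
latinLetter = [[:scx=Latn:]&[:L:]] ;
latinMn = [[:scx=Latn:][:scx=Common:]&[:Mn:]] ;
"""

def parse_rules(text):
    """Split 'name = body ;' statements into a dict of name -> body."""
    rules = {}
    for statement in text.split(";"):
        if "=" not in statement:
            continue  # skip the empty tail after the last ';'
        name, body = statement.split("=", 1)  # split at the first '=' only
        rules[name.strip()] = body.strip()
    return rules

def expand(rules, start):
    """Inline rule names until none remain (assumes no recursive rules)."""
    names = re.compile(r"\b(" + "|".join(map(re.escape, rules)) + r")\b")
    expr = rules[start]
    while names.search(expr):
        expr = names.sub(lambda m: rules[m.group(1)], expr)
    # The grammar uses spaces only as separators, so drop them at the end.
    return re.sub(r"\s+", "", expr)

if __name__ == "__main__":
    print(expand(parse_rules(GRAMMAR), "word"))
```

Expanding from "word" reproduces the single expression quoted in the thread,
so any fix (for example, adding Inherited or ZWJ) can be made in one named
rule and re-expanded.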
Received on Thu Jan 12 2017 - 14:04:21 CST
