Re: Specification of Encoding of Plain Text from Mark Davis ☕️ on 2017-01-13 (Unicode Mail List Archive)

From: Mark Davis ☕️ <mark_at_macchiato.com>
Date: Fri, 13 Jan 2017 10:38:30 +0100

If you know of combining marks whose scx values should include Thai, please
let us know.

Also, by "Latin is not a complex script" I mean it in the narrow sense I
stated, where the goal is the ordering of characters. That is, nobody would
normally wonder whether 0.5, when expressed by a sequence with U+2044
FRACTION SLASH, should be written as the sequence <2, U+2044 FRACTION SLASH,
1>!
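
To make the ordering point concrete, here is a minimal Python sketch. It
uses only the standard unicodedata module and takes U+00BD VULGAR FRACTION
ONE HALF as a convenient test character (that choice is an assumption of
this illustration, not something from the thread). Its compatibility
decomposition spells out 0.5 with U+2044 in the unsurprising order:
numerator, slash, denominator.

```python
import unicodedata

# U+00BD VULGAR FRACTION ONE HALF is used here only as a handy test case;
# its compatibility decomposition contains U+2044 FRACTION SLASH.
half = "\u00bd"
decomposed = unicodedata.normalize("NFKD", half)

# List the code points and names in stored (logical) order.
code_points = [f"U+{ord(ch):04X}" for ch in decomposed]
names = [unicodedata.name(ch) for ch in decomposed]

for cp, name in zip(code_points, names):
    print(cp, name)
```

The sequence comes out as <1, U+2044 FRACTION SLASH, 2>, never <2, U+2044,
1>, which is the sense in which Latin text raises no ordering question.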

There will always be some edge cases, but the target is Tibetan or Myanmar,
not Latin or Cyrillic.

Mark

On Thu, Jan 12, 2017 at 10:26 PM, Richard Wordingham <
richard.wordingham_at_ntlworld.com> wrote:

> On Thu, 12 Jan 2017 21:03:29 +0100
> Mark Davis ☕️ <mark_at_macchiato.com> wrote:
>
> > That was just an example off the top of my head of the format for
> > using with regex; I don't pretend that it is vetted. Latin is not a
> > complex script, so it was only an illustration. However, it was just
> > brain freeze on my part to not also include Inherited or ZWJ. A more
> > serious effort would look at some of the issues from
> > http://unicode.org/reports/tr29/, for example. On the other hand, CGJ
> > is not a problem: it is Mn
> > <http://unicode.org/cldr/utility/character.jsp?a=034F>. And (say)
> > U+064B ARABIC FATHATAN has scx=Arabic,Syriac, so wouldn't be included.
>
> Ah, I had not appreciated that sc=Inherited does not imply
> scx=Inherited. Using Script_Extensions to document the international
> combining characters that are used, for example, with Thai bases could
> have all sorts of undesirable knock-on effects.
>
> Richard.
>
>
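
The sc/scx distinction discussed above can be sketched in Python. The
standard library does not expose Script or Script_Extensions, so the small
table below is hand-copied for illustration only: the scx value for U+064B
is the one quoted in the message, and the sc value is assumed from
Scripts.txt. The names SAMPLE, matches_sc and matches_scx are hypothetical
helpers, not part of any library. Real code would load Scripts.txt and
ScriptExtensions.txt from the Unicode Character Database instead.

```python
import unicodedata

# Hand-copied sample data (an assumption for illustration, not a full table).
# U+064B ARABIC FATHATAN: sc=Inherited, scx=Arabic,Syriac (per the message).
SAMPLE = {
    "\u064b": {"sc": "Inherited", "scx": {"Arabic", "Syriac"}},
}


def matches_sc(ch, script):
    """Emulate a Script=<script> test for characters listed in SAMPLE."""
    return SAMPLE[ch]["sc"] == script


def matches_scx(ch, script):
    """Emulate a Script_Extensions=<script> test for characters in SAMPLE."""
    return script in SAMPLE[ch]["scx"]


fathatan = "\u064b"
print(matches_sc(fathatan, "Inherited"))   # sc is Inherited
print(matches_scx(fathatan, "Inherited"))  # but scx does not contain Inherited
print(matches_scx(fathatan, "Arabic"))

# The General_Category claim for CGJ can be checked directly with the stdlib.
print(unicodedata.category("\u034f"))
```

A pattern built on scx=Inherited would therefore skip U+064B even though
its sc value is Inherited, while U+034F COMBINING GRAPHEME JOINER is still
caught by a General_Category=Mn test.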
Received on Fri Jan 13 2017 - 03:39:18 CST