Re: Ambiguity(?) in Sinhala named sequences from Asmus Freytag on 2016-10-17 (Unicode Mail List Archive)

From: Asmus Freytag <asmusf_at_ix.netcom.com>
Date: Mon, 17 Oct 2016 09:52:48 -0700

On 10/17/2016 7:58 AM, Martin Jansche wrote:

Thanks for the pointer to the 2011 version of SLS 1134. After reading that and discussing further with Cibu, here's a tentative proposal:

* The most logical[*] interpretation of the sequence 0DBB 0DCA 200D 0DBA is as Repaya+Ya. A standard (Unicode and/or SLS) should call this out explicitly. ([*] Logical: in other scripts, including Devanagari and Myanmar, similar modifiers that logically precede a letter are represented in this way, sometimes without ZWJ or with a different character in lieu of ZWJ. This interpretation also plays well alongside a hypothetical alternative encoding of Yansaya using a single codepoint.)

* A standard (Unicode and/or SLS) should specify how Ra+Yansaya should be encoded. SLS 1134 points out that Ra+Yansaya is an incorrect spelling, yet in order to make this point it has to show the glyph sequence for Ra+Yansaya. So there is clearly some need to be able to render this, even if it's only at this meta-linguistic level. Plus SLS 1134 is very explicit that e.g. keyboarding should allow for letter combinations to be entered even if they are not practically useful. One possible way of encoding Ra+Yansaya is 0DBB 200C 0DCA 200D 0DBA, i.e. Ra ZWNJ Yansaya. This renders as intended in HarfBuzz with NotoSansSinhala, but not with LBhashitaComplex. If we had a clear directive regarding how Ra+Yansaya should be represented, we could work on getting fonts updated.

There are some didactic needs that aren't directly catered to by the standard. That is as it should be, especially if you intend to show things that "shouldn't exist".

* Everything about 0DBB 0DCA 200D 0DBA also applies to 0DBB 0DCA 200D 0DBB. This is much less relevant in practice, but the same arguments about ambiguity apply and should be resolved in the same way.
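
The two encodings discussed in the bullets above can be compared with a minimal Python sketch. The variable and helper names (repaya_ya, ra_zwnj_yansaya, hex_codepoints, contains_repaya) are made up for illustration, and the ZWNJ form is only the tentative proposal from this thread, not something a standard specifies. The sketch only shows that the proposed ZWNJ form no longer contains the Repaya sequence 0DBB 0DCA 200D, while still containing the Yansaya sequence 0DCA 200D 0DBA.

```python
import unicodedata

# Individual code points named in the thread.
RA, YA = "\u0DBB", "\u0DBA"
AL_LAKUNA, ZWJ, ZWNJ = "\u0DCA", "\u200D", "\u200C"

# Named sequences quoted in the thread.
REPAYA = RA + AL_LAKUNA + ZWJ    # 0DBB 0DCA 200D
YANSAYA = AL_LAKUNA + ZWJ + YA   # 0DCA 200D 0DBA

# Proposed default reading: Repaya + Ya.
repaya_ya = REPAYA + YA                 # 0DBB 0DCA 200D 0DBA
# Tentative proposal for Ra + Yansaya (an assumption, not standardized).
ra_zwnj_yansaya = RA + ZWNJ + YANSAYA   # 0DBB 200C 0DCA 200D 0DBA


def hex_codepoints(text):
    """Format a string as space-separated four-digit hex code points."""
    return " ".join(f"{ord(ch):04X}" for ch in text)


def contains_repaya(text):
    """True if the Repaya code point sequence occurs anywhere in text."""
    return REPAYA in text


if __name__ == "__main__":
    for label, text in [("Repaya+Ya", repaya_ya),
                        ("Ra ZWNJ Yansaya", ra_zwnj_yansaya)]:
        print(label, "=", hex_codepoints(text),
              "| contains Repaya:", contains_repaya(text))
        for ch in text:
            print("   ", f"U+{ord(ch):04X}", unicodedata.name(ch))
```

Whether a given font and shaper actually render the two strings differently is a separate question that this check does not answer.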

Regards,

-- martin

On Mon, Oct 17, 2016 at 12:15 AM, Harshula <harshula@hj.id.au> wrote:

Hi Martin,

On 15/10/16 04:07, Martin Jansche wrote:
> For Sinhala, the following named sequences are defined (for good reasons):
>
> SINHALA CONSONANT SIGN YANSAYA;0DCA 200D 0DBA
> SINHALA CONSONANT SIGN RAKAARAANSAYA;0DCA 200D 0DBB
> SINHALA CONSONANT SIGN REPAYA;0DBB 0DCA 200D
>
> I'll abbreviate these as Yansaya, Rakaransaya, and Repaya, and I'll
> write Ya for 0DBA and Ra for 0DBB.
>
> Note that these give rise to two potentially ambiguous codepoint
> strings, namely
>
> 0DBB 0DCA 200D 0DBA
> 0DBB 0DCA 200D 0DBB
>
> I'll concentrate on the first, as all arguments apply to the second one
> analogously.
>
> At first glance, the sequence 0DBB 0DCA 200D 0DBA has two possible parses:
>
> 0DBB + 0DCA 200D 0DBA, i.e. Ra + Yansaya
> 0DBB 0DCA 200D + 0DBA, i.e. Repaya + Ya
>
> First question: Does the standard give any guidance as to which one is
> the intended parse? The section on Sinhala in the Unicode Standard is
> silent about this. Is there a general principle I'm missing?
>
> Sri Lanka Standard SLS 1134 (2004 draft) states that Ra+Yansaya is not
> used and is considered incorrect, suggesting that the second parse
> (Repaya+Ya) should be the default interpretation of this sequence.
> However, SLS 1134 does not address the potential ambiguity of this
> sequence explicitly and the description there could be read as
> informative, not normative.
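
The ambiguity described in the quoted message can be reproduced mechanically. The following is a rough sketch, assuming a toy token table that contains only the two letters and the three named sequences under discussion; the names TOKENS, from_hex, and parses are hypothetical, and this is not how any shaping engine segments text. It simply enumerates every way to split a code point string into those tokens.

```python
# Toy token table: the two letters plus the three named sequences
# quoted in the thread (short labels as used by Martin).
TOKENS = {
    "Ra": (0x0DBB,),
    "Ya": (0x0DBA,),
    "Yansaya": (0x0DCA, 0x200D, 0x0DBA),
    "Rakaransaya": (0x0DCA, 0x200D, 0x0DBB),
    "Repaya": (0x0DBB, 0x0DCA, 0x200D),
}


def from_hex(text):
    """Turn '0DBB 0DCA' into a tuple of integer code points."""
    return tuple(int(part, 16) for part in text.split())


def parses(codepoints):
    """Return every segmentation of codepoints into TOKENS entries."""
    codepoints = tuple(codepoints)
    if not codepoints:
        return [[]]  # one way to parse the empty string: no tokens
    results = []
    for name, seq in TOKENS.items():
        if codepoints[:len(seq)] == seq:
            # Token matches at the front; parse the remainder recursively.
            for rest in parses(codepoints[len(seq):]):
                results.append([name] + rest)
    return results


if __name__ == "__main__":
    for s in ("0DBB 0DCA 200D 0DBA", "0DBB 0DCA 200D 0DBB"):
        print(s)
        for p in parses(from_hex(s)):
            print("   ", " + ".join(p))
```

With this table each of the two strings has exactly two segmentations, matching the parses listed in the quoted message; the table itself gives no basis for preferring one over the other.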

1) re: 0DBB 0DCA 200D 0DBA

SLS 1134 was updated in 2011 (The latest public version I could find is
v3.41. This extract is the same in v3.6.):
https://sourceforge.net/p/sinhala/mailman/attachment/4D957C56.5050204@cse.mrt.ac.lk/1/

"1. The yansaya is not used following the letter ර. e.g.: the spelling
කාර‍්‍ය is incorrect."

If the above is insufficient, it's best to discuss the issue with Harsha
(CC'd) and Ruvan (CC'd).

2) re: 0DBB 0DCA 200D 0DBB

Harsha & Ruvan can clarify this too.

cya,
#

> Second question: Given that one parse of this sequence should be the
> default, how does one represent the non-default parse?
>
> In most cases one can guess what the intended meaning is, but I suspect
> this is somewhat of a gray area. In practice, trying to render these
> problematic sequences and their neighbors in HarfBuzz with a variety of
> fonts results in a variety of outcomes (including occasionally
> unexpected glyph choices). If the meaning of these sequences is not well
> defined, that would partly explain the variation across fonts.
>
> Am I missing something fundamental? If not, it seems this issue should
> be called out explicitly in some part of the standard.
>
> Regards,
> -- martin

Received on Mon Oct 17 2016 - 11:53:46 CDT
