Re: Ambiguity(?) in Sinhala named sequences from Harshula on 2016-10-16 (Unicode Mail List Archive)

From: Harshula <harshula_at_hj.id.au>
Date: Mon, 17 Oct 2016 10:15:57 +1100

Hi Martin,

On 15/10/16 04:07, Martin Jansche wrote:
> For Sinhala, the following named sequences are defined (for good reasons):
>
> SINHALA CONSONANT SIGN YANSAYA;0DCA 200D 0DBA
> SINHALA CONSONANT SIGN RAKAARAANSAYA;0DCA 200D 0DBB
> SINHALA CONSONANT SIGN REPAYA;0DBB 0DCA 200D
>
> I'll abbreviate these as Yansaya, Rakaransaya, and Repaya, and I'll
> write Ya for 0DBA and Ra for 0DBB.
>
> Note that these give rise to two potentially ambiguous codepoint
> strings, namely
>
> 0DBB 0DCA 200D 0DBA
> 0DBB 0DCA 200D 0DBB
>
> I'll concentrate on the first, as all arguments apply to the second one
> analogously.
>
> At first glance, the sequence 0DBB 0DCA 200D 0DBA has two possible parses:
>
> 0DBB + 0DCA 200D 0DBA, i.e. Ra + Yansaya
> 0DBB 0DCA 200D + 0DBA, i.e. Repaya + Ya
>
> First question: Does the standard give any guidance as to which one is
> the intended parse? The section on Sinhala in the Unicode Standard is
> silent about this. Is there a general principle I'm missing?
>
> Sri Lanka Standard SLS 1134 (2004 draft) states that Ra+Yansaya is not
> used and is considered incorrect, suggesting that the second parse
> (Repaya+Ya) should be the default interpretation of this sequence.
> However, SLS 1134 does not address the potential ambiguity of this
> sequence explicitly and the description there could be read as
> informative, not normative.
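
To make the two parses concrete, here is a minimal Python sketch that enumerates every way of splitting a code point string into the three named sequences plus the bare letters Ya and Ra. The function name parses() and the lookup tables are only illustrative, not part of any standard or library, and the sketch deliberately ignores every other Sinhala character.

```python
# Named sequences as listed in the quoted message (code points in hex).
NAMED_SEQUENCES = {
    "YANSAYA": (0x0DCA, 0x200D, 0x0DBA),
    "RAKAARAANSAYA": (0x0DCA, 0x200D, 0x0DBB),
    "REPAYA": (0x0DBB, 0x0DCA, 0x200D),
}
# Short labels for the bare letters, following Martin's abbreviations.
LETTERS = {0x0DBA: "Ya", 0x0DBB: "Ra"}


def parses(cps):
    """Return every split of cps into named sequences and bare letters."""
    cps = tuple(cps)
    if not cps:
        return [[]]
    results = []
    # Option 1: a named sequence starts at the current position.
    for name, seq in NAMED_SEQUENCES.items():
        if cps[:len(seq)] == seq:
            results += [[name] + rest for rest in parses(cps[len(seq):])]
    # Option 2: a bare letter (Ya or Ra) starts at the current position.
    if cps[0] in LETTERS:
        results += [[LETTERS[cps[0]]] + rest for rest in parses(cps[1:])]
    return results


for p in parses([0x0DBB, 0x0DCA, 0x200D, 0x0DBA]):
    print(" + ".join(p))
```

For 0DBB 0DCA 200D 0DBA this yields exactly the two readings described above (Repaya + Ya and Ra + Yansaya); calling it on 0DBB 0DCA 200D 0DBB gives the analogous pair with Rakaaraansaya.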

1) re: 0DBB 0DCA 200D 0DBA

SLS 1134 was updated in 2011. The latest public version I could find is
v3.41; the extract below is the same in v3.6:
https://sourceforge.net/p/sinhala/mailman/attachment/4D957C56.5050204@cse.mrt.ac.lk/1/

"1. The yansaya is not used following the letter ර. e.g.: the spelling
කාර‍්‍ය is incorrect."

If the above is insufficient, it's best to discuss the issue with Harsha
(CC'd) and Ruvan (CC'd).
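
If one takes the rule quoted above as settling the default, a processor only has to spot a Ra that is directly followed by 0DCA 200D 0DBA and read that stretch as Repaya + Ya rather than Ra + Yansaya. The following Python sketch shows such a check. The function name is hypothetical, and whether SLS 1134 is normative on this point is an assumption here, not something the extract confirms.

```python
RA = 0x0DBB
YANSAYA = (0x0DCA, 0x200D, 0x0DBA)


def yansaya_after_ra(cps):
    """Return the indexes of each Ra that is directly followed by 0DCA 200D 0DBA."""
    cps = list(cps)
    hits = []
    for i in range(len(cps) - len(YANSAYA)):
        # Compare the three code points after position i with the yansaya sequence.
        if cps[i] == RA and tuple(cps[i + 1:i + 1 + len(YANSAYA)]) == YANSAYA:
            hits.append(i)
    return hits


# Ra followed by 0DCA 200D 0DBA: flagged at index 0.
print(yansaya_after_ra([0x0DBB, 0x0DCA, 0x200D, 0x0DBA]))
# Ka (0D9A) followed by the same sequence: nothing flagged.
print(yansaya_after_ra([0x0D9A, 0x0DCA, 0x200D, 0x0DBA]))
```

The check says nothing about 0DBB 0DCA 200D 0DBB, since the quoted rule only mentions the yansaya.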

2) re: 0DBB 0DCA 200D 0DBB

Harsha & Ruvan can clarify this too.

cya,
#

> Second question: Given that one parse of this sequence should be the
> default, how does one represent the non-default parse?
>
> In most cases one can guess what the intended meaning is, but I suspect
> this is somewhat of a gray area. In practice, trying to render these
> problematic sequences and their neighbors in HarfBuzz with a variety of
> fonts results in a variety of outcomes (including occasionally
> unexpected glyph choices). If the meaning of these sequences is not well
> defined, that would partly explain the variation across fonts.
>
> Am I missing something fundamental? If not, it seems this issue should
> be called out explicitly in some part of the standard.
>
> Regards,
> -- martin
Received on Sun Oct 16 2016 - 18:16:33 CDT
