PRI 359 Disposition of feedback

L2/18-109

Bob Hallissy, Lorna Evans
2018-03-23

Date/Time: Mon Sep 25 15:12:22 CDT 2017

Name: Thomas Milo

Report Type: Public Review Issue

Opt Subject: Proposed Draft UTR #53, Unicode Arabic Mark Ordering Algorithm Now Available for Public Review

Please consider taking into account the established solutions for these sequences as already implemented in www.mushafmuscat.om, which is now available world-wide as the authoritative, Azhar-recommended electronic reference Qur’ān.

I don’t expect fundamental disagreements, but the project handles and solves all spelling issues without extending the existing Unicode repertoire for Arabic.

However, for one class of characters we improved the behaviour by changing their typographical behaviour from overstrike to a new category of contextual behaviour: amphibious. I’ve reported about Amphibious Characters to the UTC.

Some practical tips:

Clicking the splash screen opens the text.

Words can be searched in Manuscript View Mode, which presents the verses separated by flowers, surrounded by navigation and graphic controls. On the left top are the page number, text search and version locator boxes. Chapters can be located with the Wheel on the left top.

Historical text layers can be exposed with the Colour Triangle at the left bottom. The dormant miniatures of unpointed characters can be activated with the حسصطعه icon at the left bottom. The chapter headings are in unpointed palaeographic Arabic; a ٮٮٯٮط / تنقيط icon (on all pagespreads except the first) provides two optional styles of pointing.

Clicking in the margin brings up the Printed Mushaf View Mode, with verses marked by numbers.

The Unicode structure can be found by clicking in the text, which brings up the Interactive View Mode.

Letter blocks light up with mouse-over (selecting a letter block also activates the WordShaping interface, which provides aesthetic user interaction without touching the Unicode structure).
Single-click selects letter block
Double-click selects word
Triple-click selects verse

CTRL+C (Windows) or CMD+C (Mac) copies the selected Unicode string.

Caveat: we are preparing an update that positions all Qur’ānic stops to word final position, where they belong. This change will affect a few words that end in U+06E6 Arabic Small Yeh and U+06E5 Arabic Small Waw.

Some background information:
https://www.egypttoday.com/Article/4/14269/The-world%E2%80%99s-first-e-Quran-is-here
https://oumma.com/premiere-mondiale-coran-numerique-presente-a-mascate-oman/
Presentation of the project to the crown prince of Oman, HRH Sayyid Haytham
https://www.youtube.com/watch?v=sHtBL2GvBxE
My speech without voice-over
https://www.youtube.com/watch?v=UpxsWGxgJIo

Please don’t hesitate to ask for more clarification if needed.

No action taken. Although many of the examples cited in the UTR are from the Quran, AMTRA is not intended purely for Quranic uses but also for practical orthographies that use Arabic script.

Date/Time: Mon Sep 25 19:54:28 CDT 2017

Name: A./

Report Type: Public Review Issue

Opt Subject: PRI 359

1.) Better guidance should be given when to apply this algorithm. From reading the draft, it is usefully applied as a standard preparatory step before handing text off to a rendering engine, or perhaps also as a standard transformation on input to a rendering engine. This should be explicitly stated.

1) In order to explicitly identify the scope as the rendering pipeline and not storage, the initial Summary section was changed from:

The Unicode Arabic Mark Ordering Algorithm (UAOA) describes an algorithm for determining correct rendering of Arabic combining mark sequences

to the following:

This technical report specifies an algorithm that can be utilized during rendering for determining correct display of Arabic combining mark sequences.

This UTR makes no change to Unicode normalization forms, and does not propose a new normalization form. Instead, this is similar to the processing used in the Universal Shaping Engine (https://docs.microsoft.com/en-us/typography/script-development/use): a transient process which is used to reorder text for display in an internal rendering pipeline. This reordering is not intended for modifying original text, nor for open interchange.

2.) If there are other situations, operations or processes where transforming Arabic text using this algorithm is seen as useful, these should be stated explicitly.

2) The only section of the UTR that suggests other possible uses of the algorithm was enhanced to be more explicit:

5.6 Other uses for AMTRA
There is no intention or expectation that AMTRA would be applied to stored text. However, there may still be situations unrelated to rendering where AMTRA may be useful, and this UTR does not prohibit such use.

As an example, when a text editor is processing a backspace key, a decision has to be made about what character(s) should be removed from the text. For sequences involving combining marks, if the desire is to remove one mark at a time, users may have an expectation that the outermost marks should be removed first. For Arabic script the AMTRA could be used to identify outermost marks.
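As a rough illustration of the backspace scenario described above, the following Python sketch removes the mark that comes last in a display order rather than the mark that happens to be stored last. The names backspace_outermost and toy_display_order are hypothetical, and toy_display_order is only a stand-in that moves shadda to the front; it is not an implementation of AMTRA.

```python
import unicodedata

SHADDA = "\u0651"  # ARABIC SHADDA, ccc=33


def toy_display_order(marks):
    """Stand-in for a display reordering (NOT AMTRA): shadda first, rest unchanged."""
    return sorted(marks, key=lambda ch: 0 if ch == SHADDA else 1)


def backspace_outermost(stored, display_order=toy_display_order):
    """Delete one character from the end of `stored`.

    If the text ends in a run of combining marks (ccc != 0), delete the mark
    that is last in `display_order`; otherwise delete the last character.
    """
    if not stored:
        return stored
    # Find where the trailing run of combining marks begins.
    start = len(stored)
    while start > 0 and unicodedata.combining(stored[start - 1]) != 0:
        start -= 1
    marks = list(stored[start:])
    if not marks:
        return stored[:-1]
    # The outermost mark is the last one in display order.
    victim = display_order(marks)[-1]
    # Remove the last stored occurrence of that mark.
    idx = len(marks) - 1 - marks[::-1].index(victim)
    del marks[idx]
    return stored[:start] + "".join(marks)
```

For beh, fatha, shadda stored in canonical order (U+0628 U+064E U+0651), this sketch removes the fatha and leaves beh plus shadda, because the stand-in ordering treats the fatha as the outermost mark. The stored text is edited only by deleting one code point; it is never rewritten in display order.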

3.) There are situations and protocols that demand text in a given normalization form. Care should be taken in presenting the new algorithm so that it does not lead users to expect that all Arabic text "ought to be" always in the transformed format.

3) As mentioned above, the Summary statement was enhanced to state:

This UTR makes no change to Unicode normalization forms, and does not propose a new normalization form.

and section 5.6 was enhanced to explicitly state:

There is no intention or expectation that AMTRA would be applied to stored text.

4.) The stability note before 3.2 could be improved. The word "existing" will change meaning. Therefore:

The set of MCM characters is intended to be stable. Characters from Unicode Version XXXX or earlier will not be added or removed from this set in future updates of this algorithm. Future updates may add characters to the set only if they were encoded in any version after XXXX.

[The future version of the algorithm then changes XXXX to the latest value. This wording allows the TR to skip any versions of the Unicode Standard that do not contain new combining marks in Arabic.]

4) The original text:

For stability reasons, existing Unicode characters will not be added to the list of MCM.

was replaced with:

The set of MCM characters is stable. Characters from Unicode Version 10.0 or earlier will not be added or removed from this set in future updates to this UTR. Characters added after version 10.0 may be added to MCM at the time they are incorporated into the standard but not after.

5.) In step 2, the specification does not address keeping multiple instances, e.g. multiple MCM, in relative order when moved "to the beginning". The current text could be interpreted as requiring multiple instances of such characters to be inverted in relative order as each is moved "to the beginning". (The issue theoretically exists for shadda as it is defined by CCC value, which on the face of it allows the possibility of multiple distinct shadda code points where, again, internal ordering could be observable.)

5) The original text:

b. If a sequence of ccc=230 characters begins with any MCM characters, move those MCM to the beginning of S (before any characters with ccc=33).

c. If a sequence of ccc=220 characters begins with any MCM characters, move those MCM to the beginning of S (before any MCM with ccc=230).

was replaced with text clarifying that it is the sequence of MCM that is to be moved:

b. If a sequence of ccc=230 characters begins with any MCM characters, move the sequence of such MCM characters to the beginning of S (before any characters with ccc=33).
c. If a sequence of ccc=220 characters begins with any MCM characters, move the sequence of such MCM characters to the beginning of S (before any MCM with ccc=230 or ccc=33).
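The clarified wording of steps b and c can be sketched in Python as follows. This is a minimal illustration of moving the leading MCM run as one block so that relative order is preserved; it is not the full algorithm. The function name move_leading_mcm is hypothetical, and ILLUSTRATIVE_MCM is an assumed subset chosen for the example; the authoritative MCM list is the one given in the UTR.

```python
import unicodedata

# Assumed, illustrative subset of MCM; consult the UTR for the real list.
ILLUSTRATIVE_MCM = frozenset({"\u0654", "\u0655", "\u06DC"})


def move_leading_mcm(marks, ccc, mcm=ILLUSTRATIVE_MCM):
    """Move the MCM run that begins the first ccc-run to the front of the list.

    `marks` is the mark sequence S as a list of single characters.
    The run is moved as a block, so its internal order is unchanged.
    """
    marks = list(marks)
    # Locate the first character with the requested combining class.
    start = next(
        (i for i, ch in enumerate(marks) if unicodedata.combining(ch) == ccc),
        None,
    )
    if start is None:
        return marks
    # Extend over consecutive MCM characters of that class.
    end = start
    while (
        end < len(marks)
        and unicodedata.combining(marks[end]) == ccc
        and marks[end] in mcm
    ):
        end += 1
    # If the run does not begin with an MCM, start == end and nothing moves.
    return marks[start:end] + marks[:start] + marks[end:]


def steps_b_and_c(marks):
    """Apply the sketch of step b (ccc=230), then step c (ccc=220)."""
    return move_leading_mcm(move_leading_mcm(marks, 230), 220)
```

For the input U+0651 U+0655 U+0654 U+06DC U+0653, the step b sketch yields U+0654 U+06DC U+0651 U+0655 U+0653, with U+0654 still ahead of U+06DC, and the step c sketch then places U+0655 at the very front.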

Date/Time: Fri Oct 6 05:59:06 CDT 2017

Name: r12a

Report Type: Public Review Issue

Opt Subject: When should UAOA be used?

I'm sending this on behalf of the W3C i18n WG. It relates to UTR#53.

I'm hearing through other channels that the algorithm described is intended to just indicate how characters should be temporarily reordered prior to rendering, rather than describe the order in which code points should be stored. Since most fonts generally produce the behaviour described anyway, it presumably therefore amounts to documenting expectations in terms of font behaviour, rather than specifying a new form of normalisation.

It's not at all clear from the document that that is the case, however, which has caused the W3C WG significant alarm (and wasted discussion cycles). Please update the document to make this clearer. We will hold back the other comments we currently have queued up to send until we can re-evaluate them in the light of the changes to the document.

r12a correctly identifies the intended use of the algorithm: transient reordering used during rendering and not a new form of normalization. We hope the changes mentioned above make that more explicit.

In contrast to r12a’s statement, the authors are unaware of any fonts that “generally produce the behaviour described” in the draft UTR.

Btw, the understanding of the intended use of UAOA is not helped by the way the document mentions canonically equivalent character sequences, nor by the vague descriptions of when CGJ should be used.

The Unicode Standard 10.0, in section 5.13, states:

Canonical equivalence must be taken into account in rendering multiple accents, so that any two canonically equivalent sequences display as the same.

A corollary of this is: if the text author wants two sequences to display differently, those sequences must not be canonically equivalent. As further stated in The Unicode Standard 10.0, section 23.2:

[The CGJ] is also used to distinguish sequences that would otherwise be canonically equivalent.

In that the intent of this UTR is to provide a mechanism to support rendering, the authors consider it to be within the scope of this UTR to address issues related to canonical equivalence of texts being rendered.
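The two quoted statements can be checked with a short Python sketch using the standard unicodedata module. The helper name canonically_equivalent is introduced here for illustration only.

```python
import unicodedata

BEH = "\u0628"     # ARABIC LETTER BEH
FATHA = "\u064E"   # ARABIC FATHA, ccc=30
SHADDA = "\u0651"  # ARABIC SHADDA, ccc=33
CGJ = "\u034F"     # COMBINING GRAPHEME JOINER, ccc=0


def canonically_equivalent(x, y):
    """Two strings are canonically equivalent if their NFD forms are identical."""
    return unicodedata.normalize("NFD", x) == unicodedata.normalize("NFD", y)


shadda_first = BEH + SHADDA + FATHA
fatha_first = BEH + FATHA + SHADDA
with_cgj = BEH + SHADDA + CGJ + FATHA

# Same marks in either order: canonically equivalent, so they must display alike.
print(canonically_equivalent(shadda_first, fatha_first))  # True

# CGJ blocks canonical reordering, so this sequence stays distinct.
print(canonically_equivalent(with_cgj, fatha_first))      # False
```

Because CGJ has combining class 0, normalization leaves the order of the marks on either side of it unchanged, which is what allows an author to request a different display for an otherwise equivalent sequence.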

Date/Time: Fri Oct 6 06:05:21 CDT 2017

Name: r12a

Report Type: Public Review Issue

Opt Subject: AMOA rather than UAOA ?

http://www.unicode.org/reports/tr53/

"The Unicode Arabic Mark Ordering Algorithm (UAOA)"

I find it difficult to figure out how one should pronounce UAOA and difficult to pronounce either way. I think AMOA (or even UAMAO) would be easier. Please consider that or some other change.

The name of the algorithm was changed to “Arabic Mark Transient Reordering Algorithm”, which has the more pronounceable acronym “AMTRA”.

Date/Time: Tue Oct 10 09:40:48 CDT 2017

Name: David Corbett

Report Type: Public Review Issue

Opt Subject: PRI #359: U+08D9 ARABIC SMALL LOW NOON WITH KASRA

U+08D9 ARABIC SMALL LOW NOON WITH KASRA has Canonical_Combining_Class=Above when it should have been Below. Could the UAOA reorder it as Below?

No action taken. As defined, AMTRA will always order U+08D9 after all Below (ccc=220) marks.

While it is technically possible to alter AMTRA such that U+08D9 is treated throughout the algorithm as if it were ccc=220 so that it maintains its position relative to other ccc=220 marks, doing so would not guarantee consistent rendering, since text processes (prior to rendering) are free to reorder U+08D9 relative to any ccc=220 marks, thus resulting in different rendering for canonically equivalent texts.
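The reordering that earlier text processes may apply can be observed with Python's unicodedata module. This is a small sketch, assuming a Python build whose Unicode data includes U+08D9 (Unicode 9.0 or later); U+0655 ARABIC HAMZA BELOW is used here only as a convenient ccc=220 mark.

```python
import unicodedata

BEH = "\u0628"                  # ARABIC LETTER BEH
LOW_NOON_WITH_KASRA = "\u08D9"  # ccc=230 (Above), although drawn below
HAMZA_BELOW = "\u0655"          # ccc=220 (Below)

typed = BEH + LOW_NOON_WITH_KASRA + HAMZA_BELOW
normalized = unicodedata.normalize("NFD", typed)

# Canonical ordering sorts ccc=220 before ccc=230, so the two marks swap.
print([f"U+{ord(ch):04X}" for ch in normalized])
```

Both orders are canonically equivalent, so a renderer cannot rely on where U+08D9 was originally placed relative to Below marks.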

Date/Time: Fri Oct 13 16:48:21 CDT 2017

Name: Behnam Esfahbod

Report Type: Public Review Issue

Opt Subject: Feedback on Proposed Draft UTR #53 — Revision 1

Status: Liaison Contribution - W3C i18n WG

# Using UAOA in Text Editing
On Section 5.6 “Other uses for UAOA”, we have:

> > UAOA is very useful in implementations of backspacing in cases where
> > there is no external information available about the original order
> > in which the text was entered.

For an average user of modern languages using the script, reordering the marks entered on a keyboard would be unexpected behavior.

Basically, the document is suggesting that when a user authors a text file with Arabic Marks put in a specific order, then once the file is closed and reopened, the backspace should behave differently from the previous session.

Also, it is not clear at all if UAOA will be useful in a text editing scenario. The claim for UAOA to be "very useful" needs some evidence, such as an existing implementation or some other data to support it.

As mentioned above, Section 5.6 was rewritten. In particular, the rewrite states:

There is no intention or expectation that AMTRA would be applied to stored text.

The rewrite also provides a more detailed discussion about why a text editor may want to utilize AMTRA within backspace processing, but the UTR does not require such.

From the language and examples of the document, it looks like the usage of the algorithm is too focused on one application, Quranic text, and the claims are related only to that specific application of the script.

While Quranic texts were the easiest examples to find, the algorithm is not specific to such.

Date/Time: Fri Oct 13 16:59:35 CDT 2017

Name: Behnam Esfahbod

Report Type: Public Review Issue

Opt Subject: Feedback on Proposed Draft UTR #53 — Revision 1

Status: Individual Contribution

The way Unicode Normalization works for Arabic Marks indeed has its problems, especially in font development and text rendering. The algorithm proposed in this PDUTR is a good way to address some of these problems. But the document needs improvements in a few areas to be clear about what it does, when it should be applied, how it should be used, and what to expect from it.

# 1. Scope of the PDUTR

It looks like the PDUTR is the first UTR focused on details of rendering of Unicode text (besides the text of the Unicode Standard). Arabic is only one of the scripts that need some special attention (possibly reordering of the characters in memory) for rendering. It could be a better approach to have a document (UTR) focused on text rendering, which would also contain this algorithm for Arabic script, and would collect other best practices over time, for other issues of rendering Arabic script, as well as other scripts.

#1. UTC has recognized that other scripts may need similar special attention.

# 2. Scope of the algorithm

The scope of the algorithm is not clear, neither in its title nor in the language.

The name “Unicode Arabic Mark Ordering Algorithm” is suggesting that this is expected to be the only way Arabic Marks should be ordered in Unicode. That’s clearly not the case. In fact, the document is proposing an algorithm for “reordering” Arabic Marks (not just how they should be ordered) to solve a problem in “rendering” of the script. The title needs to be clear about this. Maybe “Unicode Arabic Mark Reordering Algorithm for Rendering” (AMRAR)?

Similarly, the Section 2 “Background” doesn’t clarify the scope of the algorithm and only explains how something is not working for some specific application with the existing normalization methods.

#2. As mentioned above, the initial Summary statements were enhanced to clarify intended use of the algorithm.

# 3. Consequences of the Algorithm: Normalization

The draft proposal is not clear about the effects of applying the algorithm on text. Specifically, for strings X for which this algorithm is useful, we have UAOA(toNFC(X)) ≠ toNFC(UAOA(X)).

So, although the behavior of the algorithm can be stabilized over Unicode versions, it’s very important how and when it’s applied to the text, since it changes a text in normalized form to a non-normalized form. Therefore, in terms of normalization, the algorithm cannot be considered stable at all. The document needs to be clear about this, even though it’s obvious from a technical point of view.

# 4. Consequences of the Algorithm: Semantics

With UAOA applied on text during rendering, some strings collapse into a single sequence. Basically, there are plenty of strings X and Y, where toNFC(X) ≠ toNFC(Y), but UAOA(toNFC(X)) = UAOA(toNFC(Y)).

Basically, this is changing the semantics of existing text encoded in Unicode, since the rendering will be different afterwards. The document is not clear about this semantic change and only claims to be “correcting” all the problems.

The proposal is suggesting to use CGJ to preserve the old semantics when needed. The document needs to be more clear about how to preserve the semantics. In fact, there should be a clear algorithm to convert a string X to preserve the semantics when changing the (rendering) interpretation, since for a couple of decades users have been storing text in the current semantics of the encoding, which has been the only recommended way to do so by Unicode.

#3 and #4. In separate correspondence, items #3 and #4 were withdrawn by their author as they stemmed from a misunderstanding of the algorithm. The text that caused that misunderstanding (steps 2b and 2c of the AMTRA) has been clarified as noted above.

# 5. Not enough details in the examples

The examples are missing the information needed for the average audience to understand the details. To be understood correctly, they need to be accompanied by the encoding of the text they are representing, and how the algorithm works on such a sequence.

#5. Example 4a was removed as it did not contribute to the document. After renumbering examples 4b and 4c, additional details — similar to those originally provided for examples 2a and 2b — were added to examples 1, 3, 4a and 4b.

Date/Time: Wed Jan 10 08:29:55 CST 2018

Name: r12a

Report Type: Error Report

Opt Subject: Use HTML rather than PDF

This is a comment from the W3C i18n WG.

http://www.unicode.org/reports/tr53/

When the spec is provided for review in PDF it isn't possible to

- link to a specific section in the review report
- copy the text into a report
- search for text in the document when reviewing reported issues.

Could we, in future, please provide HTML-based documents? (It's ok to use images for the examples that are unlikely to be rendered properly for all readers.)

The next draft is being prepared in HTML-based format rather than PDF.