L2/22-243

Comments on Public Review Issues
(July 11 - October 27, 2022)

The sections below contain links to permanent feedback documents for the open Public Review Issues as well as other public feedback as of October 24, 2022, since the previous cumulative document was issued prior to UTC #172 (July 11, 2022).

Contents:

The links below go directly to open PRIs and to feedback documents for them, as of October 24, 2022.

Issue  Name                                                            Feedback Link
458    Proposed Update UTR #17, Unicode Character Encoding Model       (feedback) No feedback at this time
459    Proposed Update UTR #23, The Unicode Character Property Model   (feedback) No feedback at this time

The links below go to locations in this document for feedback.

Feedback routed to CJK & Unihan Group for evaluation [CJK]
Feedback routed to Script ad hoc for evaluation [SAH]
Feedback routed to Properties & Algorithms Group for evaluation [PAG]
Feedback routed to Emoji SC for evaluation [ESC]
Feedback routed to Editorial Committee for evaluation [EDC]
Other Reports

 


Feedback routed to CJK & Unihan Group for evaluation [CJK]

Date/Time: Tue Jul 19 20:50:44 CDT 2022
Name: Eiso Chan
Report Type: Error Report
Opt Subject: New entry for UTN #43

U+20B9A 𠮚 should be tagged as B for BOPOMOFO LETTER R U+3116 ㄖ. U+20B9A 𠮚 is 
not a common character for the modern CJKV people.

Date/Time: Sun Jul 31 20:08:08 CDT 2022
Name: Eiso Chan
Report Type: Error Report
Opt Subject: kMandarin value for U+3D65

The current kMandarin value for U+3D65 㵥 is bì. Kangxi Dictionary shows
覓畢切, and the pronunciation is the same as 密, and it is a variant of
U+3D35 㴵. Hanyu Dazidian shows similar information. Hanyu Dazidian
also shows that the Putonghua reading for 㴵 is mì, which is the same as in
the Unihan Database. One of my friends uses 㵥 in her name, and she told
me the reading for 㵥 in her name is mì, following Kangxi Dictionary.

It is better to update the kMandarin value for U+3D65 㵥 to mì.

Date/Time: Mon Aug 8 07:25:17 CDT 2022
Name: Andrew West
Report Type: Error Report
Opt Subject: CJK Unified Ideographs code chart

In the Unicode 15.0 beta code charts, UTC-00355 (⿰㫫頁) is mapped to U+9855 顕 (⿰显頁). 
It should be mapped to U+29530 𩔰 (⿰㫫頁).

Date/Time: Tue Aug 9 05:28:31 CDT 2022
Name: Andrew West
Report Type: Error Report
Opt Subject: Unihan_IRGSources.txt (15.0)

The kRSUnicode value for U+31D40 (⿰牜磨) is 112.15 (i.e. 石 radical), but this 
is unintuitive, and makes the character hard to find. Please add an additional 
kRSUnicode value of 93.16 (i.e. 牛 radical).

Date/Time: Tue Aug 9 06:18:28 CDT 2022
Name: Andrew West
Report Type: Error Report
Opt Subject: Unihan_IRGSources.txt (15.0)

U+31DBF (⿰氵穿) has a kRSUnicode value of 116.7 (i.e. 穴 radical). This is 
unintuitive, and makes the character hard to find. Please add an additional 
kRSUnicode value of 85.9 (i.e. 水 radical).

Date/Time: Mon Sep 5 08:44:48 CDT 2022
Name: Huáng Jùnliàng
Report Type: Error Report
Opt Subject: UniHan.zip/Unihan_Readings.txt

Currently, the kMandarin of U+2277B 𢝻 is hōng. However, according to GHZ p. 2495
(https://homeinmists.ilotus.org/hd/hydzd3.php?st=page_no&kw=2495), 𢝻 is
a variant of 惚, so the kMandarin should be hū. The reading hū is also
supported by CNS11643: https://www.cns11643.gov.tw/wordView.jsp?ID=672838

Date/Time: Thu Oct 27 16:18:28 CDT 2022
Name: Michel Mariani
Report Type: Other Document Submission
Opt Subject: Name of the fifth new Ideographic Description Character

To be considered by the UTC when they meet next week:

I had a quick look at the recently released document "CJK & Unihan
Group Recommendations for UTC #173 Meeting"
<https://www.unicode.org/L2/L2022/22247-cjk-unihan-group-utc173.pdf>
and I noticed that the Unicode name for the proposed fifth IDC character
(subtraction) is back to "IDEOGRAPHIC DESCRIPTION CHARACTER *STROKE*
SUBTRACTION" (possibly because it is planned to be located at the end of
the "CJK Strokes" block), after being briefly renamed "IDEOGRAPHIC
DESCRIPTION CHARACTER SUBTRACTION"
in <https://www.unicode.org/L2/L2021/21173r-cjk-unihan-group-utc169.pdf>:

> There was general agreement that the five IDCs (Ideographic Description
 Characters) in the preliminary proposal are useful and should be
 considered for encoding after the formal proposal has been submitted, but
 that the one named IDEOGRAPHIC DESCRIPTION CHARACTER STROKE SUBTRACTION
 should be renamed IDEOGRAPHIC DESCRIPTION CHARACTER SUBTRACTION (the word
 STROKE is removed) and should therefore allow components to be subtracted
 in addition to strokes.

Then, the character is mentioned as "IDEOGRAPHIC DESCRIPTION
CHARACTER *COMPONENT* SUBTRACTION"
in <https://www.unicode.org/L2/L2022/22191-five-new-idc-chars.pdf>.

I would like to point out a few issues with the latest suggested name:

- This new IDC character is already in use in practice, mainly in the
  IDS.TXT data file maintained by Andrew West, and it is currently already
  used to indicate a *component* subtraction, which gives far more
  flexibility, even if a component can sometimes be made of only one CJK
  stroke (CJK strokes being allowed in any component)...

- This new IDC character should be consistent with all the other ones, which
  deal with *components*, and deciding on a name related to *strokes* is
  IMO too restrictive and somewhat disconcerting...

- This IDC character could also be used in non-CJK ideographic contexts,
  such as Tangut and others yet to be defined, and so it should be
  as general as possible for future use...

--Michel Mariani

Feedback routed to Script ad hoc for evaluation [SAH]

Date/Time: Wed Sep 14 09:07:32 CDT 2022
Name: David Corbett
Report Type: Other Document Submission
Opt Subject: Relative order of U+A802 and U+A823

L2/02-388 says “The correct encoded representation for this diphthong follows the phonological 
ordering: < Syloti Nagri dependent a, Syloti Nagri dvisvara sign >”. U+A802 SYLOTI NAGRI 
SIGN DVISVARA has Indic_Positional_Category=Top. U+A823 SYLOTI NAGRI VOWEL SIGN A has 
Indic_Positional_Category=Right. The usual order of Indic vowel signs in Unicode is left, top, 
bottom, right. Therefore, it seems like U+A802 should actually precede U+A823, but on the 
other hand Unicode often orders marks phonetically, so maybe U+A823 should precede U+A802.

Which order should Syloti Nagri text use? The standard should explicitly explain which order to use.

Feedback routed to Properties & Algorithms Group for evaluation [PAG]

Date/Time: Tue Jul 26 06:12:36 CDT 2022
Name: Oliver Kuederle
Report Type: Error Report
Opt Subject: UAX #29, 14.0.0

In Unicode Standard Annex #29 (Unicode Text Segmentation), v14.0.0, there
appears to be an inconsistency between the grapheme cluster boundary rules
and the word boundary rules. Specifically, rule GB13 states that a pair of
regional indicators may not be broken. If a zero-width joiner precedes a
regional indicator, this matches [^RI] and the counting of RI thus starts
again. There is no exception for ZWJ in this specific case.

For word boundaries, however, rule WB4 will cause an RI before a ZWJ to
maintain its count (WB15/WB16). So the following sequence will break
differently for graphemes and for words:

RI ZWJ RI RI

Following the grapheme rules, this will lead to:

RI × ZWJ ÷ RI × RI

And for word rules, this will lead to:

RI × ZWJ × RI ÷ RI

The word rules will therefore break a grapheme cluster which is probably not
intended.
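
The discrepancy described above can be reproduced with a minimal sketch. It models only the rules named in the report (GB9, GB12/GB13 and GB999 for grapheme clusters; WB4, WB15/WB16 and WB999 for words) and only the two classes RI and ZWJ. The function names and the list-of-class-names input are assumptions made for illustration, not anything defined by UAX #29.

```python
def grapheme_breaks(classes):
    """Decide each boundary using only GB9, GB12/GB13 and GB999."""
    out = []
    for i in range(1, len(classes)):
        prev, cur = classes[i - 1], classes[i]
        if cur == "ZWJ":
            out.append("×")  # GB9: no break before ZWJ
        elif prev == "RI" and cur == "RI":
            # GB12/GB13: count the contiguous run of RI ending at prev;
            # any other class (including ZWJ) restarts the count.
            run, j = 0, i - 1
            while j >= 0 and classes[j] == "RI":
                run += 1
                j -= 1
            out.append("×" if run % 2 == 1 else "÷")
        else:
            out.append("÷")  # GB999
    return out


def word_breaks(classes):
    """Decide each boundary using only WB4, WB15/WB16 and WB999."""
    out = []
    for i in range(1, len(classes)):
        cur = classes[i]
        if cur == "ZWJ":
            out.append("×")  # WB4: ZWJ is absorbed into what precedes it
            continue
        # WB4: look through any ZWJ to find the effective previous class.
        j = i - 1
        while j >= 0 and classes[j] == "ZWJ":
            j -= 1
        if j >= 0 and classes[j] == "RI" and cur == "RI":
            # WB15/WB16: count RI in the run, ignoring interleaved ZWJ.
            run = 0
            while j >= 0 and classes[j] in ("RI", "ZWJ"):
                if classes[j] == "RI":
                    run += 1
                j -= 1
            out.append("×" if run % 2 == 1 else "÷")
        else:
            out.append("÷")  # WB999
    return out


sequence = ["RI", "ZWJ", "RI", "RI"]
print(grapheme_breaks(sequence))
print(word_breaks(sequence))
```

Under these simplifications, the two functions disagree on the last two boundaries of the sequence, which is the inconsistency the report points out.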

Date/Time: Mon Aug 22 14:40:56 CDT 2022
Name: Charlotte Buff
Report Type: Error Report
Opt Subject: Line break class of U+1342F

U+1342F EGYPTIAN HIEROGLYPH V011D currently has Line_Break=Alphabetic (AL) in 
the preliminary data files for Unicode 15. Because this hieroglyph is the start 
of a cartouche, it should have Line_Break=Open_Punctuation (OP) instead. 
This property value is shared by all other hieroglyphs with a similar 
function (U+13258..U+1325A, U+13286, U+13288, U+13379).

Date/Time: Thu Sep 8 15:38:26 CDT 2022
Name: Asmus/
Report Type: Website Problem
Opt Subject:

https://www.unicode.org/policies/stability_policy.html 

This page should cite definitions of terms such as "domain". This could be
done either by citing the location of their formal definition or, perhaps
better, by making them glossary links and then ensuring that any glossary
item always cites the formal definition it is based on.

This came up in the context of adding "domain stability", which
introduces the word "domain", which is perhaps not in everybody's active
vocabulary.

Date/Time: Thu Sep 15 03:28:12 CDT 2022
Name: Rossen Mikhov [Ed Note: Email to this person always fails, so they cannot be contacted; this applies to all of their submissions below.]
Report Type: Error Report
Opt Subject: UTS #18: Unicode Regular Expressions

https://www.unicode.org/reports/tr18/#Subtraction_and_Intersection 
Version 23
Date 2022-02-08

Location:
Section "1.3 Subtraction and Intersection", near the end of the section.

Wrong text:
Thus the following matches all code points that neither have a Script value of 
Greek nor are in Basic_Emoji:
    [^[\p{Script=Greek} && \p{Basic_Emoji}]] 

Possible correction:
Thus the following matches all code points that do not simultaneously have a 
Script value of Greek and are in Basic_Emoji:

Suggestion:
There are no Greek emoji, so the example actually matches all Unicode code 
points. Perhaps a more illustrative example should be given.
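
The difference between the two readings is De Morgan's law, and can be sketched with Python sets over a toy universe. The sets `greek` and `emoji` below are made-up stand-ins, not real property data, and their overlap is artificial so that the two results differ visibly.

```python
universe = set(range(10))
greek = {1, 2, 3}   # hypothetical stand-in for \p{Script=Greek}
emoji = {3, 4, 5}   # hypothetical stand-in for \p{Basic_Emoji}

# Complement of the intersection: what [^[A && B]] denotes.
not_both = universe - (greek & emoji)

# Complement of the union: what "neither A nor B" describes.
neither = universe - (greek | emoji)

print(sorted(not_both))
print(sorted(neither))

# With an empty intersection, as for the real properties,
# the complement of the intersection is the whole universe.
print(universe - (greek & set()) == universe)
```

In this toy model, element 1 is "Greek" but still belongs to `not_both`, so the complement of the intersection does not exclude Greek code points.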

Date/Time: Thu Sep 15 05:52:10 CDT 2022
Name: Rossen Mikhov
Report Type: Error Report
Opt Subject: Unicode Chapter 3 Conformance

https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf 
Version 15.0.0

Location:
D62b Graphical Application

Problematic text:

A nonspacing mark in a defective combining character sequence is not part of
a grapheme cluster and is subject to the same kinds of fallback processing
as for any defective combining character sequence.

Explanation:

"Grapheme cluster" is defined in D60 as "The text between grapheme cluster
 boundaries". So, formally, any character is part of some grapheme cluster,
 be it a degenerate one.

What is more troubling with this definition D62b is that it states that
nonspacing marks apply to grapheme bases, with "Grapheme base" being
defined in D58 as based on Grapheme_Base. But Grapheme_Base is no longer
used by UAX #29. It isn't clear whether nonspacing marks should "graphically
apply" to things other than Grapheme_Base characters and Korean syllables;
for example, what about emoji ZWJ sequences?

Date/Time: Fri Sep 16 09:45:20 CDT 2022
Name: Rossen Mikhov
Report Type: Error Report
Opt Subject: UAX #29: Unicode Text Segmentation


https://www.unicode.org/reports/tr29/#Table_Combining_Char_Sequences_and_Grapheme_Clusters 
Version: Unicode 15.0.0
Date: 2022-08-26
Revision: 41

Location: Table 1b. Combining Character Sequences and Grapheme Clusters

Problematic text:
legacy grapheme cluster:  crlf | Control | legacy-core legacy-postcore*
extended grapheme cluster:  crlf | Control | precore* core postcore*

Possible correction:
legacy grapheme cluster:  crlf | CR | LF | Control | legacy-core legacy-postcore*
extended grapheme cluster:  crlf | CR | LF | Control | precore* core postcore*

Alternative possible correction:
(In table 1c) crlf := CR LF | CR | LF

Explanation:
Looks like a simple editorial omission.
With this minor correction, the regular expressions exactly correspond to 
the specification of the rules GB1-GB999.
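
The effect of the omission can be sketched with Python's `re` module by encoding each Grapheme_Cluster_Break class as a single letter. The letter encoding, the `segment` helper, and the reduction of `precore* core postcore*` to `XE*` are all simplifying assumptions for illustration.

```python
import re

# Hypothetical one-letter encoding of classes:
# R = CR, L = LF, C = Control, X = core, E = postcore
AS_PUBLISHED = re.compile(r"RL|C|XE*")
CORRECTED = re.compile(r"RL|R|L|C|XE*")  # "RL" listed first so CR LF stays together


def segment(pattern, classes):
    """Split a class string into clusters; return None if a position matches nothing."""
    clusters, pos = [], 0
    while pos < len(classes):
        match = pattern.match(classes, pos)
        if match is None:
            return None
        clusters.append(match.group())
        pos = match.end()
    return clusters


print(segment(AS_PUBLISHED, "RLLX"))
print(segment(CORRECTED, "RLLX"))
```

With the published expression, the lone LF after the CR LF pair matches no alternative; with the corrected expression, every position belongs to some cluster.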

Date/Time: Fri Sep 16 09:51:21 CDT 2022
Name: Rossen Mikhov
Report Type: Error Report
Opt Subject: UAX #29: Unicode Text Segmentation

https://www.unicode.org/reports/tr29/#Testing 
Version: Unicode 15.0.0
Date: 2022-08-26
Revision: 41

Location: 7 Testing

Problematic text:
Note: Testing two adjacent characters is insufficient for determining a boundary, 
except for the case of the default grapheme clusters.

Possible correction:
Note: Testing two adjacent characters is insufficient for determining a boundary.

Explanation:
Maybe the easiest counterexample is a sequence of many RI characters. There is no 
fixed limit to the number of preceding characters needed for context.
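
The counterexample can be made concrete with a small sketch of GB12/GB13 applied to a run consisting only of RI characters. The function name and its integer parameter are assumptions for illustration.

```python
def break_before_last_ri(ri_count):
    """For a run of ri_count RI characters (ri_count >= 2), report whether
    GB12/GB13 place a boundary between the last two of them."""
    preceding = ri_count - 1  # RI characters before the final one
    # An even number of preceding RIs means they are fully paired,
    # so the final RI starts a new cluster.
    return preceding % 2 == 0


for n in range(2, 7):
    print(n, break_before_last_ri(n))
```

The adjacent pair is (RI, RI) in every case, yet the answer alternates with the length of the run, so the pair alone cannot determine the boundary.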

Date/Time: Wed Sep 21 02:47:38 CDT 2022
Name: Rossen Mikhov
Report Type: Error Report
Opt Subject: UAX #14: Unicode Line Breaking Algorithm

UAX #29: Unicode Text Segmentation
https://www.unicode.org/reports/tr29/#Table_Combining_Char_Sequences_and_Grapheme_Clusters 
Version: Unicode 15.0.0
Date: 2022-08-26
Revision: 41

UAX #14: Unicode Line Breaking Algorithm
https://www.unicode.org/reports/tr14/#Dictionary 
Version: Unicode 15.0.0
Date: 2022-08-16
Revision: 49

Location: 5.2 Dictionary Usage

Problematic text:
BBC English Dictionary: sIləbl where I is <U+026A, U+0332> and ə is U+0259. 
The vowel of the stressed syllable is underlined.
Collins Cobuild English Language Dictionary: sIləbə°l where I is <U+026A, U+0332> 
and has the same meaning as in the BBC English Dictionary. The ə is U+0259 (both times). 
The ° is a U+2070 and indicates the schwa may be omitted.

Explanation:
The typeset examples do not correspond to the explanation text.
Specifically, the examples have the final letter "l" underlined (with an HTML <u> 
tag, not with U+0332, so cannot reproduce here). But this is not the stressed vowel. 
This should not be underlined and instead the second letter "I" should be underlined.

The typeset examples in this section also deviate from the explanations in other ways 
("I" is not U+026A as stated, "°" is not U+2070 as stated, etc.) but those are visually 
similar and can be forgiven for lack of fonts or something in the document producing system.

Date/Time: Wed Sep 21 07:53:00 CDT 2022
Name: Rossen Mikhov
Report Type: Error Report
Opt Subject: UAX #14: Unicode Line Breaking Algorithm

https://www.unicode.org/reports/tr14/#Examples 
Version: Unicode 15.0.0
Date: 2022-08-16
Revision: 49

Location: 8.2 Examples of Customization, Example 7

Problematic text 1:

The tailoring can be accomplished by first segmenting the text into grapheme
clusters according to the rules defined in UAX #29, and then finding line
breaks according to the default line break rules, giving each grapheme
cluster the line breaking class of its first code point.

Explanation:

This tailoring wouldn't be conforming in edge cases. Suppose the text
is <CR, LF, LF>. After applying UAX #29, this becomes two grapheme
clusters <CR, LF> and <LF>, with first code points <CR>
and <LF>, respectively. Then default line breaking rules would
prevent a line break between these, contrary to the conformance requirement
for a mandatory break.
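
The edge case can be sketched by modeling only rule LB5 of UAX #14 for the classes CR and LF. The function name and the list representation of grapheme clusters are assumptions for illustration; no other line breaking rules are modeled.

```python
def lb5(cur, nxt):
    """LB5 restricted to CR and LF: CR × LF, otherwise a mandatory break after CR or LF."""
    if cur == "CR" and nxt == "LF":
        return "×"
    if cur in ("CR", "LF"):
        return "!"
    return None  # other rules are not modeled here


# Untailored: classes of the individual code points <CR, LF, LF>.
code_points = ["CR", "LF", "LF"]
print(lb5(code_points[1], code_points[2]))  # boundary after the first LF

# Tailored as in Example 7: one class per grapheme cluster, taken from its first code point.
clusters = [["CR", "LF"], ["LF"]]
tailored = [cluster[0] for cluster in clusters]
print(lb5(tailored[0], tailored[1]))        # the same boundary, after the <CR, LF> cluster
```

In this model the untailored rules give a mandatory break after the first LF, while the tailored classes make the same position look like CR followed by LF, which LB5 keeps together.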

Problematic text 2:
An example of a grapheme cluster that would be split by the default line 
break rules is a Zero Width Space followed by a combining mark.

Explanation:
According to the latest version of UAX #29, Zero Width Space followed by 
a combining mark does not form one grapheme cluster (ZWSP has Grapheme_Cluster_Break=Control).

Feedback routed to Emoji SC for evaluation [ESC]

Date/Time: Thu Sep 15 11:12:15 CDT 2022
Name: Rossen Mikhov
Report Type: Error Report
Opt Subject: UTS #51: Unicode Emoji

https://www.unicode.org/reports/tr51/#gender-neutral 
Version: 15.0
Date: 2022-08-31
Revision: 23

Location:
"2.3.1 Gender-Neutral Emoji", near the end of the section

Wrong text:
Gender-neutral versions of the profession or role emoji using object format type ZWJ 
sequences are promulgated by adding them to the *RGI emoji tag sequence set*.

Possible correction:
Gender-neutral versions of the profession or role emoji using object format type ZWJ 
sequences are promulgated by adding them to the *RGI emoji ZWJ sequence set*.

Feedback routed to Editorial Committee for evaluation [EDC]

Date/Time: Tue Jul 19 14:30:32 CDT 2022
Name: Ivan Panchenko
Report Type: Error Report
Opt Subject: UTR #54

UTR #54 contains the mistake “a one of several” (instead of just “one of several”) 
and a needless comma here: “Separation of the glyph variant information and documentation 
of all the associated contextual rules and their interaction with the Mongolian text model, 
from the production of versioned code charts would also make it possible to update this 
information much more quickly.”

Date/Time: Sun Aug 7 06:29:40 CDT 2022
Name: Yasuhiro Inukai
Report Type: Error Report
Opt Subject: Unicode Standard Version 14.0 Core Specification

There is an error in Figure 13-7 on p.556 of Unicode Standard Version 14.0 Core 
Specification (https://www.unicode.org/versions/Unicode14.0.0/UnicodeStandard-14.0.pdf).
Under “cherig”, the example glyph just to the right of “1821” is not correct. It shows
a glyph resembling U+1822 (MONGOLIAN LETTER I) instead of U+1821 (MONGOLIAN LETTER E).
Thanks,

Date/Time: Wed Sep 14 09:09:43 CDT 2022
Name: David Corbett
Report Type: Other Document Submission
Opt Subject: Typo in chapter 9


Chapter 9 includes the word “AARABIC” in the “High Hamza” section.

Date/Time: Sun Sep 18 12:29:03 CDT 2022
Name: Mark Longley
Report Type: Error Report
Opt Subject: Unicode Standard Version 15.0 Core Specification

In the Unicode Standard - Version 15.0 - Core Specification in section 23.9 
Tag Characters on page 945 there is a minuscule error in the second subsection 
Deprecated Use for Language Tagging. It is stated that "In Version 8.0, all 
but the language tag identification character were un-deprecated" whereas 
in fact U+E007F CANCEL TAG was still deprecated in Version 8.0 and was not 
un-deprecated until Version 9.0.

Date/Time: Fri Sep 23 22:38:51 CDT 2022
Name: Pablo Sebastián Viola
Report Type: Error Report
Opt Subject: UnicodeStandard-15.0.pdf

I am reading the file stored in 
https://www.unicode.org/versions/Unicode15.0.0/UnicodeStandard-15.0.pdf.

On page xxiii I see that Unicode version 15.0 is referred to,
which confirms that I am reading the right file.

However, in many places, the document refers to itself as version 14.0.
I found mentions of version 14.0 that are probably wrong
on pages 75, 76, 77, and 83.

In the Index there are entries referring to Version 14.0 that should probably be 15.0:
"Characters, .... number encoded in version 14.0.... p.3",
"Version 14.0..... p.77".

There are other places where version 14.0 is mentioned, but they are probably right.

Date/Time: Tue Oct 11 15:20:10 CDT 2022
Name: Markus Scherer
Report Type: Error Report
Opt Subject: core spec 2.9 Details of Allocation vs. plane 3

Figure 2-13 Unicode Allocation ( https://www.unicode.org/versions/Unicode15.0.0/ch02.pdf#G286741 
page 47) still shows the U+3xxxx plane as Reserved. We have had CJK characters 
there since Unicode 13. This plane should be shaded for Graphic characters.

The text on page 51 about "Plane 3 (TIP)" might be fine, but I suspect that its 
statement that it "is dedicated to encoding additional unified CJK characters" 
also predates Unicode 13. At a minimum, we should add a comma after the "(TIP)" 
in the paragraph, but it probably wants to read more like the text for plane 2.

Date/Time: Thu Oct 13 07:15:03 CDT 2022
Name: Mark Longley
Report Type: Error Report
Opt Subject: The Unicode® Standard Version 15.0 – Core Specification

There is a typographical error in Chapter 22 Symbols in section 22.10 Enclosed 
and Square in subsection Enclosed Alphanumeric Supplement: U+1F100–U+1F1FF in 
subsubsection Creative Commons License Symbols on page 910.

The first of the two character code ranges is given as “U+1F10D..U+1F10FF” when 
the end of this range should in fact be “U+1F10F”, i.e. there is a spurious 
duplicated terminal ‘F’ hexadecimal digit.
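
A quick arithmetic check shows why the printed range cannot be right. This is a sketch only: the `parse_range` and `in_block` helpers are hypothetical, and the block limits are taken from the subsection title quoted above.

```python
MAX_CODE_POINT = 0x10FFFF
BLOCK = (0x1F100, 0x1F1FF)  # Enclosed Alphanumeric Supplement, per the subsection title


def parse_range(text):
    """Parse 'U+XXXX..U+YYYY' into a (start, end) pair of integers."""
    start, end = (int(part[2:], 16) for part in text.split(".."))
    return start, end


def in_block(code_range):
    """True if the whole range lies inside BLOCK."""
    return BLOCK[0] <= code_range[0] <= code_range[1] <= BLOCK[1]


printed = parse_range("U+1F10D..U+1F10FF")
fixed = parse_range("U+1F10D..U+1F10F")

print(printed[1] > MAX_CODE_POINT)          # end point lies beyond the Unicode codespace
print(in_block(printed), in_block(fixed))
```

The printed end point exceeds U+10FFFF, so it is not a valid code point at all, while the corrected range falls inside the block.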

Other Reports

(None at this time.)