L2/22-123

Comments on Public Review Issues
(April 11 - July 11, 2022)

The sections below contain links to permanent feedback documents for the open Public Review Issues, as well as other public feedback received as of July 11, 2022, since the previous cumulative document was issued prior to UTC #171 (April 2022).

Contents:

The links below go directly to open PRIs and to feedback documents for them, as of July 11, 2022.

Issue Name Feedback Link
457 Proposed Update UAX #42, Unicode Character Database in XML (feedback) No feedback at this time
455 Proposed Update UTS #46, Unicode IDNA Compatibility Processing (feedback) No feedback at this time
454 Proposed Update UTS #51, Unicode Emoji (feedback)
453 Unicode 15.0.0 Beta (feedback)
452 Proposed Update UAX #15, Unicode Normalization Forms (feedback) No feedback at this time
451 Proposed Update UTS #39 Unicode Security Mechanisms (feedback)
450 Proposed Update UAX #31 Unicode Identifier and Pattern Syntax (feedback)
449 Proposed Update UAX #9, Unicode Bidirectional Algorithm (feedback)
448 Proposed Update UAX #41, Common References for Unicode Standard Annexes (feedback) No feedback at this time
447 Proposed Update UAX #24, Unicode Script Property (feedback) No feedback at this time
446 Proposed Update UAX #14, Unicode Line Breaking Algorithm (feedback)
445 Proposed Update UAX #45, U-source Ideographs (feedback)
444 Proposed Update UAX #34, Unicode Named Character Sequences (feedback) No feedback at this time
443 Unicode Emoji 15.0 Draft Candidates (feedback)
442 Unicode 15.0 Alpha Review (feedback)
441 Proposed Update UAX #29, Unicode Text Segmentation (feedback)
440 Proposed Update UTS #10, Unicode Collation Algorithm (feedback)
439 Proposed Update UAX #50, Unicode Vertical Text Layout (feedback) No feedback at this time
438 Proposed Update UAX #44, Unicode Character Database (feedback)
437 Proposed Update UAX #38, Unicode Han Database (Unihan) (feedback)
434 CLDR Person Name Formatting (feedback)

The links below go to locations in this document for feedback.

Feedback routed to CJK & Unihan Group for evaluation [CJK]
Feedback routed to Script ad hoc for evaluation [SAH]
Feedback routed to Properties & Algorithms Group for evaluation [PAG]
Feedback routed to Emoji SC for evaluation [ESC]
Feedback routed to Editorial Committee for evaluation [EDC]
Other Reports



Feedback routed to CJK & Unihan Group for evaluation [CJK]

Date/Time: Wed May 4 02:13:35 CDT 2022
Name: Jaemin Chung
Report Type: Error Report
Opt Subject: Radical-stroke value for U+2C4F8

The radical-stroke value for U+2C4F8 𬓸 should be changed from the current 
115.10 (radical 禾) to 202.3 (radical 黍).
cf. U+4D58 䵘 202.9

Date/Time: Wed May 4 16:30:12 CDT 2022
Name: Lee Collins
Report Type: Error Report
Opt Subject: Unihan_Readings.txt

Note: This issue was resolved during UTC #170.

 U+7550 kDefinition is "to fill; a foll of cloth". I cannot find a word 
 "foll" in this sense in the English dictionaries I looked at. Perhaps 
 it is an older usage. Or, maybe it is a typo for "roll". Kangxi says 
 that U+7550 is the same as U+5E45  幅 and defines it as "布帛廣也". 
 Perhaps "width of cloth" is a better definition.

Feedback routed to Script ad hoc for evaluation [SAH]

Date/Time: Wed Apr 13 16:38:50 CDT 2022
Name: Asmus
Report Type: Error Report
Opt Subject: TUS Chapter 14, section Phags-Pa

(1) I stumbled over a bit of editorial convention that, while correct,
led me astray.

(2) It looks like there's a loosely worded bit that's not actually correct.

(1) When I just now opened the section at random, it took me a while to
mentally switch gears and realize that "letter o" in the passage quoted
below was the Phags-pa letter. The conventions are all clear, if you know
them, but 'o' is unfortunately not giving any internal hint that it's
derived from a transcription. Wish there was something unobtrusive to help
guide the reader. (It didn't help that I had "letter o" - the Latin one -
on my mind from some other project).

Perhaps add the script name here, even if redundant?

---


    The invisible format characters U+200D ZERO WIDTH JOINER (ZWJ) and
    U+200C ZERO WIDTH NON-JOINER (ZWNJ) may be used to override the
    expected shaping behavior, in the same way that they do for Mongolian
    and other scripts (see Chapter 23, Special Areas and Format
    Characters). For example, ZWJ may be used to select the initial,
    medial, or final form of a letter in isolation:

    <U+200D, U+A861, U+200D> selects the medial form of the letter o

    <U+200D, U+A861> selects the final form of the letter o

    <U+A861, U+200D> selects the initial form of the letter o

---

(2) More importantly there seems to be something possibly misstated here:

    "Conversely, ZWNJ may be used to inhibit expected shaping. For example,
     the sequence <U+A85E, U+200C, U+A85F, U+200C, U+A860, U+200C,
     U+A861> selects the isolate forms of the letters i, u, e, and o."

It should be the case that the isolate forms for 'i' and 'o' in this
example are only selected if they don't join with surrounding characters
across the boundaries of the sequence. (There's nothing in the definition
of a sequence that prevents it from being embedded in other text.) (I can't
be sure, but from the table it looks like all vowels are dual-joining.)

It looks like there's an implicit assumption in the text that the sequence
is standalone.
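For readers following along, the joining-control sequences quoted in this report can be written out explicitly. A minimal sketch (the form names follow the quoted text; U+A861 is the Phags-pa letter transcribed 'o'):

```python
# Sketch of the quoted Phags-pa shaping-control sequences.
ZWJ = "\u200D"  # ZERO WIDTH JOINER
O = "\uA861"    # PHAGS-PA LETTER O

medial_form = ZWJ + O + ZWJ   # <U+200D, U+A861, U+200D>
final_form = ZWJ + O          # <U+200D, U+A861>
initial_form = O + ZWJ        # <U+A861, U+200D>
```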

Date/Time: Sat Apr 16 08:57:23 CDT 2022
Name: David Corbett
Report Type: Other Document Submission
Opt Subject: Chorasmian number seven

How should the Chorasmian number seven on page 47 of L2/18-164R2 be 
encoded? There is no obvious gap or longer stroke. It is therefore not 
clear how to use U+10FC5..U+10FC8 to represent it, or even whether
it can be encoded in Unicode.

Date/Time: Fri Apr 22 20:34:20 CDT 2022
Name: David Corbett
Report Type: Other Document Submission
Opt Subject: Dives Akuru line breaking

This is feedback on L2/22-080R. Another script with line breaks between 
orthographic syllables is Dives Akuru. L2/18-016R “Proposal to encode Dives 
Akuru in Unicode” says “A word may be broken along orthographic syllables 
at any position at the end of a line.” U+1193F and U+11941 would get lb=AP, 
the letters including the independent vowels would get lb=AK, and U+1193E would get lb=VI.

Date/Time: Tue Apr 26 12:15:07 CDT 2022
Name: Sławomir Osipiuk
Report Type: Other Document Submission
Opt Subject: Feedback on L2/22-092 (Proposal to add the currency sign for the POLISH ZŁOTY to the UCS)

I would like to offer additional information which may be of interest to the
submitter of L2/22-092.

The original proposal omits, to its detriment, that the single-character
złoty symbol is also present in the 7-bit character set specified by Polish
national standard BN-74/3101-01. As a national standard, this may have more
persuasive power for the inclusion of this character, and the submitter may
want to amend the proposal to include this information.

Additionally of potential interest, BN-74/3101-01, being a national version
of the 7-bit character set conforming to ISO 646, would seem a natural
addition to the ISO International Register of Coded Character Sets per ISO
2022 and ISO 2375 (currently managed by the ITSCJ:
https://www.itscj-ipsj.jp/english.html). However, BN-74/3101-01 was never
added to the Register for reasons I am not aware of (and the Register
itself has not seen any additions since the year 2004). If this character
set had been added in the past, then inclusion of the złoty symbol in
Unicode/ISO 10646 would have been very likely.

Date/Time: Wed Apr 27 22:09:17 CDT 2022
Name: David Corbett
Report Type: Other Document Submission
Opt Subject: Unclear phrasing re complex quadrats

Section 11.4 says “Sometimes a portion of a graphically complex quadrat
could be identified as an atomically encoded character. However, in cases
where the use of that atomically encoded character as a component of a
quadrat sequence would cause ambiguities or uneven distribution in the
structure, then a sequence of simpler hieroglyphs should be used instead,
with the appropriate joining controls.” This implies that there exist four
contexts for an atomically encoded character.

1. Not in a quadrat sequence
2. Causing ambiguities in a quadrat sequence
3. Causing uneven distribution in a quadrat sequence
4. In a quadrat sequence without any problems

Does the fourth context really exist? Does it ever make sense to put an
atomically encoded character in a quadrat sequence? I don’t think so: I
think the quoted passage means that atomically encoded characters in
quadrat sequences should always be avoided, because they are always either
ambiguous or uneven. However, that is not actually what it says. That
sentence should be reworded to something stronger: changing “in cases
where” to “because” would fix it.

Alternatively, if the fourth context does exist, it would be helpful for the
standard to provide an example.

Date/Time: Wed May 18 02:41:15 CDT 2022
Name: Charlotte Buff
Report Type: Other Document Submission
Opt Subject: Issue with precomposed Todhri characters (L2/22-074)

The recently approved Todhri script (cf. L2/20-188r: Everson, „Proposal for
encoding the Todhri script in the SMP of the UCS“) includes two letters
that are formed from a base letter plus a dot diacritic: *U+105C9 TODHRI
LETTER EI and *U+105E4 TODHRI LETTER U. Per consensus 171-C17, it was
decided to encode these as precomposed characters with canonical
decompositions featuring U+0307 COMBINING DOT ABOVE, as was suggested in
L2/22-074 (Pournader, „Todhri encoding options“).

However, this approach is not possible to implement as originally intended.
According to section 5.1 of UAX #15, Unicode Normalization Forms:

    »A canonical decomposable character *must* be added to the list of
    post composition version exclusions when its decomposition mapping
    is defined to contain at least one character which was already
    encoded in an earlier version of the Unicode Standard.«

Because COMBINING DOT ABOVE is already encoded, using it as the dot
diacritic for Todhri would necessitate adding TODHRI LETTER EI and TODHRI
LETTER U to the list of composition exclusions, meaning these two
characters could never appear in normalised text. This would make their
existence as precomposed characters rather superfluous.
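The effect described in this report can be observed today with an already-encoded composition exclusion. A minimal illustration using Python's stdlib (U+0958 is a real exclusion whose decomposition contains the previously encoded U+093C, so it is structurally analogous to the proposed Todhri letters):

```python
import unicodedata

# U+0958 DEVANAGARI LETTER QA canonically decomposes to
# <U+0915 DEVANAGARI LETTER KA, U+093C DEVANAGARI SIGN NUKTA>.
# Because it is on the composition exclusion list, NFC never
# re-composes it -- the precomposed character cannot appear in
# normalized text, exactly the situation described for Todhri.
qa = "\u0958"
nfd = unicodedata.normalize("NFD", qa)
nfc = unicodedata.normalize("NFC", qa)
```

Here `nfc` equals the decomposed pair <U+0915, U+093C>, not U+0958.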

Feedback routed to Properties & Algorithms Group for evaluation [PAG]

Date/Time: Sun Apr 17 12:41:19 CDT 2022
Name: Karl Williamson
Report Type: Other Document Submission
Opt Subject: NonBidiMirroring.txt

https://www.unicode.org/L2/L2022/22026-non-bidi-mirroring.pdf is a 
proposal from Kent Karlsson for creation of this UCD file

I saw that a proposed response to it was that it was "speculative".

I can tell you that Perl 5 has already had to work around the absence
of such information in the UCD, and the presence of this file would be
helpful going forward.

The issue for us is delimiters surrounding string-like constructs.
These constructs include literal text, and regular expression patterns, 
among others.  Perl has long allowed one to use any of 4 pairs of 
delimiters for these, like 
  qr(this is a pattern)

The 4 sets are () <> {} []. These stem from before Unicode came along, and
now Unicode has added hundreds of potential such delimiters. We've had
longstanding requests to support these, and the next release of Perl will
add many of them. It would have been better to have used this proposed file
if it had existed; I did go looking for something suitable, to no avail. It
would be better in the future to use this file, as it gets updated to
correspond with new Unicode versions.
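For context on what the UCD currently offers here: the boolean Bidi_Mirrored property (exposed in Python's stdlib) flags individual mirroring characters, but it does not pair an opening delimiter with its closing counterpart; BidiBrackets.txt covers only bracket pairs used by the bidi algorithm. That is the gap the proposed file would fill. A minimal illustration:

```python
import unicodedata

# Bidi_Mirrored is a per-character boolean; nothing in the core UCD
# maps '(' to ')' as an opening/closing delimiter pair.
print(unicodedata.mirrored("("))  # 1 -> mirrored
print(unicodedata.mirrored("a"))  # 0 -> not mirrored
```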

Date/Time: Fri Apr 22 12:02:13 CDT 2022
Name: Tim Pederick
Report Type: Error Report
Opt Subject: tr15-51.html

UAX #15, §1.2 Normalization Forms, says of figures 3 to 6 that
"[f]or consistency, all of these examples use Latin characters". This is 
not true of figure 3, in which the second example uses only the Greek
characters U+2126 and U+03A9. (And to be pedantic, figure 5 has an example
with only the Common characters U+0032, U+2075, and U+0035.)

I don't propose replacing the examples with ones that do use Latin
characters, but rather changing the note itself, or even removing it. I'm
not really sure what is meant by "for consistency"; is it
really "inconsistent" to use non-Latin examples? Is the intent of the note
to head off complaints of Latin-script parochialism?
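The Figure 3 example mentioned in this report is easy to verify from the stdlib's copy of the UCD: U+2126 OHM SIGN is a singleton that normalizes to the Greek capital omega under every normalization form.

```python
import unicodedata

# U+2126 OHM SIGN canonically decomposes to U+03A9 GREEK CAPITAL
# LETTER OMEGA, so all four normal forms yield the Greek letter.
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    assert unicodedata.normalize(form, "\u2126") == "\u03A9"
```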

Date/Time: Tue May 3 05:59:18 CDT 2022
Name: Henri Sivonen
Report Type: Error Report
Opt Subject: DUCET

https://github.com/unicode-org/cldr/blob/main/common/collation/he.xml has
the following tailoring (apart from script reordering):

    &[before 2]''<<׳ # GERESH just before APOSTROPHE (secondary difference)
    &[before 2]'\"'<<״ # GERSHAYIM just before QUOTATION MARK (secondary difference)

The other Hebrew-script language in CLDR, Yiddish, has this same 
tailoring (and further tailorings).

https://github.com/unicode-org/cldr/blob/main/common/collation/yi.xml 

It seems generally unfortunate, both from the user perspective and from the
binary size perspective of shipping an implementation, when a language
requires a tailoring even though its tailoring doesn't collide with the
needs of other languages in CLDR. By hoisting this tailoring into DUCET,
Hebrew could use the root collation with script reordering, like, for
example, Greek and Georgian. The handling of й/Й in the Cyrillic script in
DUCET looks like a precedent for hoisting collation complexity shared by
merely the majority (not even all) of languages for a script into DUCET. In
this case, the tailoring applies to both languages for the script.

(I'm filing this about DUCET as opposed to filing this about CLDR root,
because CLDR root seeks to minimize differences from DUCET.)

Date/Time: Tue May 3 06:00:27 CDT 2022
Name: Henri Sivonen
Report Type: Error Report
Opt Subject: DUCET

https://github.com/unicode-org/cldr/blob/main/common/collation/hy.xml  
has the following tailoring (apart from script reordering):

&ք<և<<<Եւ

There are no other Armenian-script languages in CLDR.

It seems generally unfortunate, both from the user perspective and from the
binary size perspective of shipping an implementation, when a language
requires a tailoring even though its tailoring doesn't collide with the
needs of other languages in CLDR. By hoisting this tailoring into DUCET,
Armenian could use the root collation with script reordering, like, for
example, Greek and Georgian. The handling of й/Й in the Cyrillic script in
DUCET looks like a precedent for hoisting collation complexity shared by
merely the majority (not even all) of languages for a script into DUCET. In
this case, the tailoring applies to the only language for the script.

(I'm filing this about DUCET as opposed to filing this about CLDR root,
because CLDR root seeks to minimize differences from DUCET.)

Date/Time: Thu May 5 19:38:08 CDT 2022
Name: Karl Wagner
Report Type: Error Report
Opt Subject: UTS #46: UNICODE IDNA COMPATIBILITY PROCESSING

UTS #46

Version: 14.0.0
Date: 2021-08-24
Revision: 27
URL: https://www.unicode.org/reports/tr46/ 

---

I only just started writing my own implementation of this recently, so
apologies if I'm misunderstanding, but there are two locations where
code-points are checked. Using the same format as the IdnaTestV2.txt file
for describing those locations, they would be P1 and V6 ("Processing" step
1, and "Validation" step 6).

- P1 is applied to the entire domain, as given. So it may see
  (decoded) Unicode text, or Punycode. It takes the value of
  UseSTD3ASCIIRules into account, so a domain like "≠ᢙ≯.com" triggers the
  error at P1 only if UseSTD3ASCIIRules=true, because it contains a
  code-point which STD3ASCIIRules disallows. "xn--jbf911clb.com" will never
  trigger the error at this location, regardless of UseSTD3ASCIIRules,
  because it is just ASCII and hasn't been decoded yet.

- V6 is applied to the result of Punycode-decoding a domain label, so it
  will only see decoded Unicode text. As written, it would appear **not**
  to take UseSTD3ASCIIRules into consideration, meaning that both
  (original inputs) "≠ᢙ≯.com" and "xn--jbf911clb.com" would trigger errors
  at this location, regardless of UseSTD3ASCIIRules.

Here is the text of Section 4.1, Validity Criteria
( https://www.unicode.org/reports/tr46/#Validity_Criteria ), Step 6:

> Each code point in the label must only have certain status values according to Section 5, IDNA Mapping Table:
> - For Transitional Processing, each value must be valid.
> - For Nontransitional Processing, each value must be either valid or deviation.

It is not clear whether these status values are supposed to take the value
of UseSTD3ASCIIRules into account. As described above, if this step does
not consider UseSTD3ASCIIRules, "≠ᢙ≯.com" and "xn--jbf911clb.com" will
always be invalid domains. This leads me to believe that it **should**
respect UseSTD3ASCIIRules, otherwise the parameter would be meaningless; it
does not matter that P1 considers UseSTD3ASCIIRules, because it will be
caught by V6 later anyway. 

I'll have to apologise again because I am not very familiar with the
codebases I am about to cite, but from what I can glean this is actually
causing confusion in practice:

- The unicode-org implementation of IDNA does not appear to consider
  UseSTD3ASCIIRules here:
  https://github.com/unicode-org/unicodetools/blob/main/unicodetools/src/main/java/org/unicode/idna/Uts46.java#L610-L625 

- This appears to be confirmed by the IdnaTestV2 file. For example, Version
  14.0.0 (Date: 2021-08-17, 19:34:01 GMT) lines 571 and 573:

[571] xn--jbf911clb.xn----p9j493ivi4l; ≠ᢙ≯.솣-ᡴⴀ; [V6]; xn--jbf911clb.xn----p9j493ivi4l; ; ;  # ≠ᢙ≯.솣-ᡴⴀ
[573] xn--jbf911clb.xn----6zg521d196p; ≠ᢙ≯.솣-ᡴႠ; [V6]; xn--jbf911clb.xn----6zg521d196p; ; ;  # ≠ᢙ≯.솣-ᡴႠ

"V6" is not an optional validation step tied to any parameter; it does not
appear to be something implementations can decide whether or not to apply.
It always applies, and these domains should always be considered invalid,
if I understand the tests correctly.

- The JSDOM implementation does consider UseSTD3ASCIIRules and considers
  these to be valid domains:
  https://github.com/jsdom/tr46/blob/e937be8d9c04b7938707fc3701e50118b7c023a5/index.js#L100 

- Browsers effectively do the same in URLs. Safari 15 and JSDOM both
  consider "http://≠ᢙ≯.com.xn--jbf911clb" to be a perfectly fine URL:
  https://jsdom.github.io/whatwg-url/#url=aHR0cDovL+KJoOGimeKJry5jb20ueG4tLWpiZjkxMWNsYg==&base=YWJvdXQ6Ymxhbms= 

So I think it is worth adding an explicit mention of UseSTD3ASCIIRules and
whether or not it applies to the mapping table lookup from step V6.

Thanks,

Karl
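The question raised in the report above can be sketched in a purely illustrative way (all function names and the code-point set here are hypothetical stand-ins, not part of any real IDNA library): if the mapping-table status consulted at V6 does not vary with UseSTD3ASCIIRules, the parameter can never rescue a label containing STD3-disallowed characters.

```python
# Hypothetical sketch only -- not a real UTS #46 implementation.
# STD3_DISALLOWED stands in for code points whose mapping-table status
# depends on the UseSTD3ASCIIRules parameter.
STD3_DISALLOWED = set("\u2260\u226F")  # U+2260 '≠', U+226F '≯'

def status(cp, use_std3_ascii_rules):
    """Simplified per-code-point status lookup."""
    if cp in STD3_DISALLOWED:
        # The ambiguity: if V6 ignored UseSTD3ASCIIRules, this branch
        # would always return "disallowed" and the parameter would be
        # meaningless, as the report argues.
        return "disallowed" if use_std3_ascii_rules else "valid"
    return "valid"

def v6_label_ok(label, use_std3_ascii_rules):
    """Validity Criteria step 6: every code point must be valid."""
    return all(status(cp, use_std3_ascii_rules) == "valid" for cp in label)
```

Under this reading, the label "≠ᢙ≯" passes V6 when UseSTD3ASCIIRules is false and fails it when true; a status lookup that ignored the flag would fail it unconditionally.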

Date/Time: Tue May 31 21:17:28 CDT 2022
Name: David Corbett
Report Type: Other Document Submission
Opt Subject: Unclear namespace in UTS #18

UTS #18 says “The namespace for the \p{name=...} syntax is the namespace for
character names plus name aliases.” This could be misinterpreted to mean
that that namespace excludes code point labels, even though code point
labels are discussed earlier in that section. It would be clearer to
say “The namespace for the \p{name=...} syntax is the Unicode namespace for
character names”, using the term defined in UAX34-D3, which in its next
version will mention code point labels.
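The distinction matters in practice: a lookup over the combined namespace accepts both primary character names and formal name aliases. Python's stdlib illustrates this (since 3.3, `unicodedata.lookup()` resolves aliases from NameAliases.txt as well as primary names):

```python
import unicodedata

# Primary character name:
zwj = unicodedata.lookup("ZERO WIDTH JOINER")  # U+200D
# Formal name alias (U+FEFF's primary name is ZERO WIDTH NO-BREAK SPACE):
bom = unicodedata.lookup("BYTE ORDER MARK")    # U+FEFF
```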

Feedback routed to Emoji SC for evaluation [ESC]

(None at this time.)


Feedback routed to Editorial Committee for evaluation [EDC]

Date/Time: Fri Apr 22 11:11:19 CDT 2022
Name: Tim Pederick
Report Type: Error Report
Opt Subject: UnicodeData.txt

U+33D7 SQUARE PH has a compatibility decomposition mapping of <U+0050
LATIN CAPITAL LETTER P, U+0048 LATIN CAPITAL LETTER H>.

This character would appear to be intended to represent the pH measurement
in chemistry, and as such the mapping should have had different letter
case: <U+0070 LATIN SMALL LETTER P, U+0048 LATIN CAPITAL LETTER H>.

The Strong Normalization Stability policy says that this cannot be changed,
and perhaps it is sufficiently trivial to be beneath notice, but perhaps it
could be documented?
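The published mapping is easy to confirm from the stdlib's copy of the UCD, and under the stability policy it will keep producing the uppercase pair:

```python
import unicodedata

# UnicodeData.txt decomposition field for U+33D7: <square> 0050 0048
print(unicodedata.decomposition("\u33D7"))      # '<square> 0050 0048'
# Compatibility normalization therefore yields "PH", not "pH":
print(unicodedata.normalize("NFKC", "\u33D7"))  # 'PH'
```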

Date/Time: Fri Apr 22 20:39:14 CDT 2022
Name: David Corbett
Report Type: Other Document Submission
Opt Subject: Diaeresis on capital Armenian letters

Chapter 7 says “In Armenian dialect materials, U+0308 COMBINING DIAERESIS,
appears over uppercase U+0531 ayb and lowercase U+0561 ayb, and lowercase
U+0585 oh and U+0578 vo.” Because all caps is used in Armenian, it appears
over uppercase U+0555 oh and U+0548 vo too.

http://www.nayiri.com/imagedDictionaryBrowser.jsp?dictionaryId=101&dt=HY_HY&pageNumber=577 

has an example with U+0548 in the second headword of the third column and
an example with U+0555 in the fourth headword of the third column; the
diacritic looks like U+030F but it’s probably just U+0308. Chapter 7 should
say that U+0308 is used with all six of these bases.

Also, the comma after “DIAERESIS” should be removed.

Other Reports

Date/Time: Wed Jul 6 10:01:07 CDT 2022
Name: Deborah Anderson
Report Type: Other Document Submission
Opt Subject: Sunuwar chart glyph error

Neil Patel noticed that the glyphs for 11BD2 SUNUWAR LETTER SHYELE 
and 11BDC SUNUWAR LETTER SHYER were swapped in the Sunuwar code chart 
(p. 14 of L2/21-157R). Cf. p. 7 of the proposal, where the glyphs 
are correct.

The correct glyphs appear in Michel Suignard's ISO/IEC 10646 
repertoire proposals post Amd1 (WG2 N5181).

The UTC accepted Sunuwar based on L2/21-157R. I recommend the 
UTC go on record noting the error in the code chart in L2/21-157R, 
noting that the correct glyphs appear in WG2 N5181.