Comments on Public Review Issues
(April 28 - July 23, 2019)

The sections below contain links to permanent feedback documents for the open Public Review Issues as well as other public feedback as of April 28, 2019, since the previous cumulative document was issued prior to UTC #159 (April 2019).


The links below go directly to open PRIs and to feedback documents for them, as of July 18, 2019.

Issue Name Feedback Link
400 Proposed Update UAX #38, Unicode Han Database (Unihan) (feedback)
399 Proposed Update UAX #45, U-source Ideographs (feedback)
398 Proposed Update UAX #44, Unicode Character Database (feedback) No feedback at this time
397 Proposed Draft UTR #54, Unicode Mongolian 12.1 Baseline (feedback) No feedback at this time
396 Proposed Update UAX #29, Unicode Text Segmentation (feedback)
395 Proposed Update UAX #15, Unicode Normalization Forms (feedback) No feedback at this time

The links below go to locations in this document for feedback.

Feedback to UTC / Encoding Proposals
Feedback on UTRs / UAXes
Error Reports
Other Reports

Note: The Feedback on Encoding Proposals section this time includes comments on:
L2/10-345  L2/17-236  L2/17-300  L2/17-326  L2/18-182  L2/18-183  L2/18-198  L2/18-242  L2/19-005R  L2/19-091  L2/19-172  L2/19-199  L2/19-203 


Feedback to UTC / Encoding Proposals

Date/Time: Tue May 7 15:58:33 CDT 2019
Name: David Corbett
Report Type: Feedback on an Encoding Proposal
Opt Subject: Comments on L2/19-005R

L2/19-005R “Proposal to encode ORIYA SIGN OVERLINE in the UCS” should
explain where to put the proposed code point in a syllable. I assume it is
either meant to immediately follow the vowel sign or to immediately follow
the base consonant.

Either way, Indic_Syllabic_Category=Vowel_Dependent is not appropriate for
this character: above-base dependent vowels are encoded between pre- and
post-base dependent vowels, so the overline would have to follow U+0B47
ORIYA VOWEL SIGN E but precede U+0B3E ORIYA VOWEL SIGN AA. It should instead
have InSC=Nukta or Syllable_Modifier, depending on where it is meant to go.

The proposal claims that “if the combining macron were to be used, it would
not be supported in the general Indic rendering system implementation
requirement. If the combining macron were used, script runs in Oriya would
break”. That is not true. U+0304 COMBINING MACRON is a common-script
character so it does not break script runs. There is no Indic rendering
system requirement that all marks be Indic-specific. For example, U+20F0
COMBINING ASTERISK ABOVE is used in Devanagari without any rendering system
problems. Therefore that argument should be removed from the proposal.

Date/Time: Sat May 25 11:43:13 CDT 2019
Name: William Overington
Report Type: Feedback on an Encoding Proposal
Opt Subject: Five requested items of feedback for L2/19-203 Working Draft for Proposed Update UTS #51, Unicode Emoji

In L2/19-203 Working Draft for Proposed Update UTS #51, Unicode Emoji there
is a request for feedback on five specific issues.

Here is my feedback.

Issue 1: Length

I suggest using the direct method of the tag digits in the uncompressed
format. Although compression might save 30% to 40% of bytes for one QID
emoji character on its own, the saving in bytes as a percentage of a whole
document could be much lower, depending upon how many QID emoji are used in
a particular document.

In my test fonts I have used large bold glyphs for the TAG Q and for the tag
digits and a visible glyph for the CANCEL TAG. I found entering them from
the Glyph Browser facility in the Serif Affinity Publisher (beta)
OpenType-aware software program to be straightforward. I appreciate that the
intention may be for entry of QID emoji into a document typically from a
cascading menu system but that might not always be available in every
application. Using large visible glyphs for just twelve tag characters is
convenient for fontmaking and for use.

In my opinion using a compressed format is simply adding a layer of
complication for a relatively minimal overall saving of bytes. I opine that
it is best to keep the system as simple to use as possible.

Issue 2: Tag Base

I suggest having one standardized tag base.

Regarding the "first mover" effect that is mentioned as arising if a variety
of tag bases are used: that might sound fine if the parties involved are all
large companies and they all meet as full members of Unicode Inc., but what
if, say, a small European company is the first mover and some time later a
large American corporation decides that it is not willing to allow any
implied recognition of the small European company and does something
different? Suppose then that Unicode Inc. is put in the position of deciding
which tag base is to be used for that QID emoji. Unicode Inc. might then
find itself in a very awkward situation, particularly if each of the two
businesses felt that it was in the right, either on the basis of being first
or on the basis of having a very much larger share of the market. What
happens to interoperability if both of the businesses carry on using the
base character of their choice?

That is just one scenario for one QID emoji. If that or other issues
happened for the choice of base character for many QID emoji then there
could be much confusion.

Although having a fallback character could be helpful in some circumstances
it could also introduce uncertainty and confusion over the meaning that the
author of a document intended.

I therefore suggest that a single standardized base character be used.

Issue 3: Sequences

It seems to me that it would be better to use the first method suggested and
keep it all within UTS #51. It is an added facility that would then be
clearly explained within UTS #51. This would make it easier to understand
for people learning the system and for those who are not within the central
group of people habitually working with the documents.

Issue 4: Registry

The idea of the registry is attractive and could be useful. Yet what would
the term "in use" mean? For example, suppose that one Wednesday afternoon,
and on a few other occasions, some university students have a go at
specifying some QID items and at designing and producing some QID emoji,
complete with some fonts, maybe of such things as a statue that is on campus
or a few statues and so on from the local town. They send some messages to a
few newsgroups, add some images to a website, have a good learning
experience and some fun doing so, and then that is that and they move on to
other things, though they might go back to it later. Would Unicode include
those QID emoji in the registry? One of the students might have learned of
the registry and sent them in requesting inclusion. If the university were
in the United Kingdom then the fonts might also have been deposited at the
British Library under legal deposit. So would they count for inclusion in
the registry?

So if Unicode Inc. were to have a registry, it would need to decide whether
to include absolutely every QID emoji ever produced or to have a threshold
of some kind, and having a threshold means that there will be edge cases.
Maintaining such a registry could be a lot of work.

However, maybe the registry could take the form of a wiki hosted by Unicode
Inc. People could register their own QID emoji and a structured list could
emerge (for example: animals, dogs), with Unicode Inc. keeping a watchful
eye on it, perhaps with a moderation system for contributions to avoid major
problems such as someone trying to wipe everything.

Issue 5: Limiting the RGI emoji tag sequences set additions

What is the issue here? The whole point of QID emoji is that they allow
anyone to encode any emoji they choose. The RGI list is really for the
manufacturers of equipment and does not affect people just having a go and
enjoying themselves.

I liken this to font provision on major platforms. Manufacturers supply a
selection of fonts to enable people to do lots of things. However, the
underlying mechanism of how fonts are handled means that if someone wants to
buy a licence for another font from a small business and use it on his or
her computer, the underlying architecture of the computer system allows
that font to be added and used. Indeed, the way that Windows 10 works is
that if I make a font myself, not as a commercial venture, just as a sort of
hobby, using a fontmaking program, then I can install that font and use it
in a desktop publishing program to produce PDF documents. I can then publish
the PDF documents on the web and send them to the British Library for legal
deposit.

It could have been otherwise if computer systems had been designed
differently, with only the fonts for a particular program being usable with
that program. Then the chances of my making my own fonts and using them
would possibly have been non-existent.

So, to me, QID emoji is like me, or anyone, being able to produce and use,
interoperably, an emoji of my own specification, just as I can produce a
font of my own specification.

So a registry that helps consumers, so that they can rely on knowing that if
they buy a device from one manufacturer they can use the emoji on it to
communicate with people who use a device from another manufacturer, is fine,
even good.

However, please take care in designing such a system that there is not
effectively a block on interoperability with other QID emoji that are not in
that list. For example, I have suggested that there could be fonts that have
a glyph for just one QID emoji. Unicode Inc. could helpfully encourage
manufacturers to include software in their devices such that if a QID emoji
not in the RGI list is received then a search is made on the internet for
such a font, rather than just flagging that the particular QID emoji is not
supported.

William Overington

Saturday 25 May 2019

Date/Time: Tue May 28 18:18:36 CDT 2019
Name: Eduardo Marín Silva
Report Type: Feedback on an Encoding Proposal
Opt Subject: Ascia symbol disunification

Proposal L2/19-091 proposes to encode two symbols to represent an "ascia",
one left-facing and another right-facing. The ad hoc committee, however,
considered that a disunification of the two was not merited, for the
following reason:

"While examples are provided in running printed text, there is no
contrastive use showing distinct semantics of the two. In our view, only
one symbol needs to be encoded, until contrastive use in text demonstrating
the need for two symbols is provided."

This, however, is not a good reason for unification. Usage of both variants
of the symbol in the same source is attested in figures 10, 14, 15, 19 and
23 at least (some in the same inscription, even). This is no coincidence,
because the custom of preferring one layout or orientation over another is
information that may be of interest to historians, epigraphists and
philologists. Perhaps at this moment it isn't important to distinguish
them, but the same was said of many other scribal practices. It is clear
that the inscriptions bearing both versions did so for aesthetic reasons;
unifying both orientations makes it more likely that such information would
be lost.

It is a question of foresight, because if only one symbol is encoded now and
the committee changes its mind later, then fonts may not agree on their
preferred orientation for the symbol (since the committee ruled them to be
equivalent). This would result in headaches for font developers and users
alike. While this scenario is unlikely to cause major problems, another
possibility is that this is taken as precedent to unify reversed variants of
characters when they shouldn't be unified, instead of as a warning against
doing so.

For these reasons I maintain that the symbols should be disunified.

Date/Time: Tue May 28 19:09:47 CDT 2019
Name: Eduardo Marín Silva
Report Type: Feedback on an Encoding Proposal
Opt Subject: Cross Patty vs Maltese Cross

In proposal L2/19-076 Everson makes the case for two things:

 1. Disunify two cross-like glyphs (cross patty and Maltese cross) that are currently 
	unified under U+2720.
 2. Encode three more characters (two being cross patties with missing pieces 
	and another for the "true Maltese cross"), to clarify the confusion.

While the ad hoc committee did not recommend accepting any new character,
the committee accepted encoding the two cross patties with missing pieces.
I am in favor of the committee's decision; however, there are three other
steps that I would take:

 1. Rename the two relevant characters to: CROSS PATTY WITHOUT LEFT CROSSBAR and 
 2. Change the glyph of the Maltese cross to be that of the proposed "CROSS OF MALTA" 
	in Everson's proposal (this will require a glyph erratum notice).
 3. Encode a regular CROSS PATTY character with the glyph to be that of a "true" 
	cross patty, as expressed in the proposal.

The rationale for doing 1 is that these names are less ambiguous than the
current ones, since the main difference between these characters and a
regular cross patty is that they LACK pieces, not that they have extra ones.

For 2 and 3, it is the fact that the creators of the dingbats mistook
either the name or the glyph of the character, and such a mistake has been
passed down into the Unicode code charts, when such a situation is not
necessary. Asking the creator of the font would be somewhat useful;
unfortunately Hermann Zapf died in 2015, and so cannot be contacted to
clarify his intent or asked for his take. A possible alternative would be
to contact the International Typeface Corporation and/or other friends of
Zapf for their take. Type foundries could also be asked whether the change
of glyph would affect them much.

This differs from Everson's proposal in that it reduces the confusion of
having two very similarly named characters, at the cost of bothering font
developers. This is a better solution because, while glyphs are subject to
subsequent corrections, a character name cannot be changed (hence the
awkward formal alias system), and doing 2 while not doing 3 still benefits
citizens and historians of Malta who don't want the confusion to continue.
Doing 3 is still justified based on the attestations provided in the
proposal. This would mean that both glyphs would be represented, and one
would not need to find "true" Maltese crosses in running text, as the ad
hoc committee asked. So everybody wins in the end.

Evidence of the non-identity of both crosses (although sufficiently
demonstrated in the proposal) is further illustrated by the Wikipedia
articles on them both, which note their different origins and connotations,
as well as referencing the U+2720 character.

Date/Time: Tue May 28 19:58:14 CDT 2019
Name: Eduardo Marín Silva
Report Type: Feedback on an Encoding Proposal
Opt Subject: Retraction on my opinion of the nature of the THORN WITH DIAGONAL STROKE

In document L2/17-326 I presented my feedback on two encoding proposals, one
on the thorn with diagonal stroke (L2/17-236) and another for Tironian
letters (L2/17-300). Later, Michael Everson and Andrew West presented a
revised document focusing on just the main casing pair of the Tironian
letters (L2/19-172). This resulted in an updated response on my part
(L2/19-199), where I changed my mind and agreed with them that it makes
complete sense to have an orthographic casing pair, but still disagreed on
the exact encoding model. Since then I have exchanged correspondence with
Everson on the subject (West won't answer my emails, though, and Everson has
not responded to my request to add him to the chain); we are still in
disagreement, and while Everson provided some points I could address in an
updated proposal, he stopped responding to me. However, in that chain I
committed myself to submitting a contact form clarifying my updated view on
the THORN WITH DIAGONAL STROKE, hence this document.

Funnily enough, my change of opinion was prompted not by Everson's input but
by the insight of Peter Stokes in document L2/18-242 (it has just taken me
this long to express my change of opinion), in which he confirms that there
is indeed a consistent pattern of glyph distinction between Old Norse and
Old English, and that it is quite likely that scholarly publications would
want to include text in both languages, so the unification would prove
problematic for them. This effectively retires the last valid criticism
against its encoding, since there is plenty of precedent for glyphic
variants being disunified due to the preferences of distinct language
communities. Even if this semantic distinction was not created in the Middle
Ages, the convention is still rather old and the unification would still
remain problematic for medieval transcriptions.

Date/Time: Tue Jun 4 13:17:54 CDT 2019
Name: William Overington
Report Type: Feedback on an Encoding Proposal
Opt Subject: Emoji and Colour L2/19-203 and L2/18-198

In L2/19-203 there is section 2.9 Color.
In 2018 I submitted the document L2/18-198.
This was included in the Agenda document L2/18-182 as item E.1.7.1 but is
not mentioned in the Minutes document L2/18-183.
So when the Unicode Technical Committee considers section 2.9 of L2/19-203,
could it please consider whether the idea presented in L2/18-198 offers, in
its opinion, a better solution for the way to encode colour for emoji?
William Overington
Tuesday 4 June 2019

Date/Time: Tue Jun 4 18:37:44 CDT 2019
Name: Charlotte Buff
Report Type: Public Review Issue
Opt Subject: Feedback on Proposed QID Emoji Mechanism

I wanted to inform the UTC of some critical issues concerning the proposed
update to UTS #51 allowing emoji to be encoded as Wikidata QID tag
sequences. I fully agree with the feedback Andrew West provided in April
(cf. https://www.unicode.org/L2/L2019/19124-pubrev.html) but there are
additional points he did not address.

== Duplicate Encoding ==

If any object or concept with an associated QID can be represented as a tag
sequence, then that includes every object or concept that has already been
encoded as a regular emoji.

Mount Fuji is both the character U+1F5FB (🗻) and the sequence Q39231
(🆔󠁑󠀳󠀹󠀲󠀳󠀱󠁿); strawberries are both the character U+1F353 (🍓) and the sequence
Q14458220 (🆔󠁑󠀱󠀴󠀴󠀵󠀸󠀲󠀲󠀰󠁿). Unicode exists to transmit information in a
uniformly agreed‐upon format, so there must never be two different sequences
of codepoints representing the exact same concept unless that difference can
be folded away through normalisation. All QID sequences would be valid and
official by default even if nothing supported them; the model could not work
any other way. For the QID proposal to be usable, the standard would need to
explicitly disallow sequences that correspond to existing emoji (which
requires a steadily updated database linking all emoji to their QIDs), thus
any corporation or private person implementing such duplicate sequences
would be operating outside of the specifications.

This problem already exists to a lesser extent for emoji flags because some
regions are listed as part of both ISO 3166‐1 and 3166‐2, such as American
Samoa, which is either AS (🇦🇸) or US‐AS (🏴󠁵󠁳󠁡󠁳󠁿). It could be useful to list
such duplications in the standard to prevent overly eager implementations
from accidentally supporting two instances of the same flag at once.
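As background to the duplicate-encoding point, the mechanics are simple
enough to sketch. The following Python fragment is illustrative only (not
from any proposal); it shows how the two spellings arise, using the TAG
characters U+E0020..U+E007E and U+E007F CANCEL TAG.

```python
# Illustrative sketch: building emoji tag sequences from TAG characters.
# ASCII 0x20..0x7E maps onto U+E0020..U+E007E; U+E007F terminates.
TAG_BASE_ID = "\U0001F194"    # 🆔 SQUARED ID, the tag base in L2/19-203
TAG_BASE_FLAG = "\U0001F3F4"  # 🏴 WAVING BLACK FLAG
CANCEL_TAG = "\U000E007F"

def tag_spec(text: str) -> str:
    """Map ASCII text onto the corresponding TAG characters."""
    return "".join(chr(0xE0000 + ord(c)) for c in text)

def qid_sequence(qid: str) -> str:
    # A QID emoji per the proposed mechanism: base + tagged QID + cancel.
    return TAG_BASE_ID + tag_spec(qid) + CANCEL_TAG

def subdivision_flag(code: str) -> str:
    # A subdivision flag per UTS #51: base + lowercased region code + cancel.
    return TAG_BASE_FLAG + tag_spec(code.replace("-", "").lower()) + CANCEL_TAG

# Mount Fuji ends up with two spellings: U+1F5FB and the QID sequence.
fuji_qid = qid_sequence("Q39231")
samoa_flag = subdivision_flag("US-AS")
```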

== Stability ==

If any object or concept with an associated QID can be represented as a tag
sequence, then no such object or concept can ever be encoded as a regular
emoji.

The QID for almonds is Q184357. Therefore, the almond emoji already exists
(🆔󠁑󠀱󠀸󠀴󠀳󠀵󠀷󠁿) and just needs to be implemented by vendors. This means an
almond character or ZWJ sequence can never be added to Unicode Emoji in the
future or else there would be duplicate encoding again. It isn’t even
possible to add emoji for things that *don’t* have an associated QID because
there is always the possibility of one being created in the future.

You cannot fundamentally change the canonically correct representation of a
piece of text without invalidating all prior versions in the process. That
is why the Unicode standard guarantees absolute stability for many important
properties. If, say, Apple decided to support an almond emoji as a QID
sequence to “test the waters” so to speak, then people with Apple devices
would use this emoji just like any other; they would include it in e‐mails,
post it on Twitter or Facebook, send it to their friends on Android phones
and so on. And unlike private‐use characters, these sequences are officially
part of the standard; they exist in the public independently of any private
agreements. If it turns out that the almond emoji is popular enough to
consider it for inclusion in Unicode, then its only possible representation
would be as that specific QID sequence because plenty of data containing
that sequence already exists; there could never be a character called ALMOND
because it would either break existing data by replacing the QID sequence in
usage, or create a situation where the same emoji is encoded twice in
mutually incompatible ways. Searching and indexing any file containing emoji
would become potentially impossible.

The QID mechanism would mean that no emoji character could ever be added to
Unicode again, no ZWJ sequence could ever be approved, and no existing
character could ever be emojified because QID sequences already cover all of
them, or could cover them in the future. This includes the entire list of
candidates for Emoji 13.

== Fallback Display and Accessibility ==

The fallback behaviour of QID sequences, like for all emoji tag sequences,
is worthless.

The tag characters are invisible by design and were encoded for
language‐based font variant substitution, something that the Unicode
standard considers unnecessary for understanding the meaning of a text, so
taking these very same characters and making them the sole carriers of
semantic content in emoji sequences was an inappropriate idea from the
start. A user could have full font support for all characters in a given
sequence and still remain completely unaware that such a sequence was even
received because its only visible component is the tag base – the only part
that does not carry any information.

The recommendations for how invalid or unsupported tag sequences should be
displayed have been part of UTS #51 for as long as the concept of tag
sequences itself, but not a single font or text renderer has implemented any
of them. The closest vendors have gotten is newer versions of Android
displaying unknown regional flags as white flags with superimposed question
marks, which is still useless but slightly less so, because at least there
is a way to differentiate flag sequences from the plain WAVING BLACK FLAG.

This issue would be amplified by QID sequences because they would also have
a meaningless tag base in addition to an undetectable tag spec. At the very
least 🏴 actually is a flag even if it doesn’t say which one, but 🆔 is
nothing at all to the end user. The general public would have to learn that
seeing SQUARED ID in a message they received probably means that their
conversation partner used an emoji that their own device does not support,
an emoji that could depict literally anything at all with little to no clues
as to its identity. There isn’t even any way to differentiate one
unsupported QID emoji from another.

Using different tag bases depending on the entity in question is also
problematic as was already discussed in the review notes for the UTS #51
update. The whole point of relying on Wikidata was to ensure that every
concept would have one and only one canonical identifier; allowing variable
bases or just picking whatever base is implemented first as the official one
(the “first mover” approach) contradicts this paradigm completely.
Furthermore, if there exists an emoji that can serve as acceptable fallback
for a QID sequence, then the emoji represented by that sequence is probably
so similar to the bare tag base that it doesn’t need to be added anyway.

Screen readers would choke on QID sequences as well. With regional flags, an
advanced screen reader could in theory read out the tag spec of unrecognized
sequences to give the user some general idea of what was transmitted (“Flag:
G, B, N, I, R”) and maybe some people would even recognise the region code,
although I am not aware of any such software presently supporting this
approach. A really advanced tool could even have an internal look‐up table
for region codes (or just a subset of popular region codes) to read out the
region’s name even if font support does not exist. With QID sequences,
however, this would not work because QIDs are meaningless on their own.
Hearing “Q184357” read out loud would probably spark more confusion than
just leaving out the emoji entirely, and a database with tens of millions of
entries means that creating a sensible subset of items to support would be
very difficult.
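The screen-reader idea described above amounts to stripping a tag sequence
back to its ASCII tag spec. A minimal hypothetical sketch (not an existing
tool):

```python
# Hypothetical sketch: recover the ASCII tag spec from an unsupported
# emoji tag sequence so that a screen reader could spell it out.
def read_tag_spec(seq: str) -> str:
    # Keep only the TAG characters U+E0020..U+E007E and map them to ASCII.
    return "".join(chr(ord(c) - 0xE0000)
                   for c in seq if 0xE0020 <= ord(c) <= 0xE007E)

# The GBNIR flag sequence from the example above:
gbnir = "\U0001F3F4" + "".join(chr(0xE0000 + ord(c)) for c in "gbnir") + "\U000E007F"
spec = read_tag_spec(gbnir)   # "gbnir", announceable as "Flag: G, B, N, I, R"
```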

Feedback on UTRs / UAXes

Date/Time: Wed Apr 10 14:56:44 CDT 2019
Name: Andrey
Report Type: Error Report
Opt Subject: tr51

Ed Note: This has already been addressed in the UTS working draft. There is no open PRI at this time for UTS #51.

Could you give some explanation of the emoji EBNF and regex?
Why do the regex and EBNF use the '+' quantifier instead of '*'?
A basic emoji consisting of one code point will never match this regex.

  \p{RI} \p{RI}
| \p{Emoji} 
  ( \p{EMod} 
    | \x{FE0F} \x{20E3}? 
    | [\x{E0020}-\x{E007E}]+ \x{E007F} )?
  (\x{200D} \p{Emoji} 
    ( \p{EMod}
      | \x{FE0F} \x{20E3}? )?)+

possible_emoji :=
| zwj_element (
  (\x{200D} zwj_element)+
  | tag_modifier)
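The quantifier issue raised above can be reproduced with a simplified
stand-in pattern (an illustration only; Python's re module lacks \p{Emoji},
so a small code point range is substituted):

```python
import re

# Illustrative sketch of the '+' vs '*' question, with a tiny stand-in
# class instead of \p{Emoji}.
ZWJ = "\u200D"
EMOJI = "[\U0001F300-\U0001F5FF]"

with_plus = re.compile(f"{EMOJI}({ZWJ}{EMOJI})+")  # demands >= 1 ZWJ element
with_star = re.compile(f"{EMOJI}({ZWJ}{EMOJI})*")  # also matches a lone emoji

single = "\U0001F353"  # a basic one-code-point emoji
assert with_plus.fullmatch(single) is None       # never matches with '+'
assert with_star.fullmatch(single) is not None   # matches with '*'
```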

Date/Time: Wed May 8 07:11:34 CDT 2019
Name: Shinyu Murakami
Report Type: Error Report
Opt Subject: Line breaking should be possible between alphanumeric and fullwidth opening punctuation

I had reported this issue to Chromium Bugs and got the answer
"this is a UAX #14 issue".


> LB30 Do not break between letters, numbers, or ordinary symbols and
 opening or closing parentheses.
> (AL | HL | NU) × OP
> CP × (AL | HL | NU)
> The purpose of this rule is to prevent breaks in common cases where
 a part of a word appears between delimiters—for example, in 

I think the problem is that the OP of this rule includes all opening
punctuation. The solution would be to divide OP into OP1 and OP2, where OP1
includes normal "(" and "[" and OP2 includes fullwidth "(", "[", "「", etc.
(East_Asian_Width F, W and H), and to change the LB30 rule
"(AL | HL | NU) × OP" to "(AL | HL | NU) × OP1".
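The proposed split can be sketched directly from the Unicode property data;
the helper name below is illustrative, not part of UAX #14:

```python
import unicodedata

# Sketch of the proposed OP1/OP2 split: classify an opening-punctuation
# character as OP2 when its East_Asian_Width is F, W or H, so that LB30
# would no longer forbid a break before it.
def is_op2(ch: str) -> bool:
    return unicodedata.east_asian_width(ch) in ("F", "W", "H")

assert not is_op2("(")       # U+0028, Na: stays OP1, LB30 still applies
assert is_op2("\uFF08")      # U+FF08 FULLWIDTH LEFT PARENTHESIS, F
assert is_op2("\u300C")      # U+300C LEFT CORNER BRACKET, W
```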

Date/Time: Tue May 28 12:07:33 CDT 2019
Name: William Overington
Report Type: Feedback on an Encoding Proposal
Opt Subject: On the validity of an encoding of a QID emoji as mentioned in L2/19-203 Working Draft for Proposed Update UTS #51, Unicode Emoji

On the validity of an encoding of a QID emoji as mentioned in L2/19-203
Working Draft for Proposed Update UTS #51, Unicode Emoji

In L2/19-203 Working Draft for Proposed Update UTS #51, Unicode Emoji there
is, in section C.2, the following.


A sequence of TAG characters corresponding a Q followed by a sequence of one
or more digits, corresponding to a valid Wikidata QID representing a
depictable object.

end quote

and also the following


A subset of QIDs are associated with entities that would be valid for emoji.
For example, risk management (Q189447) and this (Q3109046) would not be
valid. Of those that are valid, Wikidata may not have associated images for
the referenced entity, and such images would rarely — if ever — be
appropriate for use as images for emoji.

end quote

I suggest that there should not be that restriction, and that all QID items
should be valid for QID emoji and thus for interchange and interoperability
in a plain-text environment. Some may never be used, yet to state that some
"would not be valid" would be a decision that could restrict progress and
the implementation and beneficial application of new ideas in the future.

There is also the practical problem of how such a rule could be precisely
applied.

Also, the word 'depict' as defined in the Oxford English Dictionary seems to
mean that a QID emoji of each QID item would be valid under the quoted
definition from the L2/19-203 document.


This seems to come back to the issue of whether emoji can be of abstract
designs rather than just of physical objects.

In my opinion, restricting emoji to images of physical objects is
unnecessary and undesirable as it would limit creativity and opportunities
for communication of ideas.

In my opinion the expression of ideas using abstract designs is an important
part of human culture.

As it happens when we were discussing the possibility of abstract emoji some
time ago in the public mailing list I produced glyphs for "this" and for
"that" as a gentleman had indirectly suggested the possibility. They are
about 60% of the way down the following web page.


I accept that "this" as in "this and that" is not the same as "this" as used
in some computer languages, yet maybe, just maybe, a glyph for "this" used
in that context could be like my design for a glyph for "this" with a large
round dot, say in green, added in the lower right corner, so as to indicate
the dot used in listing the name of an object in some computer programming
languages.

Restricting which QID items could be emoji also restricts the possibility of
using the QID page data for text-to-speech. For example, risk management
(Q189447) already has text in three languages. The encoding of abstract
items as QID items, and thus as QID emoji, could help communication across
the language barrier, possibly very helpfully in emergency situations.

I have devised a glyph for risk management.

The glyph is of a red jagged shape enclosed within a yellow rounded shape
for the colourful version; the monochrome version is of a solid jagged
shape enclosed within the outline of a rounded shape.

Shapes something like those in the following article.


William Overington

Tuesday 28 May 2019

Date/Time: Mon Jul 1 11:23:31 CDT 2019
Name: David Corbett
Report Type: Error Report
Opt Subject: Code point labels vs. names in UTS #18

Section 2.5 of UTS #18 recommends supporting code point labels like
\p{name=private-use-E000}. It allows supporting aliases not in the UCD, but
warns that they may clash with future official names or aliases. There is
nothing preventing the encoding of a character with the name
PRIVATE-USE-E000, so the same warning should apply to code point labels.
Alternatively, there could be a stability guarantee that no character’s name
will ever look like a code point label.
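For reference, the label form under discussion is generated mechanically;
this tiny sketch assumes the UAX #44 prefix conventions (control-,
reserved-, noncharacter-, private-use-, surrogate-):

```python
# Minimal sketch of the code point label form discussed above.
# The prefix is one of the UAX #44 label categories; the suffix is the
# code point in uppercase hex, at least four digits.
def private_use_label(cp: int) -> str:
    return f"private-use-{cp:04X}"

label = private_use_label(0xE000)   # "private-use-E000", as in \p{name=...}
```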

Date/Time: Mon Jul 1 11:50:10 CDT 2019
Name: David Corbett
Report Type: Error Report
Opt Subject: Problems with hyphens in character names in UTS #18

Section 2.5 of UTS #18 says “Name matching rules follow Matching Rules from
[UAX44].” UAX44-LM2 says to ignore all medial hyphens except the one in
U+1180. However, section 2.5.1 says to ignore hyphens when matching names
for \N, with three exceptional pairs: U+0F68 vs. U+0F60, U+0FB8 vs. U+0FB0,
and U+116C vs. U+1180, “where an extra test shall be made for the presence
or absence of a hyphen”. Is it intentional that \p{name} and \N use
different fuzzy match rules?

For example, \p{name=TIBETAN LETTER-A} matches U+0F68 TIBETAN LETTER A
because, per UAX44-LM2, a medial hyphen is equivalent to a space; \N{TIBETAN
LETTER-A} matches U+0F60 TIBETAN LETTER -A because it contains a hyphen.
This seems confusing.

For another example, \p{name=TIBETAN MARK BKA SHOG YIG MGO} matches nothing
because the hyphen in U+0F0A TIBETAN MARK BKA- SHOG YIG MGO is not medial;
\N{TIBETAN MARK BKA SHOG YIG MGO} does match U+0F0A because the hyphen is
ignored.

Also, “an extra test shall be made for the presence or absence of a hyphen”
is unclear. Would \N{TIBETAN LET-TER A} match U+0F60 TIBETAN LETTER -A
because a hyphen is present?

There are more than three pairs of characters whose names differ only by a
hyphen: there are also U+11A00 vs. U+11A29, U+11A50 vs. U+11A7A, and U+11C8F
vs. U+11C88.
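The loose-matching behaviour at issue can be sketched as follows (my reading
of UAX44-LM2, not normative text):

```python
import re

# Sketch of UAX44-LM2 loose matching: lowercase, drop spaces and
# underscores, and drop medial hyphens except the one in
# U+1180 HANGUL JUNGSEONG O-E.
def loose_key(name: str) -> str:
    key = name.lower()
    if key != "hangul jungseong o-e":
        # A medial hyphen sits directly between two letters or digits.
        key = re.sub(r"(?<=[0-9a-z])-(?=[0-9a-z])", "", key)
    return key.replace(" ", "").replace("_", "")

# The first example above: the medial hyphen makes these two names collide...
assert loose_key("TIBETAN LETTER-A") == loose_key("TIBETAN LETTER A")
# ...while U+0F60's non-medial hyphen keeps its name distinct:
assert loose_key("TIBETAN LETTER -A") != loose_key("TIBETAN LETTER A")
```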

Error Reports

Date/Time: Tue May 7 16:29:23 CDT 2019
Name: David Corbett
Report Type: Error Report
Opt Subject: Saurashtra C2-conjoining forms

The Saurashtra section says “An exception to the non-occurrence of complex
consonant clusters is the conjunct ksa, formed by the sequence <U+A892,
U+A8C4, U+200D, U+A8B0>. [...] If necessary, U+200D ZERO WIDTH JOINER may
be used to force conjunct behavior.” That implies that the syllable “kra” in
the old-fashioned style would be formed by the sequence <ka, virama, ZWJ,
ra>. That is the opposite of the usual Indic practice (see PR #37) where
a C2-conjoining form (as in Saurashtra) is formed by <ZWJ, virama,
C2>. I recommend using <ZWJ, virama>.

I know of only two Saurashtra fonts: Pagul and Noto Sans Saurashtra. Neither
really supports C2-conjoining forms. (Pagul supports one but it doesn’t even
use ZWJ at all. Noto supports only a couple specific syllables.) So I
wouldn’t worry about breaking any existing text.

Since “kṣa” is an atomic conjunct, either order could work. Using <ka,
ZWJ, virama, ṣa> seems more consistent with other syllables, but using
<ka, virama, ZWJ, ṣa> would be more compatible with previous versions
of TUS and would allow <ka, ZWJ, virama, ṣa> to request a non-atomic
“kṣa”, so I recommend keeping “kṣa” as it is.

In any case, I recommend documenting all of this explicitly.
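For reference, the competing orderings written out as code point sequences; U+A8B0 is taken from the quoted TUS text as the second consonant of the conjunct:

```python
KA, VIRAMA, ZWJ, C2 = "\uA892", "\uA8C4", "\u200D", "\uA8B0"

ksa_as_in_tus = KA + VIRAMA + ZWJ + C2  # <C1, virama, ZWJ, C2>
ksa_usual     = KA + ZWJ + VIRAMA + C2  # <C1, ZWJ, virama, C2>, the usual
                                        # Indic C2-conjoining order
print([f"U+{ord(c):04X}" for c in ksa_as_in_tus])
```

The two sequences contain the same code points and differ only in the relative order of the virama and ZWJ, which is why the prose needs to pin one down explicitly.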

Date/Time: Sat Apr 27 20:56:17 CDT 2019
Name: David Corbett
Report Type: Error Report
Opt Subject: Bad advice about precomposed Egyptian hieroglyphs

The section on Egyptian Hieroglyph Format Controls says “Some Egyptian
hieroglyphs with complex structures have previously been encoded as single
characters. When glyphs for these single characters are available in the
font, the precomposed hieroglyphs should be used instead of complex
sequences of hieroglyphs with appropriate joining controls”. This makes the
encoding of Egyptian hieroglyphs depend on the choice of font, which is
inappropriate for plain text. The standard should not include the phrase
“When glyphs for these single characters are available in the font”.

Date/Time: Thu May 9 14:28:21 CDT 2019
Name: David Corbett
Report Type: Error Report
Opt Subject: More Identifier_Type=Technical characters

The following should have Identifier_Type=Technical for consistency with 
other phonetic symbols.


Date/Time: Thu May 9 15:17:02 CDT 2019
Name: David Corbett
Report Type: Error Report
Opt Subject: Letter ra in Pali in Shan State

L2/10-345 says that in Pali as written in Shan State, the letter ra is
U+AA79 MYANMAR SYMBOL AITON TWO. UTN #11 says nothing about that, implying
that ra is actually the similar-looking U+101B MYANMAR LETTER RA. Which code
point should be used for that ra? If U+AA79 should, then it should have
gc=Lo and InSC=Consonant. If the glyph shown in L2/10-345 is just a
stylistic variant of U+101B, then no action is needed.

Date/Time: Thu May 9 16:11:03 CDT 2019
Name: David Corbett
Report Type: Error Report
Opt Subject: Ambiguous precomposed hieroglyphic quadrats with format controls

The section on Egyptian Hieroglyph Format Controls says to use precomposed
hieroglyphs when available. One of the examples in Table 11-2 is that 13217
is preferred over 13216:13216:13216. How should a quadrat of four stacked
13216s be encoded: 13216:13216:13216:13216, 13216:13217, or 13217:13216? It
is ambiguous. I think the advice should be to prefer precomposed hieroglyphs
when they are the entire quadrat, but not to use precomposed hieroglyphs in
a complex cluster with format controls.
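A sketch of the three candidate encodings, using U+13430 EGYPTIAN HIEROGLYPH VERTICAL JOINER for the ":" in the notation above:

```python
VJ = "\U00013430"  # EGYPTIAN HIEROGLYPH VERTICAL JOINER
S  = "\U00013216"  # the single sign
P  = "\U00013217"  # the precomposed stack of three

four_stack_variants = {
    VJ.join([S, S, S, S]),  # 13216:13216:13216:13216
    VJ.join([S, P]),        # 13216:13217
    VJ.join([P, S]),        # 13217:13216
}
print(len(four_stack_variants))  # 3 distinct encodings for one quadrat
```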

Date/Time: Sun May 19 18:19:24 CDT 2019
Name: David Corbett
Report Type: Error Report
Opt Subject: Identifier_Type of U+10A0D KHAROSHTHI SIGN DOUBLE RING BELOW

U+10A0D KHAROSHTHI SIGN DOUBLE RING BELOW has
Identifier_Type=Technical|Exclusion. It is not a technical character. It
should just have Identifier_Type=Exclusion.

Date/Time: Thu May 23 05:27:36 CDT 2019
Name: Mike FABIAN
Report Type: Error Report
Opt Subject: The Emoji ZWJ Sequence “people holding hands”, which appeared in Emoji-12.0/Unicode-12.0 has “10.0” in the comments in emoji-zwj-sequences.txt

The emoji zwj sequence:

    1F9D1 200D 1F91D 200D 1F9D1

“people holding hands” first appeared in


it is not in:


So apparently this was added in 12.0.



emoji-zwj-sequences.txt contains 10.0 in the comments:

$ grep "people holding hands" emoji-zwj-sequences.txt
1F9D1 200D 1F91D 200D 1F9D1                 ; Emoji_ZWJ_Sequence  ; people holding hands                       # 10.0  [1]
1F9D1 1F3FB 200D 1F91D 200D 1F9D1 1F3FB     ; Emoji_ZWJ_Sequence  ; people holding hands: light skin tone      # 10.0  [1]
... ETC ... 
1F9D1 1F3FF 200D 1F91D 200D 1F9D1 1F3FF     ; Emoji_ZWJ_Sequence  ; people holding hands: dark skin tone       # 10.0  [1]

That seems to be a mistake.
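The stray version comment can be spotted mechanically; this sketch assumes only the field layout visible in the grep output above:

```python
import re

# Matches lines like:
# 1F9D1 200D 1F91D 200D 1F9D1 ; Emoji_ZWJ_Sequence ; people holding hands # 10.0 [1]
LINE = re.compile(
    r"^(?P<seq>[0-9A-F ]+?)\s*;\s*Emoji_ZWJ_Sequence\s*;\s*"
    r"(?P<name>[^#]+?)\s*#\s*(?P<version>\S+)"
)

def versions(lines):
    """Map each sequence name to the version noted in its comment."""
    return {m["name"]: m["version"] for m in map(LINE.match, lines) if m}

sample = [
    "1F9D1 200D 1F91D 200D 1F9D1 ; Emoji_ZWJ_Sequence ; people holding hands # 10.0  [1]",
]
print(versions(sample))  # {'people holding hands': '10.0'}
```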

Date/Time: Mon Jul 1 12:50:01 CDT 2019
Name: David Corbett
Report Type: Error Report
Opt Subject: Underdefined code point label syntax

Subsection “Code Point Labels” in section 4.8 of The Unicode Standard says
“code point labels are constructed by using a lowercase prefix derived from
the code point type, followed by a hyphen-minus and then a 4- to 6-digit
hexadecimal representation of the code point.” This is a mite ambiguous. May
the hexadecimal representation use lowercase letters or fullwidth
characters, since they match \p{Hex_Digit}? Is the hexadecimal
representation allowed to have extra leading zeros, e.g.
<control-000009>? (This matters because code point labels are part of
UTS #18 syntax.)
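For illustration, one strict reading of the label syntax as a regular expression; the uppercase-only and no-extra-leading-zeros choices here are assumptions, which is precisely the ambiguity being reported:

```python
import re

# One strict interpretation: ASCII uppercase hex, minimum width 4, no
# leading zeros beyond that minimum, code points up to 10FFFF.
STRICT_LABEL = re.compile(
    r"^(control|noncharacter|private-use|reserved|surrogate)-"
    r"(?:[0-9A-F]{4}|[1-9A-F][0-9A-F]{4}|10[0-9A-F]{4})$"
)

print(bool(STRICT_LABEL.match("control-0009")))    # True
print(bool(STRICT_LABEL.match("control-000009")))  # False under this reading
```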

Date/Time: Thu Jul 4 23:01:44 CDT 2019
Name: David Corbett
Report Type: Error Report
Opt Subject: Misleading description of Manichaean ligatures

The Manichaean section says “Manichaean has two obligatory ligatures for
sadhe followed by yodh or nun”, but they are not obligatory and they are not
always ligatures. See https://github.com/googlefonts/noto-fonts/issues/1550 
for details and evidence.

Date/Time: Mon Jul 8 02:10:57 CDT 2019
Name: Ivan Timokhin
Report Type: Error Report
Opt Subject: Inconsistency in name derivation rules ranges

Ed Note: This has already been fixed for next version.

There appears to be a mismatch between ranges for name derivation rules
listed in Table 4-8 of the Standard v12.1 (p. 185,
https://www.unicode.org/versions/Unicode12.1.0/ch04.pdf) and the contents of
UnicodeData.txt (https://www.unicode.org/Public/12.1.0/ucd/UnicodeData.txt).
Namely, Table 4-8 contains ranges 4E00..9FEA for "CJK UNIFIED IDEOGRAPH" and
17000..187EC for "TANGUT IDEOGRAPH", whereas corresponding ranges in
UnicodeData.txt end at 9FEF and 187F7 respectively. Furthermore, code charts
actually contain entries for the rest of these ranges, and, for what it's
worth, ICU appears to report names generated by the corresponding rules for
these characters. This, together with the disclaimer at the beginning of
Chapter 4, suggests to me that it is Table 4-8 that is incorrect.
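The mismatch is easy to probe with Python's unicodedata module (which UCD version it reflects depends on the Python build, so the disputed tail is checked defensively):

```python
import unicodedata

# A name from inside the undisputed range follows the derivation rule:
assert unicodedata.name("\u4E00") == "CJK UNIFIED IDEOGRAPH-4E00"

# The tail that Table 4-8 omits (9FEB..9FEF):
for cp in range(0x9FEB, 0x9FF0):
    try:
        print(f"U+{cp:04X}: {unicodedata.name(chr(cp))}")
    except ValueError:
        print(f"U+{cp:04X}: unassigned in this Python's UCD")
```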

Date/Time: Mon Jul 15 10:31:38 CDT 2019
Name: Ken Lunde
Report Type: Error Report
Opt Subject: Kanbun block property versus implementation issue

With regard to the 16 characters in the Kanbun (漢文) block, the last 14 of
them, U+3192 through U+319F, have the <super> (superscript) property,
and include a compatibility decomposition (NFKC and NFKD) to a CJK Unified
Ideograph.

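The decomposition in question can be verified directly with Python's unicodedata module, e.g. for U+3192 IDEOGRAPHIC ANNOTATION ONE MARK:

```python
import unicodedata

# U+3192 carries a <super> compatibility decomposition to U+4E00.
print(unicodedata.decomposition("\u3192"))                  # <super> 4E00
print(unicodedata.normalize("NFKD", "\u3192") == "\u4E00")  # True
```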
These characters are referred to as "kaeriten" (返り点), and are used to
annotate Chinese texts (aka Kanbun). They have been in Unicode from the
beginning (Version 1.0), and are not included in any JIS standard, other
than in JIS X 0221 that is effectively a clone of ISO/IEC 10646. Very few
non-Japanese fonts include glyphs for these characters, because their use is
Japanese-specific. However, the glyphs for these characters in the vast
majority of Japanese fonts are provided at full size with the expectation
that the layout engine reduce them to half size and position them
appropriately. Some typefaces, such as the Hiragino families and Kozuka
Mincho, provide generic glyphs for these characters that do not vary by
weight, but other typefaces include glyphs that do vary by weight. Their
glyphs may or may not be identical to their corresponding CJK Unified
Ideographs, but all such known implementations use separate GIDs (Glyph IDs)
for them.

These characters are similar to Kenten (圏点) characters whose glyphs are also
provided at full size with the same expectation from the renderer. Adobe
InDesign, from Version 1.0J, supports the typesetting of Kenten characters,
including U+FE46 ﹆, and is bundled with a dedicated font that provides their
glyphs (at full size).

In terms of Unicode, the Kanbun subsection of Section 18.1, Han, of the Core
Specification (page 720) states only the following about this particular
block:

"This block contains a set of Kanbun marks used in Japanese texts to
indicate the Japanese reading order of classical Chinese texts. These marks
are not encoded in any other current character encoding standards but are
widely used in literature. They are typically written in an annotation
style to the left of each line of vertically rendered Chinese text. For
more details, see JIS X 4051."

JIS X 4051:2004 is still the latest version of that referenced standard,
which was reconfirmed on 2018-10-22, and JLREQ, which implemented much of
the functionality of that standard, provides no information about Kanbun.
Section 5 of JIS X 4051 (pp 35 through 44) covers Kanbun, and states the
following in Section 5.4:

返り点,送り仮名及び読み仮名の文字サイズ: 返り点,送り仮名及び読み仮名の文字サイズは,漢文の漢字の文字サイズの 1/2
とする。処理系定義により,返り点,送り仮名及び読み仮名は,漢文の漢字の文字サイズの 1/2 以下としてもよい。

(Translation: Character size of kaeriten, okurigana, and yomigana: the
character size of kaeriten, okurigana, and yomigana shall be 1/2 of the
character size of the kanji in the kanbun text. As implementation-defined
behavior, they may instead be 1/2 of that size or smaller.)

The above statement seems very clear that the glyphs for kaeriten (返り点),
along with those for okurigana (送り仮名) and yomigana (読み仮名), are to be scaled
to one-half size (or smaller).

See Noto CJK Issue #159 for some spirited discussion, which includes some
unfortunate misunderstandings on my part (facepalm):


My question to the UTC, and to character property experts in particular, is
whether the normative <super> property and associated decomposition,
which are in conflict with how virtually all Japanese fonts supply the
glyphs for the Kanbun block at full size with the expectation that they be
scaled and positioned as appropriate, is an issue in a practical sense, or
at some level. I am asking, because we're facing what I consider to be two
effective non-starters here: 1) changing the property value; and 2) changing
hundreds of Japanese fonts.

I do plan to change the typefaces under my control, meaning the Noto CJK
(and Source Han) families and Kozuka Gothic, to use generic glyphs for these
characters that do not vary by weight, but otherwise have no plans to adjust
their size or positioning.

Date/Time: Mon Jul 15 17:48:53 CDT 2019
Name: Jaemin Chung
Report Type: Feedback on an Encoding Proposal
Opt Subject: Some incorrect radical-stroke values in Ext G

Some radical-stroke values in Extension G are incorrect.

U+30059: 7.11 → 7.12
U+300E3: 12.8 → 12.9
U+3010E: 15.15 → 15.13

If reordering of characters is still possible, these pairs need to be swapped:
U+30059 and U+3005A
U+300E3 and U+300E4

Date/Time: Mon Jul 15 20:10:14 CDT 2019
Name: Eiso Chan
Report Type: Public Review Issue
Opt Subject: Comment on the RS for U+3010E (GZ-4571201) in WG2 N5100R2

There is no need to change the RS value for U+3010E (GZ-4571201), because
the IRG has confirmed that the stroke count of the component 争 is counted
as that of 爭; the two are a unifiable case.

Date/Time: Tue Jul 16 23:01:39 CDT 2019
Name: William T. Nelson
Report Type: Error Report
Opt Subject: U+2DCE7 glyph issue

The glyph for U+2DCE7 𭳧 (KC-06432) in the CJK extension F code chart renders
incorrectly in some applications on macOS, including Preview, Safari, and
Firefox. This glyph renders correctly on iOS (all apps) and in Chrome for

Here's a screenshot:


I flipped through the charts and found more cases. 
Here are all of them:

block code_point char_ref
U U+6C08 K0-6E7D
U U+860B V1-6568
B U+25706 UCS2003
B U+25990 UCS2003
B U+2620F UCS2003
B U+26286 UCS2003
B U+2822D UCS2003
B U+28F17 UCS2003
B U+29F4A UCS2003
E U+2CC56 GXC-3023.15
F U+2DCE7 KC-06432


Date/Time: Tue Jul 16 12:27:04 CDT 2019
Name: Jaemin Chung
Report Type: Feedback on an Encoding Proposal
Opt Subject: More radical-stroke value errors in Ext G

I found more radical-stroke value errors in Extension G.

U+30A12: 113.12 → 113.10 (should be moved between U+30A0F and U+30A10)
U+30A3F: 115.15 → 115.16 (U+30A3F and U+30A40 should be swapped)
U+30AD0: 119.12 → 119.11 (no reordering needed)
U+31173: 188.11 → 188.12 (no reordering needed)
U+311CB: 195.21 → 195.23 (U+311CB and U+311CC should be swapped)
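Since Extension G is expected to be ordered by (radical, additional strokes), the needed swaps fall out of a sort. In this sketch only U+30A3F's corrected value comes from the report above; the neighbour's value is hypothetical, for illustration:

```python
rs = {
    0x30A3F: (115, 16),  # corrected from 115.15, per the report
    0x30A40: (115, 15),  # hypothetical neighbour value
}
needs_swap = sorted(rs) != sorted(rs, key=rs.get)
print(needs_swap)  # True: the pair is out of radical-stroke order
```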

Date/Time: Tue Jul 23 14:00:41 CDT 2019
Name: David Corbett
Report Type: Error Report
Opt Subject: Identifier_Type of IPA characters

Some characters that are actually in common use as IPA symbols have
Identifier_Type=Uncommon_Use. They should have Technical but not Uncommon_Use.


Date/Time: Tue Jul 23 14:09:43 CDT 2019
Name: Norbert Lindenberg
Report Type: Error Report
Opt Subject: Incorrect script code in UAX 24

Unicode® Standard Annex #24, UNICODE SCRIPT PROPERTY, contains the phrase
"so it is assigned an scx set value of {Hira Kata}".

"Kata" is an incorrect script code. The ISO 15924 script code for Katakana
is "Kana", which is already used correctly in the table above the incorrect
phrase.

Other Reports

Date/Time: Sun Jul 7 13:41:43 CDT 2019
Name: Ken Martin
Report Type: Submission (FAQ, Tech Note, Case Study)
Opt Subject: LOINC codes Unicode emoji Skin Type

I'm not sure if you are aware, but LOINC recently released their new set of
codes.  I had submitted Unicode emoji Skin Type modifiers and Fitzpatrick
Skin Type questions to LOINC.  The Fitzpatrick questions will probably be
released in December, but the Unicode emoji skin type modifier codes are
out.  I believe these are the first LOINC codes for emojis.

I looked at the emojis for pain scales, but the faces appeared to differ in
the literature.  Perhaps this is a project you can do, since pain emojis are
clinically useful.

You can view the complete Skin type emoji codes in LOINC. Here are the
codes:

89843-7 	Unicode Emoji skin tone modifier

Source: 	Unicode, Inc.
SEQ#  	    	Answer  	    	                                   Answer ID
1    	 	Light skin tone
                Unicode: 1F3FB EMOJI MODIFIER FITZPATRICK TYPE-1-2
                Description: Emoji Modifier Fitzpatrick Type-1-2           LA29279-9

2    	 	Medium-light skin tone
                Unicode: 1F3FC EMOJI MODIFIER FITZPATRICK TYPE-3
                Description: Emoji Modifier Fitzpatrick Type-3             LA29280-7

3    	 	Medium skin tone
                Unicode: 1F3FD EMOJI MODIFIER FITZPATRICK TYPE-4
                Description: Emoji Modifier Fitzpatrick Type-4             LA29281-5

4    	 	Medium-dark skin tone
                Unicode: 1F3FE EMOJI MODIFIER FITZPATRICK TYPE-5
                Description: Emoji Modifier Fitzpatrick Type-5             LA29282-3

5    	 	Dark skin tone
                Unicode: 1F3FF EMOJI MODIFIER FITZPATRICK TYPE-6
                Description: Emoji Modifier Fitzpatrick Type-6             LA29283-1