Comments on Public Review Issues
(August 03 - October 20, 2017)

The sections below contain links to permanent feedback documents for the open Public Review Issues as well as other public feedback as of August 3, 2017, since the previous cumulative document was issued prior to UTC #152 (August 2017). Some items in the Table of Contents do not have feedback here.


The links below go directly to open PRIs and to feedback documents for them, as of October 9, 2017.

Issue  Name                                                            Feedback Link
359    Proposed Draft UTR #53, Unicode Arabic Mark Ordering Algorithm  (feedback)
358    Proposed Update UTS #10, Unicode Collation Algorithm            (feedback) No feedback to date
357    Proposed Update UAX #44, Unicode Character Database             (feedback) No feedback to date
356    Proposed Update UTS #51, Unicode Emoji                          (feedback)
355    Proposed Update UAX #29, Unicode Text Segmentation              (feedback)

The links below go to locations in this document for feedback.

Feedback to UTC / Encoding Proposals
Feedback on UTRs / UAXes
Error Reports
Other Reports

Note: The section of Feedback on Encoding Proposals this time includes:
L2/11-359  L2/12-309  L2/15-338  L2/17-190  L2/17-238  L2/17-255  L2/17-303  L2/17-339  L2/17-366  L2/17-372  L2/17-380  L2/17-382 


Feedback to UTC / Encoding Proposals

Date/Time: Sat Aug 12 16:33:45 CDT 2017
Name: Eduardo Marin Silva
Report Type: Error Report
Opt Subject: On tally marks in vertical text

I make this form since I cannot wait for the next UTC.

In this document: http://www.unicode.org/L2/L2017/17255-script-ad-hoc.pdf the
ad hoc committee discussed my proposal concerning tally marks. I argued for
three things: the addition of named character sequences to unambiguously refer
to two, three, and four; changing the name of the tally marks to refer to
their shape, instead of leaving them as the apparently sole system of tally
marks; and making the tally marks rotate in vertical text.

• The first request is only for convenience; if they feel the hassle of
getting the sequences encoded is not worth the benefits, that's fine.

• The name change, I believe, while not necessary per se, would remove
ambiguity about which tally mark one is referring to, particularly since there are
still unencoded "box tally marks" used in South America, and users there may
feel that the names were not assigned fairly.

• As for the tally marks in vertical text, I do not need to present evidence
of their usage there, because one just has to consider what happens when one
attempts to enter them in such an environment. While number five would occupy
a cell as expected, the number four would occupy four cells (four times as
tall), meanwhile the rotated glyphs would only occupy a narrow band. Maybe
some higher order software would be able to force the four characters to
occupy the same cell, but the fact remains that the fallback will remain
unacceptable. Asking for evidence is like asking for instances of VERTICAL
LINE in vertical contexts, even though it is obvious that the rotated glyph
is more desirable.

Date/Time: Wed Aug 9 09:45:21 CDT 2017
Name: David Corbett
Report Type: Feedback on an Encoding Proposal
Opt Subject: Anglicana w and paleographic variation L2/17-238

L2/17-238 amply demonstrates that two forms of w were used in the same
manuscript. It doesn’t show that the distinction was meaningful or anything
more than scribal whim; I conclude the character is meant for use by
paleographers, for whom scribal whims are significant.

This is not the only character for which a single manuscript has multiple
interchangeable glyphs. For example, “Le roman de la rose” (MS. Douce 195)
uses two versions of d. I suggest that the UTC create a policy for
paleographic variants in general before encoding this single variant. For
example, are manuscripts enough evidence, or should there be evidence from
modern works, to show that paleographers do distinguish the glyphs in plain

Date/Time: Thu Aug 17 07:17:57 CDT 2017
Name: Christoph Päper
Report Type: Public Review Issue
Opt Subject: Categories of Emoji Draft Candidates (WG2 N4904 / L2/17-366)

X+1F9A0 Microbe is categorized as Objects / tool. It should be in the Animals
& Nature category, perhaps in a new subcategory. X+1F9EC DNA Double Helix is
categorized as Objects / tool. It should be either in the Animals & Nature
category or in the Symbols category, perhaps in a new subcategory. X+1F9F9
Broom is categorized as Objects / other-object. It belongs to the tool

Several other candidates are lumped in together within Objects / other-object.
They should get at least one new subcategory. This would be either 'household'
or 'hygiene' (X+1F9F4 Squeeze Bottle / Lotion, X+1F9FB Toilet Paper, X+1F9FC
Soap, X+1F9FD Sponge), although related emojis are found in Travel & Places /
hotel, and Activities / craft (X+1F9F5 Thread, X+1F9F6 Yarn, X+1F9F7 Safety

X+1F9F8 Teddy Bear may fit better within Activities / game, since there is no
toy subcategory.

Furthermore, I'd like to suggest reconsidering the subcategories of some
existing emojis within the Travel & Places category:

U+1F3D9 City Scape should move to similar ones in place-other, or into a new
subcategory place-scenery. U+1F3B0 Slot Machine is better found in Activities
/ game. U+1F3A8 Artist Palette should be moved to Objects / tool as it does
not indicate a place. Or move it to a new subcategory Activities / art
together with (at least) U+1F3AD Performing Arts and U+1F5BC Framed Picture.
U+2668 Hot Springs better fits with other signs in Symbols / transport-sign.
U+1F6D1 Stop Sign also belongs in Symbols / transport-sign.

If the Unicode Consortium had the resources, you should conduct several card
sorting sessions to determine categories and orders that actually feel natural
to people.

Date/Time: Tue Oct 3 20:01:27 CDT 2017
Name: David Corbett
Report Type: Feedback on an Encoding Proposal
Opt Subject: Feedback on L2/17-339

L2/17-339 “Revised chart of Naxi Dongba characters” has some problems. I
found these by skimming; I probably missed many. The discrepancies between
character names and phonetic transcriptions should be solved by
automatically deriving the former from the latter.

On page 17, the phonetic transcription of character 75 is “bjə³¹”. It is one
of only two transcriptions to include “j”. It should probably be “biə³¹”.

On page 22, the phonetic transcription of character 103 is “bv̩³³cjə⁵⁵”. It
is the only transcription to include “cj”. It should probably be

On pages 32 and 85, the phonetic transcriptions of characters 152 and 417
include “ɣ” but their names are written as if those syllables had no onsets.

On page 140, the names and phonetic transcriptions of characters 691 and 692
are mixed up.

On page 143, the name of character 709 does not match its phonetic
transcription: “TV” vs. “tʰv̩³¹”.

On page 157, the name of character 776 has an extra “DONGBA CHARACTER”.

On page 187, the name of character 927 does not match its phonetic
transcription: “SEEL” vs. “sɿ³³”.

On page 206, the gloss of character 1020 should be “flag”, not “flab”.

On page 211, the references of characters 1045 and 1046 are swapped.

On page 220, the names and phonetic transcriptions of characters 1092 and
1093 are mixed up.

On page 229 ff., the names of the characters are missing a space in

No character for ‘eight’ is listed. In fact, ‘eight’ is /ho⁵⁵/, which is
presumably why the character currently named DONGBA CHARACTER HOL includes
eight short vertical lines. It is odd that that number should be missing
from the repertoire when other small numbers are included.

Date/Time: Sat Oct 7 15:32:20 CDT 2017
Name: Eduardo Marín Silva
Report Type: Feedback on an Encoding Proposal
Opt Subject: n4904 feedback: Lao glyphs and names (L2/17-366)

The glyphs for the new letters to be added, are not harmonious with the rest
of the glyphs in the codechart, overall the stroke width is thinner in the
new characters. It is also not clear why the names of the letters need the
adjective "Pali" and "Sanskrit" in them, since none of the names would clash
if those words were dropped (in the case they did, it should be limited to
the letters where the names actually clash). At the very least the name for
the Virama should be changed to LAO SIGN VIRAMA, because it is meant to
create new sounds regardless of language and there has never been a precedent
for naming a virama after a language.

Date/Time: Sat Oct 7 15:47:53 CDT 2017
Name: Eduardo Marín Silva
Report Type: Feedback on an Encoding Proposal
Opt Subject: n4904 feedback: VEDIC SIGN JIHVAMULIYA glyph (L2/17-366)

Since the glyph for this character has changed it is now confusable with its
Kannada counterpart (dotted box and all), to avoid complications it should
be noted when it is preferable to use one instead of another. 

Given what we have found out about the use of VEDIC SIGN ARDHAVISARGA and
its rotated version in Nandinagari, it is clear that they were meant to be
spacing characters, I'm not sure if a change of properties would be
warranted or even possible, but it is something to consider.

Date/Time: Sat Oct 7 16:46:13 CDT 2017
Name: Eduardo Marín Silva
Report Type: Feedback on an Encoding Proposal
Opt Subject: n4904 feedback chess notation symbol names and Group mark issues (L2/17-366)

PASSED PAWN SYMBOL, not only does it make the intended use clearer, the
somewhat redundant informative aliases can be dropped. Having the more
general names runs the risk of confusion. Also the Group mark has a cross
reference to the DOUBLE DAGGER, when it should be referencing the TRIPLE
DAGGER (2E4B). And the character THERMODYNAMIC 29E7 in the codecharts lacks
the informative alias: Record Mark

Date/Time: Sat Oct 7 16:55:56 CDT 2017
Name: Eduardo Marín Silva
Report Type: Feedback on an Encoding Proposal
Opt Subject: n4904 feedback: DOUBLE OBLIQUE HYPHEN WITH FALLING DOTS (L2/17-366)

As proposed it is not clear when that punctuation mark should be used (does
it share a function with the double oblique hyphen?), it at least should
mention its Cornish origins.

Date/Time: Sat Oct 7 17:21:24 CDT 2017
Name: Eduardo Marín Silva
Report Type: Error Report
Opt Subject: n4904 feedback: Pinyin uppercase letters (L2/17-366)

The chart does not mention their lowercase counterparts (there should also be
references to them in the charts of the lowercase letters). This should be
true for any cased letter pair where the corresponding letters are not
adjacent to each other.

Date/Time: Sat Oct 7 17:25:06 CDT 2017
Name: Eduardo Marín Silva
Report Type: Error Report
Opt Subject: n4904 feedback: NEWA LETTER VEDIC ANUSVARA glyph (L2/17-366)

The glyph looks nothing like the original version in the proposal and it
lacks harmony with the rest of the Newa letters.

Date/Time: Sun Oct 8 18:56:38 CDT 2017
Name: Eduardo Marín Silva
Report Type: Error Report
Opt Subject: n4904 feedback: New emoji part 1 (L2/17-366)

I start by saying that I already criticized certain emoji proposals in the
document: https://www.unicode.org/L2/L2017/17303-emoji-notes.pdf So here I
will only elaborate on the ones I didn't touch in that document.

The bagel is sliced to visually distinguish it from a donut, but what I don't
understand is why it was approved, considering that it's just a piece of
bread shaped differently. But assuming there is a good reason for its
inclusion, I suggest keeping the same glyph but just naming it BAGEL,
because font developers have access to color, it is not likely they would
necessarily represent it sliced, so having the more general name gives them
that option.

The Mammoth is not in my opinion sufficiently distinct from ELEPHANT to
merit separate encoding, but assuming it is, I take issue with the fact
that it is proposed to be a generic indicator of great size, when the WHALE
or the SAUROPOD characters are much more suited for that purpose.

A skunk is much more suited, in my opinion, than a badger to be encoded, due
to its distinctive connotation of bad odor and the phrase "drunk as a
skunk"; skunks have also appeared as famous characters, as in Looney Tunes
and Bambi. Badgers only became famous due to a viral video featuring a song
repeatedly saying the name, and certain memes saying "honey badger don't
care". So a badger is a lot more transient. Encoding both a badger and a
skunk runs the risk of confusion due to their similar appearance.

Date/Time: Sun Oct 8 20:06:02 CDT 2017
Name: Eduardo Marín Silva
Report Type: Feedback on an Encoding Proposal
Opt Subject: n4904 feedback: New emoji part 2 (L2/17-366)

It is a terrible idea to encode the character BILLIARD GAMES as it is, a
much more constructive approach is to do the same as what was done with
FLYING SAUCER. Instead of letting two separate but semantically related
glyphs be used (in that case the ALIEN character, both for a face and the
saucer), a second character to denote the saucer was encoded. I agree with
the Irish national body's original suggestion of returning the original glyph
to the character BILLIARDS and encoding a separate 8 BALL character. This
makes sure that the ones who actually did the correct thing by using the
original glyph, instead of conflating it, are rewarded. Also the name of the
original character will not be misleading. The only downside is that it may
open up the doors for a proposal to up to 21 different balls (including the
white ball but excluding the eight ball), along with a different set of 7
colored balls without number for snooker, along with a cue stick, pool table
and chalk, but I don't see that as problematic as long as all of those
symbols are kept in a separate block.

The firecracker emoji looks like dynamite, and while separate encoding of
dynamite is debatable, if one wants to represent firecrackers properly, one
needs to represent them in a line, so I propose changing the glyph to
represent the character LINE OF FIRECRACKERS.

There is no reason why the JIGSAW PUZZLE PIECE shouldn't be a solid color
(either all black or all white in this case).

The petri dish contains visible microbes, which makes it seem redundant with
the microbe emoji; instead it should look like actual cultures at the
macroscale (an ink-blot-like field with dots on it should be enough).

The BASKET is way too broad. If one wants to represent laundry, it should say
BASKET WITH CLOTHES, but if one just wants to represent a basket, then the
glyph should not include the heap that is present in the current one.
Personally I would prefer one character for HAND BASKET and another for PILE
OF CLOTHES. This allows for greater expressions to be made, like gathering and
shopping in the first case and overall untidiness in the second; both emoji
could then be used in succession to indicate laundry.

Also the coin should have been accepted into the repertoire due to its
connotation with chance and small amounts of money, there are several
machines that only accept coins as input, and the so called piggy banks are
designed for coins.

Date/Time: Wed Oct 11 18:58:49 CDT 2017
Contact: corbett.dav@husky.neu.edu
Name: David Corbett
Report Type: Feedback on an Encoding Proposal (L2/17-369)
Opt Subject: Indic_Syllabic_Category of Newa jihvamuliya and upadhmaniya

L2/17-369 proposes that the Newa jihvamuliya and upadhmaniya have 
Indic_Syllabic_Category=Consonant_Prefixed. Based on the manuscript sample 
in figure 1, a better category would be Consonant_With_Stacker. 
Consonant_Prefixed is for superjoined consonants, whereas Consonant_With_Stacker 
is for full-sized consonants to which subsequent consonants are subjoined.

Date/Time: Thu Oct 12 22:09:02 CDT 2017
Name: Eduardo Marín Silva
Report Type: Feedback on an Encoding Proposal
Opt Subject: On the Khitan block and Miao sign Nukta

It is my opinion that the repertoire for the Khitan small script is pretty 
exhaustive, and therefore there is not much need for extra space. The twelve 
unallocated code points should suffice for future discoveries, if the two extra 
columns are allocated, then it is likely it will be unused codespace that could 
have been used for other scripts. The new range should be 18B00-18CDF.

The MIAO SIGN NUKTA in my opinion would be better allocated in 16F8E, since 
not only is it closer to other combining marks but it leaves one more space 
for a future letter (5 instead of 6), yet it still leaves 6 codepoints for 
other vowel signs so it is almost like a balance.

Date/Time: Mon Oct 16 13:59:51 CDT 2017
Name: Eduardo Marín Silva
Report Type: Feedback on an Encoding Proposal
Opt Subject: Note on NEPTUNE FORM TWO

A reference to the character U+2646 NEPTUNE, is missing in the chart for NEPTUNE FORM TWO.

Date/Time: Wed Oct 18 20:40:31 CDT 2017
Name: Roozbeh Pournader
Report Type: Error Report
Opt Subject: Kashmiri digits forms should be mentioned in core spec

The core spec, in the text surrounding "Table 9-2. Glyph Variation in
Eastern Arabic-Indic Digits" and the table itself, should mention that
Kashmiri also uses the Eastern digits, and the digit shapes are identical to
the Urdu forms.

This is important for font support for Kashmiri, which is one of the 22
scheduled languages of India. It would help font designers find out about the
need to add localized forms for Kashmiri in their fonts.

Date/Time: Thu Oct 19 11:19:28 CDT 2017
Name: David Corbett
Report Type: Feedback on an Encoding Proposal
Opt Subject: Feedback on the Basque flag emoji

L2/17-382 proposes either U+1F1EA U+1F1F0 or U+1F3F3 U+FE0F U+200D U+2733
for the Basque flag. The first (EK) is not a Unicode region subtag. The
second is a combination of a flag and a dingbat, which sets a bad precedent,
because not all flags happen to look like existing emoji. I suggest 🏴espv✦
instead, which is already defined though not RGI.
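
For readers less familiar with the two mechanisms being compared, the following Python sketch shows how a regional indicator pair and a flag tag sequence are assembled. The function names are illustrative only, and the sketch assumes that the ✦ in the suggestion above stands for the terminating CANCEL TAG (U+E007F).

```python
REGIONAL_INDICATOR_A = 0x1F1E6  # REGIONAL INDICATOR SYMBOL LETTER A
BLACK_FLAG = 0x1F3F4            # WAVING BLACK FLAG, base of flag tag sequences
TAG_BASE = 0xE0000              # tag characters mirror ASCII at U+E0000 + code
CANCEL_TAG = 0xE007F            # terminates a tag sequence


def regional_indicator_pair(letters):
    """Map two ASCII letters such as 'EK' to a pair of regional indicators."""
    return "".join(
        chr(REGIONAL_INDICATOR_A + ord(c) - ord("A")) for c in letters.upper()
    )


def subdivision_flag(subtag):
    """Build a flag tag sequence from a subdivision subtag such as 'espv'."""
    tags = "".join(chr(TAG_BASE + ord(c)) for c in subtag.lower())
    return chr(BLACK_FLAG) + tags + chr(CANCEL_TAG)


def code_points(text):
    """Format a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(c):04X}" for c in text)


print(code_points(regional_indicator_pair("EK")))
# U+1F1EA U+1F1F0
print(code_points(subdivision_flag("espv")))
# U+1F3F4 U+E0065 U+E0073 U+E0070 U+E0076 U+E007F
```

The first output matches the sequence given in L2/17-382; whether a renderer displays either sequence as a flag depends on its own support and is not something this sketch can show.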

Date/Time: Fri Oct 20 07:15:23 CDT 2017
Name: Christoph Päper
Report Type: Other Question, Problem, or Feedback
Opt Subject: L2/17-380 ESC report 2017Q3

Separate topics for discussion

Other documents

> > * L2/17-296 — Comments on Recently Approved Emoji Candidate Names
> >     - Probably too late for name change, ESC consider for future CLDR names/keywords

It is definitely not too late to discuss and make changes to character
names. Unicode 11 is not even in beta yet. After the recent quarrel with
WG2, the UTC should make clear that all of its members and committees
understand that, so this can be resolved to guarantee fruitful cooperation
once again.

> > * L2/17-303 — Notes on emoji proposals
> >     - ESC considered human-form vs smiley for superhero/villain. Did not think
> >       the 12-18x cost for humanform emoji was worth it.

Emoticons are faces, mostly to represent facial emotions as well as some
actions (e.g. sneezing) and some features (e.g. glasses). There have been
cases of image sets, e.g. in early-2000s forum software or desktop instant
messengers, that used the classic 1970s yellow Smiley face or a non-
copyrighted variant thereof to represent all sorts of feelings, actions,
stereotypes, animals etc., see the defunct Smileyworld.com or now
<http://www.smiley.com/emoticons/dictionary>. If those would be remade
to use Unicode, they would certainly show a Superhero and a Supervillain
character with a face-centric design, but that is a design choice, not a
character choice. These emojis must be person/human-form. A Masked Face
emoji would be slightly different.

> >     - Suggestion of POO + ZWJ + <face> was reasonable, but probably too late in the process.

This draft character should be postponed until there is a binding and
lasting decision how to deal with additional emotional faces. The cat face
emojis were included for legacy reasons, but this one would open up a can of
worms for new requests. Please take your time to devise a more generic
solution! This could be in the form of new combining emojis for facial
properties (like Smiling Eyes, Open Mouth etc.) that could form sequences
(without ZWJ) with almost any emoji character that has FACE in its Unicode
designation (and perhaps some more).

> >     - SMILING FACE WITH SMILING EYES AND THREE HEARTS got a higher priority due to request data.
> >     - Other faces weren’t considered distinct enough, or high enough priority.

This very character had been dismissed in 2016. A ZWJ sequence, e.g. with
U+1F49E 💞, was deemed more appropriate, but actually neither recommended
(RGI) nor even documented.

> > * L2/17-376 — “Top of Head” Emoji feedback
> >     - These characters are already recommended as components.

An important point is that hair color (only red and white here) is quite a
different thing than hair style (curls) or its entire absence, yet the
proposed solution treats them alike. You could add Wig emojis and use them
for ZWJ sequences, but they would still need to be available *at least* in
black, brown, blond, ginger, gray and white variants via some method,
because right now the hair color in person emojis depends on the skin color
which is not a deterministic relationship in reality. Changing one's hair
color through dye is also a lot simpler and more common than making changes
to one's skin color. Tattoos are permanent, but hardly ever applied as solid

Forward RGI/Emojification requests to UTC


> > * L2/17-343 — Infinity Emoji Submission
> >     - Note that Samsung has emoji presentation for this.

This is not true. Samsung has emoji presentation for U+267E Permanent Paper
Sign which includes an infinity symbol, because Samsung systematically has
emoji representations for *all* characters in the Miscellaneous Symbols
block U+26xy.

> >     - Note that Samsung has emoji presentation for this.

This is true, but since this is a standardized map symbol showing it as a
gravestone is a misguided decision. If this was acceptable, e.g. U+1F3E6
Bank could have been unified with U+26FB Japanese Bank Symbol and U+26EF Map
Symbol for Lighthouse could be rendered as a Lighthouse emoji.

Tag Sequences

It certainly makes no sense to make flag emojis for all 50ish states of the
US become RGI without proven demand whereas only select ones from other
countries. Then again, the whole concept is flawed as fruitlessly discussed
in 2016.

ZWJ Sequences

> > * MAN/WOMAN + ZWJ + <new hair styles>

My counter-proposal would be to treat U+1F471 Person with Blond Hair and
perhaps U+1F487 Person Getting a Hair Cut as well as U+1F9D4 and an altered
X+1F9B1 Person with Curly Hair as the only emojis whose primary feature is
hair and therefore would be subject to generic color modifiers. These could
be done with existing heraldic tincture hatching pattern U+25A0,3..9
characters, i.e. they would only need the emoji property to become Swatches
and no new codepoints (except that arguably some are missing, see L2/11-094,
L2/16-318 and <https://github.com/Crissov/unicode-

> > * L2/17-389 Mike Drop / Mic Drop
> >     - ESC was neutral on this. Could be too trendy.

It probably fails the Fad criterion indeed. You should consider a hand
gesture that can clearly be used for dropping something, though.

> > * Others from L2/17-287 Section: ZWJ Sequences
> >     - Recommended against
> >         + Heart with knife
> >             * Unnecessarily violent; can be conveyed with existing sequence

I would have expected a debate whether it should use a dagger or a kitchen
knife, but rejecting something that thousands if not millions of peaceful
people have tattooed on their bodies as "unnecessarily violent" is just
inappropriate, especially when it indeed represents sadness instead.

ISO Character Requests

The ESC definitely needs to publish its ranked list of possible future
animal emojis ("our omnibus collection of animals") rather sooner than later
and it needs to say for which ones they already received proposals and the
reason why they have not progressed (yet).

That being said, the non-extinct animal characters proposed by WG2 seem like
an arbitrary selection that needs extension. ESC should not wait for
individual proposals to come in, but instead develop a list of animals that
are culturally relevant and distinguished throughout the world. These should
then be discussed by UTC and ISO/IEC and encoded all at once. Then be done
with it except for single additions once in a while when new evidence of
relevance has surfaced, just as with characters in any non-pictographic
script in Unicode.

> > 3. 1F9A4 SQUIRREL

I understand and partially share the reluctance to encode this one, because
Chipmunk is so similar. Please also consider the possibility of Squirrel
Face instead.

> > 7. 1F97B TROLL
> >     b. Was one of 64 emoji in proposal “64 Complementary Emoji”).
> >     f. Might set precedence for other “emotes” used by gamers (twitch)
> >        and rage faces; style would be out of place for emoji.

The document titled "64 Complementary Emoji" has apparently never been
published to the L2 registry.

Nobody is seriously requesting the addition of the copyrighted graphic known
as the "Troll Face Meme" or any other "Rage Face".

Date/Time: Mon Oct 23 07:52:59 CDT 2017
Name: Christoph Päper
Report Type: Feedback on an Encoding Proposal
Opt Subject: L2/17-381 Scuba emoji

The proposal for a Scuba Emoji mentions no reason for compatibility encoding, 
but there actually is. While the original unified Japanese emojis were a superset 
of (much less successful) WAP Pictograms in general, they were lacking a 
substitute for its `/sport/scuba` entry. There arguably may be other omissions 
(`animal/beetle`, `emotion/shakenHeart`, `map/zoo`, `music/rest`), but this one 
has no substitute whatsoever and should definitely have a Unicode character 
assigned to it.

 [Diving emoji]: https://github.com/Crissov/unicode-proposals/issues/179
 [WAP Pictograms]: https://github.com/Crissov/unicode-proposals/issues/260
 [WAP Pictogram Specification]: http://www.openmobilealliance.org/tech/affiliates/wap/wap-213-wapinterpic-20010406-a.pdf

Date/Time: Mon Oct 23 22:20:28 CDT 2017
Name: Eduardo Marín Silva
Report Type: Feedback on an Encoding Proposal
Opt Subject: Names of two astrological symbols

The name PROSERPINA could be confused to mean the comet called that way,
even though the proposal says they are not related. A better name may be
ASTROLOGICAL PROSERPINA. Or at the very least an annotation indicating its
true nature.

There has never been a need to encode any other astronomical symbol with the
"FIRST FORM" prefix. I suggest just calling 2BF0 ERIS; that makes it obvious
that that symbol takes precedence over the second one, and that they are not
interchangeable (in the proposal they weren't, hence the need of separate

Feedback on UTRs / UAXes

Date/Time: Sat Jul 29 21:52:56 CDT 2017
Name: Timothy Gu
Report Type: Error Report
Opt Subject: Issues with UTS #46's conformance test file

To whoever it may concern,

While developing a product conforming to "UTS #46: Unicode IDNA Compatibility
Processing, Version 10.0.0" [UTS46], we noticed a few issues with the provided
conformance testing file (IdnaTest.txt). These issues are preventing us from
implementing UTS #46 in tr46.js [TR46JS-ISSUE].

The IdnaTest.txt file is formatted as a list of semicolon-separated values.
The meanings of the specific columns are given in UTS #46 Section 8.1, an
excerpt of which is hereby reproduced [UTS46]:

> > No Field      Description
> > ...
> > 3  toUnicode  The result of applying toUnicode to the source, using "nontransitional".
> >               A blank value means the same as the source value; a value in [...] is a set of error codes.
> > 4  toASCII    The result of applying toASCII to the source, using the specified type: T, N, or B.
> >               A blank value means the same as the toUnicode value; a value in [...] is a set of error codes.
> > 
> > ...
> > 
> > An error in toUnicode or toASCII is indicated by an error list of the form [...]. In such a case, the 
> >   contents of that list are error codes based on the step numbers in UTS46 and IDNA2008:
> > 
> >     ...
> >     An for Section 4.2 ToASCII, step n
> >     ...

Given that "An" applies only to the ToASCII algorithm, not the ToUnicode
algorithm, it seems appropriate for field "toUnicode" in IdnaTest.txt to never
have an error code of form An. Yet, in the published IdnaTest.txt file
corresponding to version 10.0.0 [IDNA-TEST], there exist 305 entries in
IdnaTest.txt where an "An" error code appears under "toUnicode". In
particular, there exist 36 entries with _only_ an "An" error code under
"toUnicode" -- which, in other words, means that the only justification for
erroring on those entries from ToUnicode is not actually in ToUnicode.
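
A scan of this kind can be sketched in a few lines of Python. This is not the script used for this report; it assumes that `#` starts a comment, that fields are separated by semicolons, and that error codes inside `[...]` are separated by spaces. The function names are illustrative.

```python
import re

# ToASCII step codes such as A3 or A4_2 (assumed shape of the "An" codes).
TOASCII_CODE = re.compile(r"A\d+(_\d+)?")


def to_unicode_codes(line):
    """Return the error codes in the toUnicode column, or [] if there are none."""
    data = line.split("#", 1)[0]  # assumption: '#' starts a comment
    fields = [field.strip() for field in data.split(";")]
    if len(fields) < 3:
        return []  # blank or comment-only line
    to_unicode = fields[2]
    if to_unicode.startswith("[") and to_unicode.endswith("]"):
        return to_unicode[1:-1].split()
    return []


def classify(line):
    """Return 'only', 'some', or 'none' for ToASCII codes under toUnicode."""
    codes = to_unicode_codes(line)
    flagged = [code for code in codes if TOASCII_CODE.fullmatch(code)]
    if flagged and len(flagged) == len(codes):
        return "only"
    return "some" if flagged else "none"


print(classify("B;\txn--0.pt;\t[A3];\t[A3]"))  # only
print(classify("B;\ta..c;\t[A4_2];\t[A4_2]"))  # only
```

Tallying the `only` and `some` results over every line of the file gives counts that can be compared with the figures above.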

This is particularly troubling, since while the Standard allows for ADDITIONAL
error cases beyond the ones already specified in IdnaTest.txt, a product conforming
to UTS #46 must produce an error on ALL error cases in IdnaTest.txt, per lines
68-72 of IdnaTest.txt, again reproduced below:

> > ... Thus to then verify conformance for the toASCII and toUnicode columns:
> > 
> > - If the file indicates an error, the implementation must also have an error.
> > - If the file does not indicate an error, then the implementation must either have an error, or must have a matching result.

A close examination of the 36 entries mentioned above reveals that:

- 9 of the 36 entries have only an "[A3]" error code under ToUnicode, which
corresponds to the Punycode-encoding step in ToASCII. The source domains all
have one label with invalid Punycode-encoding though, so they would in fact
have already recorded an error in no. 4 of Processing Steps, which is called
upon by ToUnicode as well. In other words, these entries merely have a
faulty error code; ToUnicode would still record an error for these entries,
just one at a different step than advertised.

  Some samples from these 9 entries are:

  Line 313: B;	xn--0.pt;	[A3];	[A3]
  Line 315: B;	xn--a-Ä.pt;	[A3];	[A3]
  Line 316: B;	xn--a-A\u0308.pt;	[A3];	[A3]

- The other 27 entries have only a "[A4_2]" error code under ToUnicode, which
corresponds to the DNS length verification step under ToASCII. Some of them

  Line 201: B;	。;	[A4_2];	[A4_2]
  Line 202: B;	.;	[A4_2];	[A4_2]
  Line 434: B;	a..c;	[A4_2];	[A4_2]
  Line 439: B;	ä.\u00AD.c;	[A4_2];	[A4_2]

  While these domain names are all rather unlikely to be allowed by real-world
  UTS #46 implementations, most (if not all) of them are still strictly
  allowed by ToUnicode as defined in UTS #46.
  Take line 201, for example. Step 1 of ToUnicode calls into the Processing
  Steps, whose step 1 will map '。' to '.', and which will then pass through
  the rest of Processing Steps without recording an error. Step 2 of ToUnicode
  will then produce a "converted Unicode string" of '.', and signal there was
  no error.

The 27 entries in IdnaTest.txt with [A4_2] are the real worrying ones, since
they seem to go against the algorithms defined in UTS #46, and prevent us from
creating a strict implementation of UTS #46 without passing its own
conformance tests.
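
To illustrate why such sources trip a DNS length check but not the steps shared with ToUnicode, here is a simplified Python sketch of a per-label length test. It is only an approximation under stated assumptions: it checks label lengths (1 to 63) alone, ignores the overall domain length limit and any special treatment of a trailing root label, and expects input that is already in ASCII form. The function name is illustrative.

```python
def label_length_errors(ascii_domain):
    """Return (index, label) pairs for labels whose length is not in 1..63."""
    labels = ascii_domain.split(".")
    return [
        (index, label)
        for index, label in enumerate(labels)
        if not 1 <= len(label) <= 63
    ]


print(label_length_errors("a..c"))   # [(1, '')]  the middle label is empty
print(label_length_errors("a.b.c"))  # []         no violations
```

A check of this shape appears only in the length verification step of ToASCII, so an implementation that follows ToUnicode step by step has no point at which it would reject `a..c` or `.` for this reason.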

To resolve these issues, I would like to see the following:

- A clarification whether the aforementioned 27 entries should record an error in ToUnicode.
- Corresponding changes to IdnaTest.txt or UTS #46 that accompany that clarification.
- There be no entries in IdnaTest.txt with a ToUnicode error code that point to steps in ToASCII.


Timothy Gu

[UTS46]: http://www.unicode.org/reports/tr46/tr46-19.html
[IDNA-TEST]: http://www.unicode.org/Public/idna/10.0.0/IdnaTest.txt
[TR46JS-ISSUE]: https://github.com/Sebmaster/tr46.js/pull/13

Date/Time: Thu Aug 10 10:31:18 CDT 2017
Name: Ken Lunde
Report Type: Error Report
Opt Subject: UAX #45 datafile suggestion

While not an error, I propose that the UAX #45 datafile be annotated for
each version, at least since becoming a UAX (Version 6.3.0), to indicate the
version number and the number of characters that were added, such as via the
following comment lines (the "<nnn entries omitted>" lines are meant to make
what follows easier to understand):

UTC-00001;E;U+2B88A;4.2;0082.031;;kCowles 4762
<950 entries omitted>
# Version 6.3.0 Additions: 245
UTC-00953;UNC-2013;;167.10;1318.281;⿰钅哥;UTCDoc L2-12/333 204
<243 entries omitted>
UTC-01197;N;;1.6;0078.131;⿱合一;UTCDoc L2/13-009 19
# Version 7.0.0 Additions: 1
UTC-01198;N;;1.8;0078.171;⿳人伊一;UTCDoc L2/13-009 20
# Version 8.0.0 Additions: 3
UCI-01199;U;U+2F949;109.7;0809.030;⿰目夾;UTCDoc L2/14-260
UTC-01200;N;;85.10;0643.241;⿰氵恩;UTCDoc L2/15-109
UTC-01201;N;;112.5;0829.331;⿰⽯示;UTCDoc L2/15-114
# Version 9.0.0 Additions: 1,768
UTC-01202;H;;8.6;0089.191;⿱㐭水;UTCDoc L2/15-177 1
<1,766 entries omitted>
UCI-02969;U;U+2BDA4;46.18;0322.391;⿰山⿱㞌⿰㞌㞌;TUS U+2BDA4
# Version 10.0.0 Additions: 6
UTC-02970;N;;157.9;1230.291;⿰足迷;UTCDoc L2/16‐066 1
UTC-02971;N;U+3779;42.8;0297.231;⿱少免;UTCDoc L2/16‐066 2
UTC-02972;N;U+2F8A4;61.9;0396.071;⿰忄柬;UTCDoc L2/16‐269 1
UTC-02973;N;;9.9;0112.071;⿰亻革;UTCDoc L2/16-239R 1
UTC-02974;N;;116.8;0866.551;⿱穴卑;UTCDoc L2/16-239R 2
UTC-02975;N;;142.11;1098.441;⿰虫崩;UTCDoc L2/16-385R 1

As UAX #45 grows, this will make it easier to determine when a particular 
character was added without going back to previous versions.
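
As a rough illustration of how such annotations could be consumed, the
following Python sketch maps each entry to the version named in the nearest
preceding "# Version ... Additions: N" comment line. The function name
versions_by_entry, the regular expression, and the abbreviated sample are
assumptions made for illustration; they are not part of UAX #45 or its data
file.

```python
import re

# Assumed shape of the proposed comment lines,
# e.g. "# Version 9.0.0 Additions: 1,768".
VERSION_LINE = re.compile(r"^# Version (\d+(?:\.\d+)*) Additions: ([\d,]+)$")


def versions_by_entry(lines):
    """Map each entry identifier to the version of the preceding annotation.

    Entries that appear before any annotation map to None.
    """
    current = None
    result = {}
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        match = VERSION_LINE.match(line)
        if match:
            current = match.group(1)
            continue
        if line.startswith("#"):
            continue  # some other comment line
        identifier = line.split(";", 1)[0]
        result[identifier] = current
    return result


# Abbreviated sample built from the lines quoted in this report.
SAMPLE = """\
UTC-00001;E;U+2B88A;4.2;0082.031;;kCowles 4762
# Version 6.3.0 Additions: 245
UTC-01197;N;;1.6;0078.131;⿱合一;UTCDoc L2/13-009 19
# Version 7.0.0 Additions: 1
UTC-01198;N;;1.8;0078.171;⿳人伊一;UTCDoc L2/13-009 20
# Version 8.0.0 Additions: 3
UTC-01200;N;;85.10;0643.241;⿰氵恩;UTCDoc L2/15-109
"""

versions = versions_by_entry(SAMPLE.splitlines())
print(versions["UTC-01198"])
```

The sketch does not check that the stated counts match the number of entries
that follow; a validating tool could compare the two.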

Date/Time: Thu Sep 28 20:41:52 CDT 2017
Name: Pedro Navarro
Report Type: Public Review Issue
Opt Subject: UTR #50 property value for U+2026


According to the UTR #50 data file, U+2026 HORIZONTAL ELLIPSIS is marked as
'R', which means it should be rotated in vertical text (the same applies to
U+2025). I've tried several Japanese fonts (Noto, Heisei Maru Gothic) and
they provide a vertical variant for it. Shouldn't the property value for
those characters be, instead, 'Tr'? Or are we to consider that a
particularity of the font?


Error Reports

Date/Time: Mon Jul 31 13:05:32 CDT 2017
Name: Marcel Schneider
Report Type: Other Question, Problem, or Feedback

Iʼve just become aware that the Auspicious sign subheading in the new 
Nandinagari block is not as good as I thought it to be when 
reviewing for PRI #353. Unfortunately this is now closed, but Iʼll 
send you this anyway off PRI (leaving it to your convenience 
whether to add the below, or not). I think that this item could be 
processed as well at beta review, where it could be sent again.

The reason this seems important to me is to make clear that 
the proposers did not aim to do things differently, but can be 
presumed to be aware of existing usage in the Standard and to seek 
consistency as far as feasible. Displaying originality for its own 
sake does not make sense to me. 

(You may read between these lines that when I change things for the 
French translation, itʼs really because I see a need of more 
accuracy and better consistency, for an overall greater usefulness 
and increased reputation of the Standard among its users. While 
opening thus a presumably positive reputation gap in favor of the 
French translation of the repertoire, I regularly try to propose 
the changes to your attention in case Unicode might wish to 
implement part or all of them for a final equalization.)

Best regards,

Iʼve just become aware that the Auspicious sign subheading for 
11BD2 NANDINAGARI SIGN SIDDHAM could as well be Invocation sign.
As the encoding proposal states: “The sign [SIDDHAM] is used as an 
invocation at the beginning of documents.” 
For consistency with other instances in the Standard such as 
U+A8FC, one could actually wish to replace “Auspicious sign” with 
“Invocation sign” in the future Nandinagari block. 

Date/Time: Wed Aug 2 13:30:57 CDT 2017
Name: Kent Karlsson
Report Type: Public Review Issue
Opt Subject: Inappropriate remark in draft

In http://www.unicode.org/L2/L2017/17190-n4824-pdam1-3chart.pdf:

• used to abbreviate units of measure

http://www.unicode.org/L2/L2015/15338-n4706-nko-additions.pdf: "For instance,
it is used with  ka as  to abbreviate  kúdɛ ‘kilometre’, with  fa as 
to abbreviate  fele ‘megametre’, with  gba as  to abbreviate 
gbàlàgbala ‘metre’, with  sa as  to abbreviate  sidɔ ‘gram’, and with 
ta as  to abbreviate  tóngba ‘litre’. Examples of letters with
DANTAYALAN connected to another letter are  gbaw. ‘mm.’ and  gbach. ‘cm.’.
(See Figures 1, 2, 3.)" (In the paste from the PDF, some chars got botched.)

While much of this text in n4706 is highly objectionable by itself, that is a
separate issue.

However, hinting (in Unicode charts) ["used to abbreviate units of measure"]
that SI "short forms" for units are abbreviations is a MAJOR misunderstanding.
The SI "short forms" are SYMBOLS (made from letters). They ARE mnemonic, but
they are NOT(!!!) abbreviations. In this lies, among other things, natural
language *independence*. While they must be language independent, it is
understandable if one wants to transliterate the unit symbols to the "local
script". That will still not make the SI unit symbols abbreviations, and the
transliteration scheme must respect the design of the SI unit symbols
(prefixes, etc.), which the examples in n4706 appear not to do.

Besides, all other scripts appear to manage just fine without having a special
underline (or similar) to mark unit denotations. This points to NKO
DANTAYALAN being a bad idea to begin with.

Date/Time: Mon Aug 21 08:21:32 CDT 2017
Name: David Corbett
Report Type: Error Report
Opt Subject: Indic_Syllabic_Category of U+0A51

U+0A51 GURMUKHI SIGN UDAAT should have an Indic_Syllabic_Category. It is a
tone mark, but it goes before any vowel sign. Its proposal document says “In
many ways, Udaat should be treated as a subjoined consonant”, so I suggest

Date/Time: Wed Sep 27 19:24:13 CDT 2017
Name: David Corbett
Report Type: Error Report
Opt Subject: Underspecified Ahom vowel signs

An Ahom consonant may take multiple vowel signs, all of which have ccc=0.
The Unicode Standard does not say what order they should be encoded in. The
proposal (L2/12-309R) recommends an order, but contradicts itself: on page
2, it says U+1172A AHOM VOWEL SIGN AM should precede U+11724 AHOM VOWEL SIGN
U, but on page 3, it gives the opposite order. It is therefore unclear what
the intended order is.

Date/Time: Mon Oct 2 07:41:44 CDT 2017
Name: Jonathan Kew
Report Type: Error Report
Opt Subject: Character missing from IndicSyllabicCategory.txt

It appears that U+0980 BENGALI ANJI is missing from
IndicSyllabicCategory.txt, although as an expected base for U+0981 BENGALI
SIGN CANDRABINDU, it seems like it really should appear.

(The proposal for U+0980, http://unicode.org/L2/L2011/11359-bengali-anji.pdf,
confirms that <0980, 0981> is a valid cluster for the script.)

Date/Time: Mon Oct 2 09:04:57 CDT 2017
Name: Jonathan Kew
Report Type: Error Report
Opt Subject: Inconsistency in IndicSyllabicCategory data

It seems logical that all the "Marks of nasalization" at U+A8F2 to A8F7
would have the same Indic category; AFAICS they all behave/render similarly.

But currently the IndicSyllabicCategory.txt file classifies two of them as

A8F2..A8F3    ; Bindu # Lo   [2] DEVANAGARI SIGN SPACING

but leaves the remainder uncategorized. Is there any good reason for this,
or should they be harmonized?
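
A check of this kind can be sketched in Python. The parser below reads lines
in the "code point(s) ; value # comment" layout quoted above and lists the
code points in a range that have no value assigned. The function names and
the short sample are assumptions made for illustration; the sample is not the
real file contents, so the result only mirrors the situation described in
this report.

```python
def parse_categories(lines):
    """Return {code point: value} from 'XXXX ; Value' or 'XXXX..YYYY ; Value' lines."""
    table = {}
    for raw in lines:
        line = raw.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        span, value = (part.strip() for part in line.split(";", 1))
        first, _, last = span.partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start
        for cp in range(start, end + 1):
            table[cp] = value
    return table


def uncategorized(table, start, end):
    """List code points in [start, end] that have no entry in the table."""
    return [f"U+{cp:04X}" for cp in range(start, end + 1) if cp not in table]


SAMPLE = [
    "# Hypothetical excerpt, not the full data file",
    "0A71          ; Gemination_Mark # Mn GURMUKHI ADDAK",
    "A8F2..A8F3    ; Bindu",
]

table = parse_categories(SAMPLE)
print(uncategorized(table, 0xA8F2, 0xA8F7))
```

Run against the full data file instead of the sample, the same function
would show which code points in any block lack an explicit entry.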

Date/Time: Tue Oct 3 09:03:28 CDT 2017
Name: Jonathan Kew
Report Type: Error Report
Opt Subject: Indic Syllabic Category value Gemination_Mark should be subdivided

It looks to me like the Gemination_Mark category should probably be split.
Currently, there are three characters with this property in

0A71 ; Gemination_Mark # Mn GURMUKHI ADDAK
11237 ; Gemination_Mark # Mn KHOJKI SIGN SHADDA
11A98 ; Gemination_Mark # Mn SOYOMBO GEMINATION MARK

However, AIUI the Gurmukhi mark is different from the other two, in that it
indicates gemination of the following consonant, whereas the others indicate
gemination of the preceding consonant. This suggests that GURMUKHI ADDAK
would follow any matras etc on the preceding consonant and appear at the
very end of a cluster, whereas the Khojki and Soyombo marks (and the
Gujarati one U+0AFB that should be treated similarly) belong immediately
after the consonant they modify, and precede vowel matras. They're
functionally quite different, and fit into the syllable structure in
different places.

Date/Time: Sat Oct 7 13:20:17 CDT 2017
Name: David Corbett
Report Type: Error Report
Opt Subject: Obsolete alias for U+1039 MYANMAR SIGN VIRAMA

U+1039 MYANMAR SIGN VIRAMA has the names list alias “killer (when rendered
visibly)”. It should not: it is never rendered visibly (except for fall-back
rendering like a subscript plus sign, which doesn’t count). This is left
over from the pre-5.1 version of Myanmar, before the visible killer was
disunified as U+103A MYANMAR SIGN ASAT. Now that U+1039 is purely an
abstract subjoiner without any glyph of its own
(Indic_Syllabic_Category=Invisible_Stacker), that alias is obsolete and

Date/Time: Sun Oct 22 08:45:39 CDT 2017
Name: Charlotte Buff
Report Type: Error Report
Opt Subject: Typo in Proposed Character Name (L2/17-372)

The character U+10F45 SODGIAN PHONOGRAM SHIN in the proposed Sogdian block 
(see http://www.unicode.org/wg2/docs/n4872-DAM1chart.pdf, page 78) has a 
typo in its name. The script identifier is spelled SODGIAN with the G and D 
switched. It should be SOGDIAN PHONOGRAM SHIN.

Date/Time: Tue Oct 24 10:31:03 CDT 2017
Name: Brienna Carter
Report Type: Error Report
Opt Subject: Error in Name of Emoji

Dear Unicode,

Today I was texting on my MacBook, composing a message to a friend that was
coupled with an emoji to make my tone more explicit. Upon hovering over
emojis to choose, I discovered that you can see what each one is defined as.
This feature spurred me to leap into the endeavor of looking at the various
labels of emojis. It was all fun and games looking at some oddly specific
descriptions and finally figuring out what a "part alternation mark" is
until I came across what I thought was a sneaker or–as you and pockets of
mid-western United States call them–tennis shoes. Personally, I was offended
by this finding. I am fully aware that the two terms "sneaker" and "tennis
shoe" are colloquially synonyms, yet the term sneaker is much more generic
and wholly acceptable. A tennis shoe technically points to a shoe used for
the sport of tennis. By accepting this title as the principal label for this
type of shoe, we run into many problems. The first being how we
differentiate between an actual shoe used for tennis and the general term
tennis shoe; it is simply awkward and unacceptable to call such a "tennis
tennis shoe." Tennis shoe also reminds us of the history of sneakers: a shoe
once worn principally for athletics. Today, tennis shoes/sneakers are worn
on a daily basis for just everyday life. The term "sneaker" is more
accepting of this modern-day fashion statement. As a whole, the United
States (not the only users of your emojis but a large portion of English
speakers who do) uses the word "sneaker" much more frequently. In fact, it
is searched over "tennis shoe" by the majority on Google in each state
except Mississippi. Therefore, the majority should rule and Unicode should
conform. Until two emojis exist, the sole emoji of a white and gray shoe
should be defined as a "sneaker."

Brienna Carter

Other Reports

Date/Time: Wed Oct 11 14:56:49 CDT 2017
Name: Ken Lunde
Report Type: Other Question, Problem, or Feedback
Opt Subject: UTS #37 suggestion

In response to WG2 N4829 Section 12, "Request for Consideration of Relaxing
IVD Rules: IRG M48.3 with reference to Part B of IRGN2219," I propose that
the following text be appended to the fourth paragraph of UTS #37 Section 2,
"Description," or as a separate paragraph that immediately follows the
fourth paragraph:

In an effort to reduce the number of encoded variants, the unification rules
for unified ideographs, when applied to the IVD, have been expanded to
include cases whereby 1) characters that have a different structure, but
whose difference is not considered significant enough to encode them as
separate unified ideographs, and for which strong evidence associating them
as variants of encoded characters can be provided, such as ⿱汨皿 versus ⿰氵昷
(U+6E29 温) and ⿱戠火 versus ⿹戠火 (U+243B7 𤎷); and 2) characters with the same
structure, but with different components at the second (or subsequent) level
that may not be generally unifiable, and for which strong evidence
associating them as variants of encoded characters can be provided, such as
⿺𠃊西 versus ⿺辶西 (U+8FFA 迺) and ⿰月㲋 versus ⿰月𣬉 (U+818D 膍). When 
considering the second case, the character should be rarely used and not in 
general circulation, and the registrant is expected to provide evidence that 
demonstrates 1) similarity of glyph shape; and 2) general acceptance as a