L2/14-179

Comments on Public Review Issues
(May 2 - July 29, 2014)

The sections below contain links to permanent feedback documents for the open Public Review Issues as well as other public feedback as of July 29, 2014, since the previous cumulative document was issued prior to UTC #139 (May 2014). This document does not include feedback on moderated Public Review Issues from the forum that have been digested by the forum moderators; those are in separate documents for each of the PRIs. Grayed-out items in the Table of Contents do not have feedback here.

Contents:

The links below go directly to open PRIs and to feedback documents for them, as of July 29, 2014. Gray rows have no feedback to date.

Issue   Name   Feedback Link
278 Proposed Update UTR #50, Unicode Vertical Text Layout (feedback)
277 Reconciling Script and Script_Extensions Character Properties (feedback)
276 Feedback on repertoire for ISO/IEC 10646:2014 (4th Edition, Amendment 2) (feedback)
273 Proposed Update UTS #39, Unicode Security Mechanisms (feedback)
272 Proposed Update UTR #36, Unicode Security Considerations (feedback)

The links below go to locations in this document for feedback.

Feedback on Encoding Proposals
Feedback on UTRs / UAXes
Error Reports
Other Reports

 


Feedback on Encoding Proposals

None at this time.


Feedback on UTRs / UAXes

Date/Time: Thu May 22 08:01:18 CDT 2014
Name: Anne van Kesteren
Report Type: Public Review Issue UTS#46
Opt Subject: Domain syntax

I just wanted to clarify something with regards to the review note at the end
of http://www.unicode.org/reports/tr46/proposed.html#Implementation_Notes

What I'd really like to see is a syntax description. E.g. a domain consists of
domain labels separated from each other by domain label separators, optionally
with a trailing domain label separator. A domain label is a sequence of one or
more code points which are one of X, Y, and Z. A domain label separator is one
of X, Y, and Z. Alternatively you could express this using ABNF or some kind
of grammar.

That is the kind of thing people writing validators or authoring tools are
often looking for. And often web developers as well. They don't want to have
to put some input they made up through a series of functions before they know
whether the input is valid. I guess another way of saying this would be having
a declarative description of a domain.

(This is an open issue https://www.w3.org/Bugs/Public/show_bug.cgi?id=25334
for the URL Standard.)
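
For illustration only, a declarative description of that shape might look like the
following Python sketch. The character classes are placeholders rather than the
actual UTS #46 rules; only the separator set (FULL STOP and the three stops that
UTS #46 maps to it) is taken from the specification.

import re

# Illustrative sketch only: LABEL below is a placeholder ("anything but a
# separator"), not the real UTS #46 validity rules for a label.
LABEL_SEPARATOR = r"[\u002E\u3002\uFF0E\uFF61]"   # FULL STOP and its mapped variants
LABEL = r"[^\u002E\u3002\uFF0E\uFF61]+"

DOMAIN = re.compile(rf"^{LABEL}(?:{LABEL_SEPARATOR}{LABEL})*{LABEL_SEPARATOR}?$")

print(bool(DOMAIN.match("example.org")))    # True
print(bool(DOMAIN.match("example.org.")))   # True  (trailing separator allowed)
print(bool(DOMAIN.match("..example")))      # False (empty label)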

Date/Time: Wed May 28 15:54:40 CDT 2014
Name: Richard Wordingham
Report Type: Error Report UTS #18
Opt Subject:
Definition of Unicode Set in Unicode Regular Expressions

Unicode Technical Standard #18 'Unicode Regular Expressions' Revision 17 refers to Unicode 
sets, but does not define them.  I have been told that the definition is meant to be taken 
from UTS#35, the LDML specification, and that there ought to be a cross-reference to that 
definition.

Section 1.3 of UTS#18 contains two examples, 
"[\p{L}--QW]" and "[\p{Assigned}--\p{Decimal Digit Number}--a-fA-Fa-fA-F]", 
which appear not to conform to the LDML syntax.  Further details are given at 
http://unicode.org/cldr/trac/ticket/7507 .

Date/Time: Sat Jun 7 14:23:13 CDT 2014
Name: Dmitry S.
Report Type: Error Report
Opt Subject: Possible typo in UTR #31

Hello,
In http://www.unicode.org/reports/tr31/ clause R7 says:

"R7 Filtered Case-Insensitive Identifiers
To meet this requirement, an implementation shall specify either simple or full case 
folding, and adhere to the Unicode specification for that folding. Except for identifiers 
containing excluded characters, allowed identifiers must be in the specified Normalization Form."

Is a Normalization Form truly meant here or is it a case-folding form?

Thanks,
Dmitry S.

Date/Time: Wed Jun 11 18:50:32 CDT 2014
Name: Norbert Lindenberg
Report Type: Error Report
Opt Subject: Inconsistency wrt/ variation selectors in UAX 31

Unicode Standard Annex 31, UNICODE IDENTIFIER AND PATTERN SYNTAX, is 
inconsistent in its description of variation selectors:

- Section 2.3 describes the risks associated with variation selectors 
(and other default-ignorable characters), and says “Variation selectors ... 
are not included in the default identifier syntax”, and “default-ignorable 
characters are normally excluded from Unicode identifiers”.

- Section 2, however, includes all nonspacing marks in ID_Continue, and 
does nothing to exclude variation selectors, which are nonspacing marks. 
And indeed, DerivedCoreProperties.txt does have the entries

180B..180D    ; ID_Continue # Mn   [3] MONGOLIAN FREE VARIATION SELECTOR ONE..MONGOLIAN FREE VARIATION SELECTOR THREE
FE00..FE0F    ; ID_Continue # Mn  [16] VARIATION SELECTOR-1..VARIATION SELECTOR-16
E0100..E01EF  ; ID_Continue # Mn [240] VARIATION SELECTOR-17..VARIATION SELECTOR-256
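
For reference, the General_Category values behind that derivation can be checked
directly with Python's unicodedata module (ID_Continue is derived to include
gc=Mn characters unless they carry Pattern_Syntax or Pattern_White_Space):

import unicodedata

# The variation selectors are gc=Mn, which is why the derived ID_Continue
# data above includes them despite the Section 2.3 advice.
for cp in (0x180B, 0xFE0F, 0xE0100):
    ch = chr(cp)
    print(f"U+{cp:04X} {unicodedata.name(ch)}: gc={unicodedata.category(ch)}")
# U+180B MONGOLIAN FREE VARIATION SELECTOR ONE: gc=Mn
# U+FE0F VARIATION SELECTOR-16: gc=Mn
# U+E0100 VARIATION SELECTOR-17: gc=Mn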

Date/Time: Fri Jun 13 22:36:38 CDT 2014
Name: Dmitry S.
Report Type: Error Report
Opt Subject: Typo in paragraph 3.6 of UTS #18 Unicode Regular Expressions


Hello, In section "3.6 Context Matching"
http://www.unicode.org/reports/tr18/#Context_Matching there is a typo in the
table with examples: the last column of the last two rows contains a string
"ca not" which should be corrected to "cannot".

Thanks,
Dmitry S.

Date/Time: Tue Jun 17 14:46:22 CDT 2014
Name: Dmitry S.
Report Type: Error Report
Opt Subject: Typo in UTS #10 Unicode Collation Algorithm


Hello,
There is a typo in section "3.8.1 Default Values" of UTS #10 Unicode Collation Algorithm 
(both 6.3.0 and 7.0.0): in the last sentence of the first paragraph it is written as follows:
"The unmarked characters will a3) equal to MIN3."
It seems that this should be corrected to the following: "The unmarked characters will 
have a3 equal to MIN3."

Thanks,
Dmitry S.

Date/Time: Wed Jun 18 15:40:40 CDT 2014
Name: Dmitry S.
Report Type: Error Report
Opt Subject: Possible error in UTS #10 Unicode Collation Algorithm


Hello,
in UTS #10 Unicode Collation Algorithm version 7.0.0 clause S2.1.2 
(http://www.unicode.org/reports/tr10/#S2.1.2) there seems to be an error
in a note below the clause:

"Note: A non-starter in a string is called blocked if there is another non-starter
of the same canonical combining class or zero between it and the last character of 
canonical combining class 0."

The "... non-starter of the same canonical combining class OR ZERO..." part seems 
erroneous to me because of the following:

1) UAX #15 http://www.unicode.org/reports/tr15/#Description_Norm defines non-starter 
as follows: "Most characters (including all non-combining marks) have a Canonical_Combining_Class 
value of zero, and are unaffected by the Canonical Ordering Algorithm. Such characters 
are referred to by a special term, starter. Only the subset of combining marks which have 
non-zero Canonical_Combining_Class property values are subject to potential reordering by 
the Canonical Ordering Algorithm. Those characters are called non-starters."

2) D107 Starter definition in the Unicode Standard: "D107 Starter: Any code point (assigned 
or not) with combining class of zero (ccc=0)."

These excerpts imply that a non-starter cannot have a Canonical_Combining_Class value of 
zero (ccc=0), contrary to what the quoted note states.

Thanks,
Dmitry S.

Analysis of the above report by Ken Whistler, 2014/06/18:

O.k., yes, this *is* a problem in wording, and it is non-trivial to
fix.

The note in question goes at least back to Version 4.0 of UTS #10,
although its position in the text migrated a bit later on. In the
UTS #10 4.0 version, it is:

Note: A combining mark in a string is called blocked if there is 
another combining mark of the same canonical combining class or zero 
between it and the last character of canonical combining class 0.

right below Step 2 in Section 4.2. It logically refers to Step 2.1.2,
which is where the note was later moved.

Then a comedy of errors ensues. In later versions of the text,
the note was updated by replacing "combining mark" with "non-starter",
without adjusting the text "or zero" correctly.

But wait! It gets worse. This text, which was derived from the 4.0 version
of UAX #15, where it defined starter for normalization, was not then
adjusted for Corrigendum #5 (from February, 2005!), which inserted the
wording "or higher" in the definition of blocked in UAX #15. And disconnected
as it was, it then certainly did not follow the later move of all the
definitions related to normalization *out* of UAX #15 and into Chapter 3
of the core spec (as of Version 5.2.0). And when they went into Chapter 3,
the wording for "starter" was essentially unchanged, but the wording
for "blocked" got a complete overhaul.

So my conclusion is that all of the wording about starter and blocked in
UTS #10 needs a serious update, to make correct references to the
*current* definitions in Chapter 3, rather than using ad hoc, out-of-date
definitions from 2005 derived from a long-superseded version of UAX #15.
Doing *that* will require some significant work on this section of the
text.

--Ken
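
For comparison, a minimal Python sketch of the blocked test, assuming the current
core-spec wording (an intervening character with ccc of zero, or greater than or
equal to the ccc of the trailing character, blocks it):

import unicodedata

def is_blocked(s: str, a: int, c: int) -> bool:
    # Sketch of the core-spec style test: s[c] is blocked from s[a] iff
    # ccc(s[a]) == 0 and some character strictly between them has
    # ccc == 0 or ccc >= ccc(s[c]).
    ccc = unicodedata.combining
    if ccc(s[a]) != 0:
        return False
    return any(ccc(b) == 0 or ccc(b) >= ccc(s[c]) for b in s[a + 1:c])

# U+0328 COMBINING OGONEK has ccc=202, U+0301 COMBINING ACUTE ACCENT has ccc=230:
print(is_blocked("A\u0328\u0301", 0, 2))   # False: the ogonek does not block the acute
print(is_blocked("A\u0301\u0301", 0, 2))   # True:  an acute blocks a following acute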

Date/Time: Thu Jun 19 11:18:19 CDT 2014
Name: Addison Phillips
Report Type: Error Report
Opt Subject: Bad example in Figure 2, UAX#15

Figure 2 in UAX#15 (Normalization Forms) contains examples of different types
of "compatibility equivalence". The second line in this table is for "breaking
differences" and shows the hyphen-minus character as the example. However, the
only example I can find in TUS or the UCD of a "breaking difference" that is a
case of compatibility decomposition (in fact, it is cited in Chapter 2 of TUS)
is between U+00A0 (non-breaking space) and regular space.

While it's really difficult to illustrate different kinds of space characters
in a table, perhaps using a placeholder ("NBSP", "(non-breaking space)", etc.)
might work? Or maybe add some attendant prose to explain the table?

Note: The term "breaking difference" appears nowhere else that I can find in
UAX15 or in the relevant sections of TUS related to compatibility
decomposition.
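
For reference, the non-breaking space example is easy to check: the <noBreak>
compatibility decomposition of U+00A0 only takes effect under compatibility
normalization.

import unicodedata

print(unicodedata.normalize("NFD",  "\u00A0") == " ")   # False: NFD leaves NBSP alone
print(unicodedata.normalize("NFKD", "\u00A0") == " ")   # True:  NFKD maps NBSP to SPACE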

Date/Time: Sat Jun 21 19:05:39 CDT 2014
Name: Samuel Bronson
Report Type: Error Report
Opt Subject: UAX #11: refers to biwidth fonts as "legacy"

In UAX#11, you say:

>> An important class of fixed-width legacy fonts contains glyphs of just two widths, 
>> with the wider glyphs twice as wide as the narrower glyphs.

I don't think it's correct to think of all such fonts as "legacy": such fonts tend to be 
popular with programmers, and I get the impression that, say, Japanese people usually like 
text to be typeset on a grid, too.

(Granted, the ones that make characters fullwidth *just* because they are encoded using 
two bytes in some encoding or other are a bit silly.)

If we could only get sensible wcwidth() values even for latin/punctuation/math characters and 
make the fonts to match, we'd *really* have something ... say, making EM DASH perceptibly wider 
than HYPHEN-MINUS?

Date/Time: Sat Jun 28 07:52:44 CDT 2014
Name: Diego Perini
Report Type: Other Question, Problem, or Feedback
Opt Subject: Correction for #Validity_Criteria UTS #46


There is a small syntax error in:

http://www.unicode.org/reports/tr46/#Validity_Criteria

the text:

"2 - The label must not contain a U+002D HYPHEN-MINUS character 
in both the third position and fourth positions."

Should be changed to:

"2 - The label must not contain a U+002D HYPHEN-MINUS character 
in both the third and fourth positions."


Date/Time: Mon Jul 14 00:05:39 CDT 2014
Name: Karl Williamson
Report Type: Error Report
Opt Subject: UTS18 typo


The final line in Section 1.2 should be
\p{Script_Extensions=Katakana}
NOT \p{Script_Extensions=Hiragana}

Date/Time: Mon Jul 14 15:29:43 CDT 2014
Name: Markus Scherer
Report Type: Error Report
Opt Subject: UAX #38 kDefaultSortKey should distinguish traditional vs. simplified radicals

UAX #38 says:
2.1 Database design
kDefaultSortKey
"Bits 23-30 are the character’s KangXi radical number used [...] The difference 
between simplified and traditional radical is ignored."

This appears to be incorrect: The Han code chart
(http://www.unicode.org/charts/PDF/U4E00.pdf) shows that the forms of the
radicals are distinguished. For example, the characters with radical 120
(silk) are grouped together, and followed by the group of those with radical
120' (silk/C-simplified). See the chart at U+7CF8 and U+7E9F.

I expect that most if not all of the main Unihan block (4E00..9FFF) should
follow the kDefaultSortKey order. If this expectation is not intended to be
true, it should be documented for kDefaultSortKey. (I assume that possible
exceptions would be due to corrections of the Unihan data since the original
allocation.)

I suggest either restating the default sort key as something other than int
bit fields (with the added distinction), or else using unsigned int (32-bit)
or long (64-bit) bit fields, adding one bit for traditional (0) vs. simplified
(1).

Given the existing action items for kDefaultSortKey ([139-A19a], [139-A21], 
see http://www.unicode.org/review/pri266/feedback.html)
I suggest to simplify it as follows:

Use a 64-bit integer with a less dense and therefore less error-prone encoding:

Bits 20.. 0  code point (avoids complications re [139-A19a])
Bit  23      set to 0 if the code point is U+4E00..U+FFFF,
             else set to 1
             ([139-A21], UCA implicit weights BASE FB40 vs. FB80)
Bits 29..24  residual stroke count (0..63)
Bit  30      set to 0 if traditional radical form (e.g., 120),
             set to 1 if simplified (120')
Bits 39..32  radical number (1..214)
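
A sketch of packing that proposed layout, purely to illustrate the field
positions listed above; the example assumes radical 120 with zero residual
strokes for U+7CF8.

def default_sort_key(code_point: int, radical: int, simplified: bool,
                     residual_strokes: int) -> int:
    # Field positions taken from the layout proposed above, not from UAX #38.
    assert 1 <= radical <= 214 and 0 <= residual_strokes <= 63
    key = code_point & 0x1FFFFF                    # bits 20..0: code point
    if not (0x4E00 <= code_point <= 0xFFFF):
        key |= 1 << 23                             # bit 23: outside U+4E00..U+FFFF
    key |= residual_strokes << 24                  # bits 29..24: residual stroke count
    if simplified:
        key |= 1 << 30                             # bit 30: simplified radical form
    key |= radical << 32                           # bits 39..32: radical number
    return key

print(hex(default_sort_key(0x7CF8, 120, False, 0)))   # 0x7800007cf8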

Date/Time: Thu Jul 31 22:00:08 CDT 2014
Name: Markus Scherer
Report Type: Public Review Issue
Opt Subject: WD UTR #51 Unicode Emoji

The <title> says "UTS #51". It's not a UTS. Please change to "Working Draft UTR #51".

Section 1 Introduction is good, but I feel strongly that the section on Longer
Term Solutions should follow right after, rather than late in the document.

The document points to at least one doc in unicode.org/~scherer/ -- we should
copy that into a permanent location, for example reports/tr51/.

I suggest deleting 1.2 Goals. It duplicates some of the ToC; it says that the
material is subject to change (as usual); and the last sentence "This document
does not discuss..." should be merged into the Summary at the top which
partially contradicts it.

5 Sorting -- I am personally a bit skeptical about the need for sophisticated
sorting *among* symbols, including Emoji.

6 Searching -- this is useful information, but very different from "search" as
in UTS #10, for example, and it covers a variety of methods. This makes the
heading misleading. Please rename to "Input Methods" or "Selection Methods" or
similar.

Data charts: It would be useful to repeat the column headings once in a while,
at least in long, multi-column tables as in full-emoji-list.

Error Reports

Date/Time: Thu May 15 17:29:05 CDT 2014
Name: Richard Wordingham
Report Type: Error Report
Opt Subject: TUS: Special Cases with Malayalam RA

NOTE: The editorial committee has already looked at this feedback. Some of the items are complete, and the committee is dealing with other issues in the Malayalam block intro.

TUS 6.2/6.3 Section 9.9 ‘Special Cases Involving ra’ has a number of
problems and errors.

1)       The title should say ‘rra’, not ‘ra’.

2)       The following paragraph leaves the impression that <0D31,
 0D31> might be treated as a unit in rendering.  The paragraph  
 following it needs to dispel that impression.

“Repetition of the letter, written either റ്റ or ററ, is also used for
the sound /tt/. The sequence of two റ letters fundamentally behaves as
a digraph in this instance. The digraph can bear a vowel sign in which
case the digraph as a whole acts graphically as an atom: a left vowel
part goes to the left of the digraph and a right vowel part goes to the
right of the digraph. Historically, the side-by-side form was used
until around 1960 when the stacked form began appearing and supplanted
the side-by-side form. As a consequence the graphical sequence ററ in
text is ambiguous in reading. The reader must generally use the context
to understand if this is read /rr/ or /tt/. It is only when a vowel
part appears between the two റ that the reading is unambiguously /rr/.
Note that similar situations are common in many other orthographies. For
example, th in English can be a digraph (cathode) or two separate
letters (cathouse); gn in French can be a digraph (oignon) or two
separate letters (gnome).”

3)       The following paragraph is false.  For example, <0D31, 0D31,
 0D46> is rendered as ററെ.  
“The sequence <0D31, 0D31> is rendered as ററ, regardless of the reading
of that text. The sequence <0D31, 0D4D, 0D31> is rendered as റ്റ. In
both cases, vowels signs can be used as appropriate, as shown in Table
9-31.”

To address this and the previous problem, I suggest replacing it
by:

“The sequence <0D31, 0D31> is rendered as ററ, possibly with the
incorporation of vowel signs between, regardless of the reading of that
text.  A vowel appearing on the left must be encoded after the first
occurrence of 0D31, and a vowel appearing on the right must be encoded
after the second occurrence of 0D31.   Two-part vowel characters may
not be used with the side-by-side digraph.  The sequence <0D31, 0D4D,
 0D31> is rendered as റ്റ, and vowel signs are encoded after it.  
Examples are shown in Table 9-31.”

Date/Time: Sun Jun 29 06:33:12 CDT 2014
Name: Claus Faerber
Report Type: Error Report
Opt Subject: Inconsistency between IdnaMappingTable.txt and IdnaTest.txt

Hi,

I'm the author of the perl module Net::IDN::Encode (available on CPAN), which
uses automated testing based on the IdnaTest.txt data file provided with
Unicode. After updating to Unicode 7.0.0 (module version 2.200), some of the
tests fail on a Unicode-enabled perl (v5.21.1).

This seems to be caused by inconsistencies in the data files provided with Unicode: 

For example, consider lines 4827 and 4828 in IdnaTest.txt:

B;      🌱.𐋱₂;   [P1 V6];        [P1 V6]
B;      🌱.𐋱2;   [P1 V6];        [P1 V6]

These strings contain '🌱' (U+1F331) and '𐋱' (U+102F2). The latter is new in
Unicode 7.0.0. The first string also contains '₂' (U+2082), the second '2'
(U+0032), both of which output as '2' (U+0032).

The tests indicate that processing should throw error P1 or V6, which would
indicate that the strings contain invalid characters.

However, according to the IdnaMappingTable.txt, all of the characters in these
strings are 'valid' (although they would not be valid under IDNA 2008):

2082          ; mapped                 ; 0032          # 1.1  SUBSCRIPT TWO
102E1..102FB  ; valid                  ;      ; NV8    # 7.0  COPTIC EPACT DIGIT ONE..COPTIC EPACT NUMBER NINE HUNDRED
1F330..1F335  ; valid                  ;      ; NV8    # 6.0  CHESTNUT..CACTUS

Only characters new in Unicode 7.0 seem to be affected. If I change the module
to treat all characters added in Unicode 7.0 as 'invalid', all tests are
successful.

I think the error is in IdnaTest.txt but I'm not completely sure.
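
A minimal sketch of the cross-check described above, assuming the published
semicolon-separated format of IdnaMappingTable.txt (code point or range,
status, optional mapping, optional IDNA2008 status):

def idna_status(code_point, path="IdnaMappingTable.txt"):
    # Return the status that IdnaMappingTable.txt assigns to a code point.
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.split("#", 1)[0].strip()
            if not line:
                continue
            fields = [field.strip() for field in line.split(";")]
            lo, _, hi = fields[0].partition("..")
            if int(lo, 16) <= code_point <= int(hi or lo, 16):
                return fields[1]
    return "unknown"

for cp in (0x2082, 0x102F2, 0x1F331):
    print(f"U+{cp:04X}: {idna_status(cp)}")
# Expected, per the lines quoted above: mapped, valid, valid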

Other Reports

Date/Time: Fri Jun 20 13:12:37 CDT 2014
Name: Roozbeh Pournader
Report Type: Error Report
Opt Subject: Glyph for U+1F44E THUMBS DOWN SIGN potentially wrong

The glyph for U+1F44E THUMBS DOWN SIGN may better show the back of the hand, as it's 
actually very hard to make such a gesture as shown.

Looking at the source glyphs at L2/09-027R2
(http://www.unicode.org/L2/L2009/09027r2-emoji-backgrnd.pdf),
it appears that the SoftBank glyph shows the back of the hand for
this character, while KDDI shows the front.

(From https://code.google.com/p/android/issues/detail?id=71948)

Date/Time: Tue Jun 24 09:22:05 CDT 2014
Name: Daniel Klein
Report Type: Other Question, Problem, or Feedback
Opt Subject: Normalisation of Indic scripts

Hi!

I was normalising some text into Form D with mixed Latin and Sinhala
characters and I was surprised that the Sinhala mark for "o" was decomposed
into "e" and "aa" (which is how it's typed on a Sinhala typewriter). I realise
that the character looks exactly like the other two combined but they don't
render the same as two characters (the combining ring is present) and have a
very different phonological meaning. e.g. කොළ (ක + ො + ළ) "kola" (green) &
කොළ (ක + ෙ + ‍ා + ළ) an impossible spelling (and probably pronunciation) of
"keaala" (no such word in Sinhala).

I checked on http://www.unicode.org/charts/normalization/chart_Sinhala.html
and noticed three other characters, too.

It seems to me the same as decomposing "d" into "cl" because if you combine
them they look the same. Also, "℅" does not become "c/o" in Form D, only in
Form KC, as well as other related symbols. I'm not sure that these Sinhala
characters should ever be decomposed, even in Form KD as it changes the
spelling, meaning, appearance and pronunciation of the words they appear in.

I had a quick look at Tamil and noticed the same thing. I would imagine that
this is the case for most Indic scripts in Unicode (almost all write "o" as a
combination of a preceding "e" and a following "aa").

Even more problematic is ෝ "oo" as ‍ා + ් never combine except with ‍ෙ. කෝ (ක
+ ෝ) vs කෝ (ක + ෙ + ා + ්).

If, however, you think I am wrong (there must have been a reason for doing it
this way) I would love to know the rationale. The only thing I can think of is
to maintain compatibility with proprietary encodings that don't have a
separate character for "o" but render all characters as they appear visually
but this seems like a bad idea to me as the text should be converted to
Unicode correctly in the first place.

Regards,

Daniel


// Addendum, July 20:

Hi Rick,

I happened to find the following in NamesList.txt:
@               Two-part dependent vowel signs
@+              These vowel signs have glyph pieces which stand on both
sides of the consonant; they follow the consonant in logical order, and
should be handled as a unit for most processing.
0DDC    SINHALA VOWEL SIGN KOMBUVA HAA AELA-PILLA
	= sinhala vowel sign o
	: 0DD9 0DCF
0DDD    SINHALA VOWEL SIGN KOMBUVA HAA DIGA AELA-PILLA
	= sinhala vowel sign oo
	: 0DDC 0DCA
0DDE    SINHALA VOWEL SIGN KOMBUVA HAA GAYANUKITTA
	= sinhala vowel sign au
	: 0DD9 0DDF

The important bit is "should be handled as a unit for most processing".
I believe that the current behaviour of normalising these characters
into their lookalikes goes against this statement.

Cheers,

Daniel
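
The decomposition described in this report can be reproduced with Python's
unicodedata module:

import unicodedata

# U+0DDC KOMBUVA HAA AELA-PILLA canonically decomposes to U+0DD9 + U+0DCF,
# so NFD pulls the two-part vowel sign apart.
kola = "\u0D9A\u0DDC\u0DC5"                    # ක + ො + ළ
nfd = unicodedata.normalize("NFD", kola)
print([f"U+{ord(ch):04X}" for ch in nfd])
# ['U+0D9A', 'U+0DD9', 'U+0DCF', 'U+0DC5']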


(Note: This came through the Unicode mail list:)

From: Benjamin Riefenstahl
Subject: Problem with Mandaic shaping, IT and IN switched
Date: Mon, 30 Jun 2014 22:47:39 +0200

Hi everybody,

I am currently in the process of designing a simple OpenType font for
Mandaic.  As some of you are probably aware, shaping in OpenType as it
is recommended by the OpenType standard requires that the application
(i.e. the text rendering engine) knows the joining behaviour of the
characters.

It seems that there is an error in the joining data for Mandaic as
defined by the Unicode standard (table 14-5 and 14-6, chapter 14.12 in
version 6.3) and by the file ArabicShaping.txt at
http://www.unicode.org/Public/UNIDATA/ArabicShaping.txt.

The tables list the character IT as dual-joining and the character IN as
right-joining.  These two seem to be switched.  In the table columns
with the actual characters (columns Xn, Xr, Xm, Xl) the correct
characters are given (compare the code chart at
http://www.unicode.org/charts/PDF/U0840.pdf), but the names (and the
relative positions in the tables) are wrong, and that error is then taken
over into the file ArabicShaping.txt:

   0847; MANDAIC IT; D; No_Joining_Group
   [...]
   084F; MANDAIC IN; R; No_Joining_Group

The correct characters in the table should be (in this order)

  * Dual-Joining: ATT, AK, AL, AM, AS, IN, AP, ASZ, AQ, AR, AT
  * Right-Joining: HALQA, AZ, IT, AKSA, ASH

And the correct data in ArabicShaping.txt:

   0847; MANDAIC IT; R; No_Joining_Group
   084F; MANDAIC IN; D; No_Joining_Group

Please advise what I can do to help correct this in some future version
of the Unicode standard.

Regards,
Benjamin Riefenstahl

--------------

Curiously I am having a hard time finding clear references.  There are
some Mandaic texts online where we can find examples, but I cannot find
a reliable theoretical discussion of the script at the level of detail
that I would wish for.  There is the "Mandäische Grammatik" by Theodor
Nöldeke, from 1875 (see
https://archive.org/details/mandischegramma01nlgoog), which has a note
to his table of the characters, but that note seems incomplete, it
reads:

  <zain>, <het>, <yod>, <shin> werden nicht nach links verbunden.
  (That is: <zain>, <het>, <yod>, <shin> are not joined to the left.)

The note quotes the characters in Hebrew letters.  It leaves out the
aleph (halqa), which also belongs in this group.

Regards,
Benjamin Riefenstahl

Some info from Rick McGowan:

The tables in the latest Core spec draft show:
    Table 9-19 contains "IN" as a dual joining letter.
    Table 9-20 contains "IN" as a right joining letter.

So, the English gloss "IN" appears in two different type tables.
That's one problem.

To help unravel, see Roozbeh's doc for Mandaic here:
http://www.unicode.org/L2/L2010/10413-mandaic-joining-type.pdf
and the original proposal here:
http://www.unicode.org/L2/L2008/08270r-n3485r-mandaic.pdf

The row of table 9-19 which is *labelled* "IN" should actually be "IT" --
at least according to the proposal. The shape looks like it, to me.

Date/Time: Thu Jul 10 12:09:22 CDT 2014
Name: Christian Lerch
Report Type: Error Report
Opt Subject: Coding error for age property in UCD

At least in versions 6.3.0 and 7.0.0 (I haven't checked others) there is inconsistent 
coding of the Age property value "Unassigned" between the UCD file 
PropertyValueAliases.txt and the UCDXML files.
In the former, the abbreviated name (2nd field) for the value "Unassigned" is given as "NA".
In the latter, however, instead of age="NA" you find age="unassigned", which has 
no entry in PropertyValueAliases.txt.
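
A small sketch of the consistency check this implies, assuming the published
format of PropertyValueAliases.txt:

def age_aliases(path="PropertyValueAliases.txt"):
    # Collect every alias listed for the Age property.
    aliases = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.split("#", 1)[0].strip()
            if line.startswith("age"):
                aliases.update(field.strip() for field in line.split(";")[1:])
    return aliases

print("NA" in age_aliases())           # True:  the abbreviated alias for Unassigned
print("unassigned" in age_aliases())   # False: the value used in the UCDXML files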

Date/Time: Tue Jul 22 10:18:15 CDT 2014
Name: Andrew West
Report Type: Error Report
Opt Subject: U+2220 ANGLE and U+299F ACUTE ANGLE

Note: This has already been done by the editorial committee.

Suggest adding a cross-reference between the following pair of characters with similar 
meanings and very similar glyphs:

U+2220 ANGLE
U+299F ACUTE ANGLE

Also may be a good idea to add to confusables.txt.

Date/Time: Mon Jul 28 08:53:40 CDT 2014
Name: William Overington
Report Type: Other Question, Problem, or Feedback
Opt Subject: Regarding the working draft version of Unicode Technical Report #51 dated 2014-07-24, Section 6.

Regarding the working draft version of Unicode Technical Report #51 dated 2014-07-24, Section 6.
 
I suggest that the following text be substituted by the text that follows it.
 
quote
 
There is one further kind of annotation, called a TTS name, for text-to-speech
processing. For accessibility when reading text, it is useful to have a short,
descriptive name for an emoji character. A Unicode character name can often
serve as a basis for this, but its requirements for name uniqueness often ends
up with names that are overly long, such as black right-pointing double
triangle with vertical bar for ⏯. TTS names are also outside the current scope
of this document.
 
end quote
 
The following is the text that I suggest be substituted in place of the above
text, based upon the text from document L2/14-093 and from the draft dated
2014-07-24, though also including some of my own thoughts.
 
new text starts
 
There is one further kind of label, called a Localization Label. A
Localization Label could be used for producing a text-to-speech facility or
for expressing the meaning of a symbol in natural language, which could be
helpful for an abstract symbol such as "Do not tumble dry".
 
For accessibility when reading text, it is useful to have a short, descriptive
name for an emoji character. A Unicode character name can often serve as a
basis for this, but its requirements for name uniqueness often ends up with
names that are overly long, such as black right-pointing double triangle with
vertical bar for ⏯.
 
Please note that Localization Labels need to be in each user’s language to be
useful. They cannot simply be a translation of an English label, since
different words, or even different categorizations, may be what is expected in
different languages. The terms given in the data files here have been
collected from different sources. They are only initial suggestions, not
expected to be complete, and only in English.
 
Apart from mentioning the concept here, Localization Labels are outside of the
scope of this document.
 
new text ends

It may be that you will choose to refine that text further: I feel that it is
important that reference to localization is conserved. Unicode can be used to
typeset many languages and so reference to localization seems very relevant.
 
I declare an interest in that I have been for some years researching
communication through the language barrier using encoded localizable sentences
and as part of my research I have, experimentally, designed symbols for
various sentences. The symbols are mostly abstract rather than pictographic,
though there are a few pictographic elements within some of the symbols, such
as, for example, a stylized snowflake in some of the sentences that are about
the weather. So Localization Labels becoming part of Unicode would help my
research.
 
Certainly, Localization Labels would help my research, however there are also
many abstract symbols, such as "Do not tumble dry" and "Do not dry clean"
where the facility of a Localization Label could be of advantage to a person
who has not met the symbol previously.
 
Perhaps I should mention that in England, where the weather is very
changeable, often from day to day, talking about the weather is part of the
culture: it is topical, sociable and not controversial.
 
Here are just a few sentences for which I have produced symbols.
 
Yes.
 
No.
 
Good day.
 
The following question has been asked.
 
My answer is as follows.
 
I need more information in order to be able to answer.
 
It is snowing.
 
It is summer.
 
Where is a pharmacy please?
 
Where can I buy a vegan meal with no gluten-containing ingredients in it please?
 
Information Desk
 
Sculpture Gallery
 
Is there any information about the following person please?
 
The enquirer is the brother of the first person that was named.
 
The person is safe.
 
The last three sentences in the above list are from a collection of sentences
designed to help find information about a relative or friend after a disaster.
 
At present my research is by using a markup sequence to encode each sentence,
thereby increasing interoperability by avoiding using a Private Use Area
encoding: a symbol can be displayed using a special OpenType font. Certainly I
would like, indeed prefer, to be able to decode automatically directly from
markup to natural language, yet that will require new software to be written.
Decoding to a symbol using an OpenType font is something that I can do now as
I am able to make an OpenType font for the purpose using an existing
fontmaking program and then use the font in an existing desktop publishing
package.
 
Yet as emoji are developed, maybe sentences will become encoded in emoji sets,
as abstract symbols, whereupon having the feature of Localization Labels
already established in Unicode would be of advantage for interoperability.
 
William Overington
 
28 July 2014

Date/Time: Sun Aug 3 20:03:37 CDT 2014
Name: John Cowan
Report Type: Public Review Issue
Opt Subject:

This is a comment on L2/14-187, "Cherokee casing decision may break identifier syntax"

I think we have to take into account that Cherokee may not be the last script that 
becomes problematic.  History shows that when unicameral scripts become bicameral, 
the older forms tend to become the upper case.  This is true of Latin, Greek, and 
Cyrillic at least, even if many modern Cyrillic lowercase forms tend to resemble 
their uppercase prototypes.

My personal view is that using casing distinctions in this way is a Bad Thing, 
because unicameral scripts cannot be accommodated.  But it's already used in Haskell 
and Go, and maybe elsewhere.