Comments on Public Review Issues (January 31, 2008 - May 6, 2008)

The sections below contain comments received on the open Public Review Issues as of May 6, 2008, since the previous cumulative document was issued prior to UTC #114 (February 2008).

Contents:

111 Proposed Update to UTS #18 Unicode Regular Expressions
119 Proposed Update to UTR #25 Unicode Support for Mathematics
Other Reports
Feedback on Encoding Proposals
Closed Public Review Issues (mainly late 5.1 issues)

111 Proposed Update to UTS #18 Unicode Regular Expressions

Date/Time: Thu Feb 28 18:14:37 CST 2008
Name: Philippe Verdy
Report Type: Technical Report or Tech Note issues
Subject: TR18-12 §RL1.3 Requesting a lower priority of the *explicit* union operator "||"

It currently says: "Union binds more closely than intersection, set difference, or symmetric difference. Otherwise items bind from the left. (However, such binding or precedence may vary by regular expression engine. This precedence is also different than the normally-understood precedence between the corresponding mathematical operators.)"

While this is true for the most common usage of the implicit union operator (i.e. no operator at all), I think that the explicit union operator "||" (or "∪") should respect the usual mathematical priority order, so that it binds more loosely than intersection with "&&" (or "∩"). Otherwise the explicit union operator is not needed and should not be used.

The point of allowing an explicit union operator would be to avoid extra grouping within brackets and so improve readability. That readability would come more naturally if the explicit union operator kept the mathematical priority conventions, without changing the priority of the implicit operator traditionally used in level-1 Unicode sets.

I propose:

ITEM     := "[" ITEM "]"  // for grouping
OPERATOR := ""            // (implicit) no separator = union (highest priority)
         := "&&"          // intersection: A∩B (high priority)
         := "--"          // set difference: A-B (middle priority)
         := "~~"          // symmetric difference: A⊖B (middle priority)
         := "||"          // (explicit) union: A∪B (lowest priority)

(The operators are listed from most closely to most loosely binding.) The text would then say that explicit union binds more loosely than intersection, set difference, or symmetric difference.

So:

* [a && b || c && d] would be the same as [[a && b] [c && d]]
* [ab && c -- d ~~ e || f] would be the same as: [[[[[a || b] && c] -- d] ~~ e] || f]
* [a || b ~~ c -- d && ef] would be the same as: [a || [b ~~ [c -- [d && [e || f]]]]]

I would also say that the explicit union operator is not required in a level 1 implementation, but would just be an optional convenience.

I note also that the example:

* [A--B--C--D&&E] is the same as [[[[A--B]--C]--D]&&E]

is described as: "That is, take A, then remove all Bs, then all Cs, then all Ds, and intersect the result with E."

I would still prefer keeping the mathematical meaning, so that D&&E would still group more closely than the set difference, so I would make:

* [A--B--C--D&&E] is the same as [[[A--B]--C]--[D&&E]]

That is, take A, then remove all Bs, then all Cs, then remove all intersections of D with E.

I don't think that set difference (--) and symmetric difference (~~) should have different priorities, so they would bind left-to-right independently, but still with higher priority than the explicit union (||).

This would minimize the number of brackets needed for grouping.
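
The proposed priorities can be sketched as a small recursive-descent parser. This is an illustrative Python sketch only, not part of UTS #18 or of the comment itself: the names `parse`, `render` and `LEVELS` are hypothetical, operands are limited to single ASCII letters, and "--" and "~~" are assumed to share one left-to-right level, as the paragraph on set difference and symmetric difference describes. It prints the fully bracketed grouping so that results can be compared with the examples.

```python
import re

# One token: an operator, a bracket, or a single-letter operand (assumption).
TOKEN = re.compile(r"\s*(&&|--|~~|\|\||\[|\]|[A-Za-z])")

# Binary operator levels, loosest first, following the proposed priorities.
LEVELS = [("||",), ("--", "~~"), ("&&",)]


def tokenize(text):
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if not match:
            raise ValueError(f"unexpected input at offset {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def parse(text):
    """Parse a set expression into nested (operator, left, right) tuples."""
    tokens = tokenize(text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of input")
        tok = tokens[pos]
        pos += 1
        return tok

    def atom():
        tok = take()
        if tok == "[":
            node = level(0)
            if take() != "]":
                raise ValueError("expected ']'")
            return node
        if tok.isalpha():
            return tok
        raise ValueError(f"unexpected token {tok!r}")

    def sequence():
        # Implicit union (no operator) binds most closely.
        node = atom()
        while peek() is not None and (peek() == "[" or peek().isalpha()):
            node = ("", node, atom())
        return node

    def level(n):
        # Each level is left-associative over the next tighter level.
        if n == len(LEVELS):
            return sequence()
        node = level(n + 1)
        while peek() in LEVELS[n]:
            op = take()
            node = (op, node, level(n + 1))
        return node

    tree = level(0)
    if pos != len(tokens):
        raise ValueError("trailing input")
    return tree


def render(node):
    """Write the tree back with every grouping bracketed explicitly."""
    if isinstance(node, str):
        return node
    op, left, right = node
    return f"[{render(left)}{op}{render(right)}]"
```

Under these assumptions, render(parse("[A--B--C--D&&E]")) gives [[[A--B]--C]--[D&&E]], the grouping preferred above, and render(parse("[a && b || c && d]")) gives [[a&&b]||[c&&d]].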

Date/Time: Tue Mar 18 07:26:26 CST 2008
Contact: joris.van.der.geer@oracle.com
Name: Joris van der Geer
Report Type: Public Review Issue
Subject: Proposed Update Unicode Technical Standard #18

I have a remark regarding the "Proposed Update Unicode Technical Standard #18" : Unicode Regular Expressions.

Chapter 2.4, Default Loose Matches, describes support for case-insensitive matching. It may be obvious to the reader that case insensitivity applies only to scripts that have a case distinction, although this is not stated as such.

Other scripts have variation in different aspects. The Chinese script comes in two variants: Taipei is spelled either 臺北 (traditional) or 台北 (simplified).

I wonder if users of string matching and regular expression matching experience this variation as similar to case distinction. A search string can specify only one form, and normalization will not change it.

Given the global scope of Unicode, would it be natural to have loose matching regarding simplified versus traditional?

Regards,

Joris van der Geer
Software developer
(real-time pattern matching)
Oracle Netherlands
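
The kind of loose matching asked about above could be sketched as a folding step applied to both the search string and the text, much as case folding is applied for case-insensitive matching. This is a hypothetical Python illustration: the table `TRAD_TO_SIMP` holds only the one pair from the example, and the function names are assumptions. A real mapping would need a far larger table and is not strictly one-to-one.

```python
import unicodedata

# Hypothetical, deliberately tiny folding table (traditional -> simplified),
# containing only the pair from the Taipei example.
TRAD_TO_SIMP = {"臺": "台"}


def fold_variants(text):
    """Map each character to its folded form; unknown characters pass through."""
    return "".join(TRAD_TO_SIMP.get(ch, ch) for ch in text)


def loose_find(needle, haystack):
    """Return the index of the first loose match, or -1 if there is none.

    The folding is one character to one character, so the index in the
    folded text is also valid in the original text.
    """
    return fold_variants(haystack).find(fold_variants(needle))


def nfkc_equal(a, b):
    """Compare two strings after NFKC normalization only."""
    return unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b)
```

With this sketch, loose_find("台北", "我住在臺北") finds the match even though the two spellings differ, while nfkc_equal("臺北", "台北") stays false, which is the point that normalization alone does not bridge the two forms.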

119 Proposed Update to UTR #25 Unicode Support for Mathematics

Date/Time: Fri May 2 05:37:36 CDT 2008
Contact: dominikus@scherkl.de
Name: Dominikus Scherkl
Report Type: Public Review Issue
Subject: 119 Invisible plus

The invisible plus operator is already encoded in Unicode 5.1, so the text should be changed accordingly (e.g. to include the code point).

Other Reports

Date/Time: Fri Mar 7 10:02:24 CST 2008
Contact: sven.siegmund@gmail.com
Name: Sven Siegmund
Report Type: Error Report
Subject: Latin small letter D with palatal hook 1D81

The Latin small letter D with palatal hook (code point 1D81) is malformed in the font you use in the PDF for Phonetic Extensions Supplement.

Date/Time: Fri Mar 7 11:18:29 CST 2008
Contact: Henri@Solages.org
Name: Henri de Solages
Report Type: Error Report
Subject: Mongol-Uighur capital and final letters

Ulaanbaatar.

Hello.

I need Mongol-Uighur script (U1800 to U1842) in Unicode, and I was wondering why nobody seems to use Unicode for Mongol-Uighur script, so I was going to try to make such a font within the "DejaVu" font project.

But there is a major problem. Whereas western scripts have 2 cases (capital and small letters), the Mongol-Uighur script has 3 cases: capital, central and final, even if, for some letters (ts, ch), the capital and central forms are identical and, for very few letters, the central and final forms are identical. Capital letters are used at the beginning of each word, final ones at the end of each word, and central ones are the ordinary small letters. Moreover, a few letters (a/e) have 2 different final forms, which usually cannot be interchanged (some words use one form, some the other).

On the other hand, describing the Mongol-Uighur letters by their present pronunciation (often corresponding to the Cyrillic alphabet), though traditional, is not completely appropriate, since some "letters" (o and u, ö and ü) differ only in their present pronunciation and Cyrillic transliteration, but share exactly the same triplet of glyphs. (Their pronunciations are not very different, so that there is sometimes hesitation in the Cyrillic transliteration. Were they one sound when the script was decided in the 13th century?) U1823 and U1824 are one letter, as well as U1825 and U182. Some letters differ only in their capital forms (a and e, g and kh, o/u and ö/ü; I write "kh" for what your chart writes as "qa"), not in the other forms, so that using 6 glyphs instead of 4 would be a waste.

Conversely, some letters considered in Cyrillic as one are 2 letters in Mongolian-Uighur (male kh and female kh, male g and female g), with no common glyph in the capital, central or final forms. And they have clearly different pronunciations. U182C and U182D should be split into 2. There are also a few problems with syllables: the "n" and the "g" take dots or not according to the following letter, which may be regarded as a (compulsory) ligature question, but needs other glyphs.

Another difficulty is the rare case of one-vowel words, where the glyph may have characteristics of a capital plus characteristics of a final letter, making a 4th glyph. For instance, there are three words of one letter transliterated as "aa", with 2 different spellings (according to the meaning), at least one of which is a capital-final a. Some of these forms are present in your chart, though they are not the ordinary "small" letters (for instance U1820 to U1822) as the chart claims.

However, for the moment, nearly all your letters are described as "small", even if some of the glyphs presented as examples in your PDF file are clearly capital letters (U1823, U1826, U1828 for instance). Or are the missing glyphs somewhere else?

So I suggest that you add as many glyphs as required (and no more) to describe the Mongolian-Uighur script, adding "capital letter" and "central letter" glyphs, as well as, when necessary, one-vowel word glyphs. Maybe you can assign 2 different numbers to identical glyphs of very different meaning (final n and final a/e)? Another approach, which is not the one of Unicode, would have been to describe not letters but letter parts, which would be economical because often the capital letter differs from the central letter only by the addition of a "titim" (diadem). This approach is indeed used to teach children prior to the teaching of the letters themselves. But the same is true for the Latin and Cyrillic alphabets (in many methods, six-year-old children begin by drawing loops, teeth etc., then letters with such components).

Before fixing this big problem, you should at least replace the false "small" letter graphics of your chart by ordinary central letters.

As for me, what should I do? Draw only the small letters for DejaVu?

Yours sincerely.

Date/Time: Fri Mar 21 06:15:49 CST 2008
Contact: program.spe@home.pl
Name: Christopher Yeleighton
Report Type: Error Report
Subject: Inconsistent links to Roman numerals

LATIN CAPITAL LETTER I is linked to ROMAN NUMERAL ONE but LATIN CAPITAL LETTER V is not linked to ROMAN NUMERAL FIVE. Why?

Feedback on Encoding Proposals

Date/Time: Mon Apr 14 18:16:30 CDT 2008
Contact: cowan@ccil.org
Name: John Cowan
Report Type: Feedback on an Encoding Proposal
Subject: L2/08-140

I suggest changing the block name from "Rumi Symbols" to "Rumi Numbers".

Date/Time: Mon May 5 17:45:38 CDT 2008
Contact: billposer@alum.mit.edu
Name: Bill Poser
Report Type: Feedback on an Encoding Proposal

In the proposal for the addition of characters to the Canadian Aboriginal Syllabics dated 2008-04-12 (N3437, L2/08-149), several characters have names that inexplicably include "Carrier". These are:

None of these characters is or has been used for Carrier. Carrier writes /Cai/ as Ca-i. Moreover, the name "Canadian Syllabics Carrier GU" is already assigned to U+15EF. Note that I am only questioning the names. The characters themselves are, I believe, used for other languages and if so should be added. It is just the Carrier connection that is problematic.

Date/Time: Tue May 6 08:19:52 CDT 2008
Contact: carrier@worldonline.dk
Name: Jesper Willemoes Hansen
Report Type: Feedback on an Encoding Proposal
Subject:

Hello!

I'm very happy to see that Unicode 5.1 now includes mahjong tiles, and I look forward to updated fonts that include these new characters. However, the red fives of Japanese mahjong are missing from Unicode. I realize that this is a color change, but they are a separate tile type that should remain distinguishable when copied from one document type to another, so that they are not changed in the process. Such a change could alter the meaning of what is written.

I've included a couple of links to illustrate the missing fives.

- Kind regards, Jesper Willemoes Hansen

Closed Public Review Issues

Date/Time: Sun Mar 2 03:00:56 CST 2008
Contact: charles@agenoria.fsnet.co.uk
Name: Charles Cox
Report Type: Error Report
Subject: U+2E2F VERTICAL TILDE

Unicode Standard Annex #31 states that Identifier Characters and Pattern_Syntax Characters form disjoint sets.

However, in PropList-5.1.0d25.txt, U+2E2F VERTICAL TILDE is listed as having the property Pattern_Syntax, and in DerivedCoreProperties-5.1.0d25.txt the same codepoint is listed as having the properties ID_Start, ID_Continue, XID_Start and XID_Continue.

If I have understood Unicode Standard Annex #31 correctly, this appears to be an error.

Charles Cox
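
A check of the kind described above can be sketched by parsing property lines in the usual UCD layout (code point or range, semicolon, property name, optional "#" comment) and intersecting the resulting sets. This is an illustrative Python sketch: `SAMPLE_LINES` is a hypothetical excerpt modelled on the report, not the actual content of the draft files, and the function names are assumptions.

```python
# Hypothetical excerpt in UCD property-file layout, modelled on the report.
SAMPLE_LINES = [
    "# comment lines and blank lines are skipped",
    "",
    "0041..005A    ; ID_Start # LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z",
    "2E2F          ; ID_Start # VERTICAL TILDE",
    "0021..002F    ; Pattern_Syntax # EXCLAMATION MARK..SOLIDUS",
    "2E2F          ; Pattern_Syntax # VERTICAL TILDE",
]


def parse_property_lines(lines):
    """Return {property name: set of code points} from UCD-style lines."""
    props = {}
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue
        cp_field, prop = (field.strip() for field in data.split(";")[:2])
        first, _, last = cp_field.partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start
        props.setdefault(prop, set()).update(range(start, end + 1))
    return props


def overlap(props, name_a, name_b):
    """Return the sorted code points that carry both properties."""
    return sorted(props.get(name_a, set()) & props.get(name_b, set()))
```

For the sample data, overlap(parse_property_lines(SAMPLE_LINES), "Pattern_Syntax", "ID_Start") returns the single code point 0x2E2F. Running the same check against the real property files would need the lines from both files to be combined first.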

Date/Time: Mon Mar 3 10:16:10 CST 2008
Contact: alan.wood@justis.com
Name: Alan Wood
Report Type: Error Report
Subject: DerivedAge-5.1.0d12.txt

In this file:

I think the year in this line:

# Newly assigned in Unicode 5.1.0 (expected March, 2006)

should be 2008, not 2006.

Date/Time: Thu Mar 6 03:25:30 CST 2008
Contact: mark.dalley@somersetpct.nhs.uk
Name: Mark Dalley
Report Type: Technical Report or Tech Note issues
Subject: Typo in TR29-12 draft section 3

I refer to the following snippet from section 3 of the above draft:
....start >>>>>>>>>>>>>>>>>>> Grapheme Cluster Boundary Rules

The same rules are used for both default grapheme clusters and extended default grapheme clusters, with one exception. The latter adds rule 9b, while the former omits it.
....end >>>>>>>>>>>>>>>>>>>

I presume from reading the text of the GB rules that the above text is referring to rule 9a.

Regards

Mark Dalley Somerset PCT

Date/Time: Fri Mar 7 01:17:55 CST 2008
Contact: duerst@it.aoyama.ac.jp
Name: Dürst Martin
Report Type: Technical Report or Tech Note issues
Subject: Clarification in Bidi UAX

I propose to add more examples in http://www.unicode.org/reports/tr9/#N1. Currently, there are the following examples:

R  N  R  → R  R  R
L  N  L  → L  L  L
R  N  AN → R  R  AN
AN N  R  → AN R  R
R  N  EN → R  R  EN
EN N  R  → EN R  R

I propose to add one or more of the following examples:

EN N  EN  → EN R  EN
EN N  AN  → EN R  AN
AN N  EN  → AN R  EN
AN N  AN  → AN R  AN

I am proposing this because I made a mistake when implementing the bidi algorithm for http://www.w3.org/International/iri-edit/BidiExamples.html. I wasn't sure whether these cases were included, and took the wrong turn. Adding one or more of these examples should help future implementers avoid this mistake.

Regards, Martin.
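
The cases listed above can be sketched in code. This is a simplified Python illustration of rule N1 alone, as the examples in this comment describe it: a run of neutrals takes the direction of the surrounding text when both sides agree, with EN and AN acting as R. The function name `resolve_n1` and the use of plain strings for bidi types are assumptions, and the sketch ignores run boundaries (sor/eor), embedding levels and every other rule of the algorithm.

```python
def resolve_n1(types):
    """Apply a simplified rule N1 to a list of bidi types ('N' = neutral)."""

    def influence(bidi_type):
        # EN and AN influence neighbouring neutrals as if they were R.
        if bidi_type in ("R", "EN", "AN"):
            return "R"
        if bidi_type == "L":
            return "L"
        return None

    out = list(types)
    i = 0
    while i < len(out):
        if out[i] != "N":
            i += 1
            continue
        # Find the end of this maximal run of neutrals.
        j = i
        while j < len(out) and out[j] == "N":
            j += 1
        before = influence(out[i - 1]) if i > 0 else None
        after = influence(out[j]) if j < len(out) else None
        # Resolve the run only when both sides agree on a direction.
        if before is not None and before == after:
            out[i:j] = [before] * (j - i)
        i = j
    return out
```

With this sketch, resolve_n1(["EN", "N", "AN"]) returns ["EN", "R", "AN"], matching the second proposed example, while a run between L and R is left unresolved because the two sides disagree.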

Date/Time: Mon Mar 17 18:18:58 CST 2008
Contact: andrewcwest@gmail.com
Name: Andrew West
Report Type: Error Report
Subject: UCD in XML for Unicode 5.1.0

A couple of errors that I have noticed in ucd.nounihan.flat.xml.

Also, will the next version of TR42 define an attribute corresponding to the formal aliases listed in NameAliases.txt? The omission of this information seems to be the one major shortcoming of the XML files.

1. <standardized-variants> This section omits the following entries (i.e. where there are multiple variants of the same base character plus VS depending upon position):

<standardized-variant cps="182C 180B" desc="second form" when="initial medial"/>
<standardized-variant cps="182D 180B" desc="second form" when="initial medial"/>
<standardized-variant cps="1874 180B" desc="second form" when="medial"/>
<standardized-variant cps="1874 180C" desc="feminine first medial form" when="medial"/>

2. <normalization-corrections> The order of entries appears to be incorrectly sorted, so that the entry for F951 is last, rather than first as in NormalizationCorrections.txt.

<normalization-corrections>
<normalization-correction cp="2F868" old="2136A" new="36FC" version="4.0.0"/>
<normalization-correction cp="2F874" old="5F33" new="5F53" version="4.0.0"/>
<normalization-correction cp="2F91F" old="43AB" new="243AB" version="4.0.0"/>
<normalization-correction cp="2F95F" old="7AAE" new="7AEE" version="4.0.0"/>
<normalization-correction cp="2F9BF" old="4D57" new="45D7" version="4.0.0"/>
<normalization-correction cp="F951" old="96FB" new="964B" version="3.2.0"/>
</normalization-corrections>
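
One plausible cause of this ordering, which is an assumption and not something stated in the report, is that the entries were sorted as text and not by numeric code point value: as a string, "F951" sorts after every value beginning with "2". A minimal Python sketch of the difference, using the `cp` values from the listing above (the names `CPS` and `by_code_point` are hypothetical):

```python
# The cp attribute values from the listing, in the order they appear there.
CPS = ["2F868", "2F874", "2F91F", "2F95F", "2F9BF", "F951"]


def by_code_point(cp):
    """Sort key: the numeric value of a hexadecimal code point string."""
    return int(cp, 16)


# Plain string comparison keeps F951 last, as in the XML file.
text_order = sorted(CPS)

# Numeric comparison puts F951 first, since 0xF951 is below 0x2F868.
numeric_order = sorted(CPS, key=by_code_point)
```

Here text_order ends with "F951" and numeric_order begins with it.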

Date/Time: Wed Mar 19 18:40:43 CST 2008
Contact: daniel.buenzli@erratique.ch
Name: Daniel Bünzli
Report Type: Technical Report or Tech Note issues
Subject: UTR #42

Is there any particular reason for the attributes representing code points or code point sequences in the elements 'block', 'named-sequence', 'normalization-correction' and 'standardized-variants' to be of type 'text' instead of the data types defined for code points in 2.3?

Best,

Daniel Bünzli