L2/07-033

Comments on Public Review Issues
(November 7, 2006 - January 31, 2007)

The sections below contain comments received on the open Public Review Issues as of January 31, 2007, since the previous cumulative document was issued prior to UTC #109 (November 2006).

Contents:

75 Proposed Update UTR #25, Unicode Support for Mathematics
96 Allowing Joiner Characters in Identifiers
97 Proposed Draft UTR #38, A User's Guide to the Unihan Database
99 Proposed Draft UTR #33, Unicode Conformance Model
100 Giving U+00B7 MIDDLE DOT the ID_Continue Property
101 Proposal to Encode an External Link Sign
102 Proposed Update to UAX #15: Unicode Normalization Forms
103 Proposed Update to UAX #29: Text Boundaries
104 Proposed Update to UAX #31: Identifier and Pattern Syntax

Other Reports


75 Proposed Update UTR #25, Unicode Support for Mathematics

No feedback was received via the reporting form this period.

96 Allowing Joiner Characters in Identifiers

Date: Fri, 5 Jan 2007 14:43:08 -0800
From: "Cibu C J" cibucj@gmail.com

The relevant text is below:

B. ZWJ in the following contexts: In a conjunct context.
That is, a sequence of the form:
    * A Letter, followed by zero or more combining marks, followed by
a Virama, followed by a ZWJ, followed by zero or more combining marks,
followed by a Letter.
    * As a regular expression:
      /$L $M* $V ZWJ $M* $L/
      where:
          o $L = [:General_Category=Letter:]
          o $M = [:General_Category=Mark:]
          o $V = [:Canonical_Combining_Class=Virama:]

This does not cover the case of a Chillu letter at the end of a word. The B1 regular expression should therefore be more inclusive: /$L $M* $V ZWJ $M*/

BTW, I don't know of any combining marks in Malayalam. Does more than zero $M make sense in the case of Malayalam? I agree this is a general regular expression and may be applicable to other scripts. I was just wondering which ones they are.

Thanks,
Cibu
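
The difference between the two patterns discussed in the message above can be sketched in Python. This is an illustration only, not the PRI's definition: it approximates $L and $M by General_Category prefixes and $V by Canonical_Combining_Class 9, using the standard unicodedata module. The function names are hypothetical, and the test strings are assumed examples (a word ending in NA + virama + ZWJ, and the same sequence followed by another letter).

```python
import unicodedata

ZWJ = "\u200D"

def is_letter(ch):
    return unicodedata.category(ch).startswith("L")   # approximates $L

def is_mark(ch):
    return unicodedata.category(ch).startswith("M")   # approximates $M

def is_virama(ch):
    return unicodedata.combining(ch) == 9              # approximates $V

def matches_at(text, i, require_final_letter):
    """True if $L $M* $V ZWJ $M* (optionally followed by $L) matches at index i."""
    n = len(text)
    if i >= n or not is_letter(text[i]):
        return False
    j = i + 1
    while j < n and is_mark(text[j]):
        j += 1
    # text[i+1:j] is the run of marks after the letter; a virama is itself a
    # mark, so it must be the last member of that run, directly before ZWJ.
    if j == i + 1 or j >= n or text[j] != ZWJ or not is_virama(text[j - 1]):
        return False
    k = j + 1
    while k < n and is_mark(text[k]):
        k += 1
    if require_final_letter:
        return k < n and is_letter(text[k])
    return True

def contains_zwj_context(text, require_final_letter):
    return any(matches_at(text, i, require_final_letter) for i in range(len(text)))

word_final = "\u0D05\u0D35\u0D28\u0D4D\u200D"   # ... NA + virama + ZWJ at end of word
word_medial = "\u0D28\u0D4D\u200D\u0D31"        # NA + virama + ZWJ + RRA

print(contains_zwj_context(word_final, require_final_letter=True))
print(contains_zwj_context(word_final, require_final_letter=False))
```

With the trailing $L required, the word-final sequence is not recognized; dropping it, as the message suggests, accepts both strings.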

97 Proposed Draft UTR #38, A User's Guide to the Unihan Database

Date/Time: Fri Jan 5 12:51:05 CST 2007
Contact: cowan@ccil.org
Name: John Cowan
Report Type: Public Review Issue
Opt Subject: 97 Proposed Draft UTR #38, A User's Guide to the Unihan Database

Under kCantonese, the link to the jyutping writeup is broken.

The kFourCornerCode writeup is indented too much.

The explanations of final digits in the various dictionary indices do not make clear what is done with virtual characters that fall between pages: are they assigned to the preceding page (as I suppose) or the following page? I suggest using the language "following the Nth character".

In kSBCY, spell out "fǎnqiè" for those of us who can't read ideographs.

In kSemanticVariants, "and may be subdivided itself into substrings by commas, each of which may be divided into two pieces by a colon" is poorly worded; it sounds as if each comma may be divided by a colon (!). Reword "subdivided itself by commas into substrings", or preferably more extensively for clarity.

In kXerox, I take "Xerox" to refer to XCSC, but this should be clarified.

I suggest a very brief explanation of what the dropped fields meant and what should be used instead.

99 Proposed Draft UTR #33, Unicode Conformance Model

No feedback was received via the reporting form this period.

100 Giving U+00B7 MIDDLE DOT the ID_Continue Property

Date/Time: Tue Jan 30 12:58:18 CST 2007
Contact: Antoine10646@Leca-Marti.org
Name: Antoine Leca
Report Type: Public Review Issue
Opt Subject: Giving U+00B7 MIDDLE DOT the ID_Continue Property

Dear Sirs:

Please do so.

The change is as simple as giving U+00B7 the property Other_ID_Continue. Changing the "Po" category might also be a possible and effective solution, however it is probably much less easy to enact; still, I consider it should be done if at all possible.

This change has already been made in the ISO standard for one of the main programming languages, namely ISO/IEC 1989:2003 (COBOL).

Despite the "legacy" qualification used in UAX31, this character is the one used to encode the orthographic feature invented by Pompeu Fabra at the beginning of the 20th century. U+002E . is used concurrently for the same purpose, but it is clear to anybody that the latter cannot be used in identifiers; furthermore, using U+002E is seen as inferior and can readily be qualified as "legacy". Also regarding "legacy": the alternative, i.e. using U+013F Ŀ and U+0140 ŀ, was dropped in Spain around 1980, when usage and typewriter keyboard layouts evolved from occasionally having a Ŀ key (usually in the lower right corner) to the present situation, which is acknowledged in the names list of the Unicode Standard by the compatibility mappings of 013F and 0140. Today essentially all keyboards in Spain (the layout differs from the one used in Latin America) show the "punt mig" or "punt volat" (flying dot) in the shift position of the 3 key.

Here it is important to remember that Spain has more than 40 million inhabitants, with a high standard of living, which means computers are very common; and Catalan is the official language of about a quarter of Spain (Catalonia, Valencia and Balears), where it is taught and used in business and legal affairs.

When I asked ten years ago (while preparing C99) about the reason for this exception, the only reason I was given was that for Americans · means the multiplication operator, and that this character should therefore be avoided. (I saw this as much the same as prohibiting Ø because it could be interpreted as the empty set; however, I am a young European engineer without a university diploma, so I could not argue at the time.) I now read that UAX31 addresses this issue and firmly recommends U+2219 or U+22C5 for this mathematical use, so this should not be an issue any more.

The handling of this character has historically been chaotic (for example, when Catalonia enabled the use of Catalan for the Civil Register, a bug prevented the registration of names containing ·, despite it being used frequently in first names). These kinds of stories, along with those about Ñ, which are also quite frequent here in Spain, are often a basis for bashing the poorly internationalized applications sold in foreign countries. It should not be the same for Unicode, at any rate if it is possible to avoid it; and here it seems pretty clear that the easiest and cleanest move is to make the U+00B7 character a possible ID_Continue.

Best regards,

-- Antoine Leca
Corbera (Valencia, Spain)
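
The effect of the proposal in the message above can be sketched in Python. This is a simplified illustration, assuming that ID_Start is approximated by the General_Category values L* and Nl and that ID_Continue adds Mn, Mc, Nd and Pc; the real properties also include further characters. The function name and the other_id_continue parameter are hypothetical, and the Catalan word is an assumed example.

```python
import unicodedata

# Simplified approximations of the default identifier classes.
START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
CONTINUE_CATEGORIES = START_CATEGORIES | {"Mn", "Mc", "Nd", "Pc"}

MIDDLE_DOT = "\u00B7"

def is_identifier(s, other_id_continue=frozenset()):
    """Check s against <start> <continue>*, with an optional set of extra
    continue characters standing in for Other_ID_Continue."""
    if not s or unicodedata.category(s[0]) not in START_CATEGORIES:
        return False
    return all(
        unicodedata.category(ch) in CONTINUE_CATEGORIES or ch in other_id_continue
        for ch in s[1:]
    )

word = "col\u00B7lecci\u00F3"   # Catalan word written with U+00B7

print(unicodedata.category(MIDDLE_DOT))           # general category of U+00B7
print(is_identifier(word))                         # rejected by category alone
print(is_identifier(word, {MIDDLE_DOT}))           # accepted once U+00B7 is listed
print(unicodedata.normalize("NFKD", "\u0140"))     # compatibility mapping of U+0140
```

Listing the character explicitly leaves its "Po" general category untouched, which is why the message describes this route as the simpler of the two.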

101 Proposal to Encode an External Link Sign

See also the text of the PRI. No feedback was received via the reporting form this period.

102 Proposed Update to UAX #15: Unicode Normalization Forms

No feedback was received via the reporting form this period.

103 Proposed Update to UAX #29: Text Boundaries

No feedback was received via the reporting form this period.

104 Proposed Update to UAX #31: Identifier and Pattern Syntax

No feedback was received via the reporting form this period.


Other Reports

Date/Time: Mon Nov 13 14:25:02 CST 2006
Contact: lorna_priest@sil.org
Name: Lorna Priest
Report Type: Error Report
Opt Subject: shape of 027F in the charts

In the Unicode book U+027F LATIN SMALL LETTER REVERSED R WITH FISHHOOK and U+027E LATIN SMALL LETTER R WITH FISHHOOK are the same height and shape, just mirrored. In Chinese publications (practically the only place where you ever see these symbols) the stem of U+027F LATIN SMALL LETTER REVERSED R WITH FISHHOOK invariably extends below the baseline.

I will be happy to provide samples if that is necessary.

Lorna

Date/Time: Fri Dec 15 12:29:45 CST 2006
Contact: paivakil@gmail.com
Name: Mahesh T. Pai
Report Type: Error Report
Opt Subject: Error in description of U+0D4D

Malayalam codechart (U0D00.pdf) describes the character U+0D4D as = vowel half-u.

However, this is not correct. Certain people represent pure consonants appearing at word endings as half-a, while another group prefers to call it the half-u. The school of thought which describes this semi-pure consonant as half-a uses (consonant + chandrakkala) to refer to this half-a.

But the other school, which calls this the `samvruthokaram' (samvrith + u karam), uses (consonant + ukar/U+0D41 + U+0D4D) to represent this sequence.

The differences between the two schools of writing are irreconcilable, and have existed for several centuries.

Please see 2.2 of L2 05/372 submitted by me and the authorities cited therein.

There is another problem with using chandrakkala to represent half a or half u: it represents the pure, vowelless consonant. The samvruthokaram, whether defined as half u or half a, is a semi-pure consonant. This, IMHO, violates the `one sequence, one linguistic value' principle.

While people who prefer to refer to the samvruthokaram as half a will use the explicit virama form of a consonant, under no circumstances should a (consonant + chandrakkala) sequence refer to a half u form. Even under the present encoding scheme, the half u form can be represented using (consonant + vowel sign u + chandrakkala).

The present description of U+0D4D is unnecessarily fettering a particular school of thought which prefers to represent samvruthokaram as (consonant + vowel sign u + chandrakkala).

Date/Time: Mon Jan 1 17:45:26 CST 2007
Contact: mike@fuhr.org
Name: Michael Fuhr
Report Type: Error Report
Opt Subject: Glossary definition of Trailing Consonant

In the Glossary of The Unicode Standard, Version 5.0, on p. 1147, one of the definitions for Trailing Consonant is "(1) In Korean, a jamo character with the Hangul_Syllable_Type property value Vowel_Jamo...." Per the referenced definition D113 (p. 118) and per HangulSyllableType.txt for the indicated range (U+11A8..U+11F9), shouldn't "Vowel_Jamo" be "Trailing_Jamo"?
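
A rough cross-check of the report above can be sketched in Python. The standard unicodedata module does not expose Hangul_Syllable_Type, so this sketch assumes that character names (CHOSEONG, JUNGSEONG, JONGSEONG) and canonical decomposition are an acceptable proxy; the function name jamo_role is hypothetical.

```python
import unicodedata

def jamo_role(ch):
    """Classify a conjoining jamo by its character name (a proxy only)."""
    name = unicodedata.name(ch, "")
    if "CHOSEONG" in name:
        return "leading"
    if "JUNGSEONG" in name:
        return "vowel"
    if "JONGSEONG" in name:
        return "trailing"
    return None

# The range cited in the report.
roles = {jamo_role(chr(cp)) for cp in range(0x11A8, 0x11F9 + 1)}
print(roles)

# A syllable with a final consonant decomposes into leading, vowel, trailing.
decomposed = unicodedata.normalize("NFD", "\uAC01")
print([f"U+{ord(ch):04X} {jamo_role(ch)}" for ch in decomposed])
```

By this proxy every character in U+11A8..U+11F9 is a final consonant, which is consistent with the reporter's reading.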

Date/Time: Mon Jan 15 14:46:04 CST 2007
Contact: now@bitwi.se
Name: Nikolai Weibull
Report Type: Other Question, Problem, or Feedback
Opt Subject: Interpretation of casing of I/i in Lithuanian locale

Hi!

In SpecialCasing.txt we have the following:

# Lithuanian retains the dot in a lowercase i when followed by accents.

# Remove DOT ABOVE after "i" with upper or titlecase

0307; 0307; ; ; lt After_Soft_Dotted; # COMBINING DOT ABOVE

# Introduce an explicit dot above when lowercasing capital I's and J's
# whenever there are more accents above.
# (of the accents used in Lithuanian: grave, acute, tilde above, and ogonek)

0049; 0069 0307; 0049; 0049; lt More_Above; # LATIN CAPITAL LETTER I

As I see it, lowercase(0049 0307) = 0069 0307 0307, even though that may seem rather weird. It may be that 0307 isn't to be considered part of More_Above in this particular instance, if one is to interpret the parenthesis as referring to the first statement. However, if this is the case, then More_Above is a poor choice for a condition, and the rule should really have its very own condition with a proper name, e.g., More_Lt_Accents_Above.

My interpretation is that lowercase(0049 0307) = 0069 0307 0307, but it seems that ICU disagrees with me.

Please clarify what lowercase(0049 0307) in the lt locale should be.

Thanks!

nikolai
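
The literal reading described in the message above can be sketched in Python. This is only an illustration of that reading, not a statement of the intended behavior: it assumes More_Above means "followed by a character of combining class 230, with no intervening character of combining class 0 or 230", and it applies only the single rule for U+0049 quoted above. The function names are hypothetical.

```python
import unicodedata

def more_above(text, i):
    """Assumed More_Above test for the character at index i."""
    for ch in text[i + 1:]:
        ccc = unicodedata.combining(ch)
        if ccc == 230:
            return True
        if ccc == 0:
            return False
    return False

def lt_lower_literal(text):
    """Lowercase text, applying only the quoted rule for U+0049 literally."""
    out = []
    for i, ch in enumerate(text):
        if ch == "\u0049" and more_above(text, i):
            out.append("\u0069\u0307")   # 0049 -> 0069 0307 under More_Above
        else:
            out.append(ch.lower())
    return "".join(out)

def code_points(s):
    return " ".join(f"{ord(ch):04X}" for ch in s)

print(code_points(lt_lower_literal("\u0049\u0307")))
print(code_points(lt_lower_literal("\u0049\u0301")))
print(code_points(lt_lower_literal("\u0049")))
```

Because U+0307 itself has combining class 230, this reading treats it as an accent above and yields a doubled dot, which is the result the message asks to have confirmed or ruled out.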

Date/Time: Thu Feb 1 15:07:18 CST 2007
Contact: jhall@adobe.com
Name: Jerry Hall
Report Type: Error Report
Opt Subject: UAX#14

Source: The Unicode 5.0 Standard (Hardcover Book)

1. p1289, last paragraph in section X3.1; first sentence contains the phrase "line break takes" which doesn't make sense. I believe the word "takes" should be deleted.

2. p1315, rule LB4; the text doesn't match the regular expression. I'm puzzled why rules LB4 and LB5 aren't combined.

3. p1322, enum break_action; member EXPLICTI_BRK is incorrectly spelled.

4. p1323, for loop condition; text &amp;&amp; should be &&.

5. p1323, comment // if context is A SP * B; I believe the asterisk symbol (*) is supposed to mark the position in the context, but since it is appearing in a regular expression it could be construed as meaning: zero or more occurrences of SP.

6. p1322, p1323, p1324, p1325, p1326, code segments; the presentation of code in separate segments on different pages makes it hard to follow the algorithm. Moreover, it is not clear if the code in section 7.7 should just replace the comment referencing it on p1323, or whether it replaces more lines. I believe it replaces more lines including the for loop.

7. p1323, the findLineBreak function; it appears this function will fail if the first element of pcls contains SP and the second element contains a class that is handled by the brkPairs table. What happens is that the code will eventually attempt to do the lookup in the brkPairs table with cls set to SP. Since SP isn't a valid index into the table the access will be invalid and could return the wrong value or cause the program to fault. (I verified this with code I downloaded (linebrk.cpp) from the Unicode website.)

8. p1327, paragraph before section X8.2; the reference [Cedar97] doesn't appear in UAX#41 on p1411.
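
The failure mode described in item 7 can be sketched in Python. This is not the code from the book or from linebrk.cpp: the pair table is a hypothetical two-class excerpt with illustrative entries, SP deliberately has no row or column, and the skip_leading_spaces guard is just one possible remedy chosen for illustration.

```python
SP = "SP"

# Hypothetical, heavily reduced pair table keyed by (class before, class after).
# "%" and "^" are placeholder entries; SP is intentionally absent.
BRK_PAIRS = {
    ("AL", "AL"): "%",
    ("AL", "OP"): "%",
    ("OP", "AL"): "^",
    ("OP", "OP"): "^",
}

def table_lookups(pcls, skip_leading_spaces=False):
    """Return the table entries consulted while walking the class list."""
    start = 0
    if skip_leading_spaces:
        # Illustrative guard: never let the running class start out as SP.
        while start < len(pcls) and pcls[start] == SP:
            start += 1
    if start >= len(pcls):
        return []
    cls = pcls[start]
    entries = []
    for nxt in pcls[start + 1:]:
        if nxt == SP:
            continue                 # spaces never become the running class
        entries.append(BRK_PAIRS[(cls, nxt)])   # KeyError if cls is SP
        cls = nxt
    return entries

print(table_lookups(["AL", SP, "OP"]))
try:
    table_lookups([SP, "AL"])
except KeyError as exc:
    print("invalid lookup:", exc)
print(table_lookups([SP, "AL", "AL"], skip_leading_spaces=True))
```

A Python dictionary raises KeyError where an out-of-range array index in C would read arbitrary memory, so the sketch makes the invalid access visible rather than silent.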