L2/11-023

Comments on Public Review Issues
(October 28, 2010 - February 3, 2011)

The sections below contain comments received on the open Public Review Issues and other feedback as of February 2, 2011, since the previous cumulative document was issued prior to UTC #125 (November 2010).

Contents:

174 Proposed Draft UTR #49: Unicode Character Categories
177 Proposed Update UTS #46: Unicode IDNA Compatibility Processing
Feedback on Encoding Proposals
Closed Public Review Issues
Other Reports: Indic
Other Reports: Han
Other Reports: Misc


174 Proposed Draft UTR #49: Unicode Character Categories

Date/Time: Wed Dec 8 23:46:30 CST 2010
Contact: cewcathar@hotmail.com
Name:
Report Type: Public Review Issue
Opt Subject: Also data file for tr 49

Hi. Regarding http://www.unicode.org/reports/tr49/tr49-1.html

and particularly the data file at: http://www.unicode.org/reports/tr49/Categories.txt

I was going to wait till I had all my comments together but as I started going through data a couple of character classifications struck me right away:

00A2 Sc [Punctuation] [X] [X] [X] CENT SIGN

{ hmm why is this not classified as currency too like the pound sign? is this an error or am I confused? }

007E Sm [Symbol] [Math] [X] [X] TILDE

{ you list this as a math symbol but it also looks like the diacritic on Spanish palatal n (pronounced "enye") written as an "n" with a tilde; there is another combining tilde of course; but combining or not isn't a diacritic a diacritic regardless? }

Thanks.

Best wishes to you all for the Holidays enjoy your San Francisco rain (if that is where you are) as much as we are enjoying our cold here in Florida. And sorry to be a bug -- I will try to get to the whole database eventually but I obviously am someone who is still learning the code charts.

Sincerely,

--C. E. Whitehead
cewcathar@hotmail.com

Date/Time: Thu Dec 9 00:10:07 CST 2010
Contact: cewcathar@hotmail.com
Name:
Report Type: Public Review Issue
Opt Subject: Also data file for tr 49

Hi. Once more regarding http://www.unicode.org/reports/tr49/tr49-1.html

and data file at: http://www.unicode.org/reports/tr49/Categories.txt

Here are all three symbols I want to comment on for now (I added the exclamation point which I use sometimes as a mathematical/logical symbol for not and not just as punctuation just as tilde means more than just "similar" to me):

00A2 Sc [Punctuation] [X] [X] [X] CENT SIGN

{ hmm why is this not classified as currency too like the pound sign? is this an error or am I confused? }

007E Sm [Symbol] [Math] [X] [X] TILDE

{ you list this as a math symbol but it also looks like the diacritic on Spanish palatal n (pronounced "enye") written as an "n" with a tilde; there is another combining tilde of course; but combining or not isn't a diacritic a diacritic & so maybe there's a need for multiple categorization schemes with diff categories for a single character; see also below my comment on the exclamation point, often also used to mean not }

0021 Po [Punctuation] [X] [X] [X] EXCLAMATION MARK

{ perhaps this category could be further divided: period/full stop, comma, semi colon, colon for Western punctuation; quotations are separate and set off quotations; not sure where dashes and parentheses should go in terms of further subcategorization -- but this is also a logical not and so a mathematical symbol in some sense as much as tilde so again the issue: is there going to be more than one set of categories -- I know you've left your database open to revision but are you going to have a second database with a slightly different categorization scheme side-by-side with this one in case someone wants alternatives? sorry to be a pain like the list is sometimes but this sort of stuff will drive you insane trying to get it right but this is it for the evening as my laptop battery is full now }

Best,

--C. E. Whitehead
cewcathar@hotmail.com

Date/Time: Sat Dec 18 13:08:36 CST 2010
Contact: cewcathar@hotmail.com
Name:
Report Type: Public Review Issue
Opt Subject: tr 49 data file

Again regarding http://www.unicode.org/reports/tr49/tr49-1.html

and regarding the data file: http://www.unicode.org/reports/tr49/Categories.txt

I am not sure about the category number symbol. I think you need something more specific (decimal separators should get their own category and I'd be happy to collect all these for you -- the commas and the full stops and then the actual decimal separators -- you have only 3 of the latter that I can find).

Nor am I sure about the category "star, asterisk, snowflake" which again could be three categories -- and then I think none of these three categories apply to sparkles and florettes:

2728 So [Symbol] [Dingbat] [Star; asterisk; snowflake] [X] SPARKLES
273E So [Symbol] [Dingbat] [Star; asterisk; snowflake] [X] SIX PETALLED BLACK AND WHITE FLORETTE
273F So [Symbol] [Dingbat] [Star; asterisk; snowflake] [X] BLACK FLORETTE
2740 So [Symbol] [Dingbat] [Star; asterisk; snowflake] [X] WHITE FLORETTE
2741 So [Symbol] [Dingbat] [Star; asterisk; snowflake] [X] EIGHT PETALLED OUTLINED BLACK FLORETTE
2747 So [Symbol] [Dingbat] [Star; asterisk; snowflake] [X] SPARKLE
2748 So [Symbol] [Dingbat] [Star; asterisk; snowflake] [X] HEAVY SPARKLE

You might want three separate cats for the last and the above might go in something else (florette? sparkle?).

Some categories overlap as I guess I noted previously -- mainly the slashes, asterisks, full stops, and commas, which are both operational math or part of expressions (full stops and commas are used to make math expressions -- also I am not sure about the slash in 3/4 which I do not see as an operator but as used to make an expression).

Also tilde is both a diacritic and a math relational

I've listed all slashes, asterisks, and full stops but not commas and tildes; I can send you these . . . so you can create overlapping categories if you want; first I just need to check through them and view each character -- some I've still to view!

Then the computer symbols += is an operator so = becomes an operator . . . unless unicode has a separate symbol += (I did not find it; sorry).

Then we have the issue of whether the Arabic vowels should always be labelled as diacritics or as vowel marks (I've always considered them all diacritics of the category vowel and I suppose the same is true for Hebrew)

Finally I have a question about
2140 Sm [Symbol] [Math] [Operator] [N-ary] DOUBLE-STRUCK N-ARY SUMMATION
how come this is an operator and an integral sum is not ???

Best,

--C. E. Whitehead
cewcathar@hotmail.com

Date/Time: Mon Jan 31 01:16:14 PST 2011
Contact: emmanuel@vallois.name
Name: Emmanuel Vallois
Report Type: Public Review Issue
Opt Subject: PRI 174 Proposed Draft UTR #49: Unicode Character Categories

I’m not a specialist in those scripts, nevertheless I would like to at least draw attention to a few cases I find doubtful:

-shouldn’t 037A Lm [Diacritic] [X] [X] [X] GREEK YPOGEGRAMMENI be Diacritic > Spacing ?

-AA70 Lm [Letter] [Consonant] [X] [X] MYANMAR MODIFIER LETTER KHAMTI REDUPLICATION

I think it’s a modifier, not a consonant

-AABD Lo [Letter] [Consonant] [Final] [X] TAI VIET VOWEL AN
AABE Mn [Letter] [Consonant] [Final] [X] TAI VIET VOWEL AM

Is the name incorrect or the category (vowel versus final consonant) ?

And for these, I think they should probably be more consistent:

-266D So [Symbol] [Music] [X] [X] MUSIC FLAT SIGN
266E So [Symbol] [Music] [X] [X] MUSIC NATURAL SIGN
266F Sm [Symbol] [Music] [X] [X] MUSIC SHARP SIGN
1D12A..1D133 So [Symbol] [Music] [Western] [Accidental] MUSICAL SYMBOL DOUBLE SHARP..MUSICAL SYMBOL QUARTER TONE FLAT

Should be consistent, i.e. 266D-266F should be Western accidental too.

-Should playing cards (1F0A0-1F0DF) be subdivided by suits, like mahjong tiles ?
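
The General_Category values quoted in the PRI 174 comments above (Sc, Sm, Po, So) can be checked against the Unicode Character Database. The following is a minimal sketch using Python's standard `unicodedata` module; the helper name `general_categories` is hypothetical, and the values returned depend on the Unicode version bundled with the interpreter. It only reports General_Category and says nothing about the draft UTR #49 categories themselves.

```python
import unicodedata

# Code points quoted in the PRI 174 comments above.
SAMPLES = [0x00A2, 0x007E, 0x0021, 0x266D, 0x266E, 0x266F]


def general_categories(code_points):
    """Map each code point to (General_Category, character name)."""
    return {
        cp: (unicodedata.category(chr(cp)), unicodedata.name(chr(cp)))
        for cp in code_points
    }


for cp, (gc, name) in general_categories(SAMPLES).items():
    # Same layout as the Categories.txt excerpts: code point, gc, name.
    print(f"{cp:04X} {gc} {name}")
```

Listing the three music signs side by side makes the Sm/So split mentioned in the comment easy to see.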

177 Proposed Update UTS #46: Unicode IDNA Compatibility Processing

Date/Time: Mon Dec 13 06:22:50 CST 2010
Contact: stephanfmueller@gmail.com
Name: stephan
Report Type: Error Report
Opt Subject: Typo in IdnaTest.txt?

In the file http://www.unicode.org/Public/idna/6.0.0/IdnaTest.txt, line 21 reads:

# Column 3: toUnicode - the result of applying toUnicode to the source, using the specified type

whereas in our understanding of http://www.unicode.org/reports/tr46/#ToUnicode it should read:

# Column 3: toUnicode - the result of applying toUnicode to the source, using nontransitional

Conversely, line 22 reads:

# Column 4: toASCII - the result of applying toASCII to the source, using nontransitional

whereas in our understanding of http://www.unicode.org/reports/tr46/#ToASCII it should read:

# Column 4: toASCII - the result of applying toASCII to the source, using the specified type

Can you please check.

Thank you,
Stephan

Date/Time: Sat Jan 8 12:34:14 CST 2011
Contact: cewcathar@hotmail.com
Name: CE Whitehead
Report Type: Public Review Issue
Opt Subject: tr 46 IDNA Compatibility Mapping

Hi, I've just briefly glanced at the tr46 data at

http://www.unicode.org/Public/idna/6.0.1/IdnaMappingTable.txt

(and have yet to go through the current version of tr46); however I noticed an inconsistency in the database mapping.

First you have:

U 065F (kashmiri) ; valid # 6.0 ARABIC WAVY HAMZA BELOW

Then you have:

U 0673 (also Kashmiri)

which can of course it seems be mapped to 0627 + 065F (sorry to say "it seems;" I know Arabic some; no kashmiri) -- so why are both U 0673 and U 065F valid???

I realize of course that you cannot map:

U 0672 (likewise kashmiri) as you do not have a corresponding wavy hamza above here (it may be in the supplements -- but I could not find it so I don't think so)

So either make U 065F invalid -- probably your best bet without a wavy hamza above but perhaps it is necessary to display the wavy hamza below U 065F alone (but why? -- it is not necessary to display the wavy hamza above alone -- but like I said I do not know kashmiri); or if there is a wavy hamza above encoded somewhere/some way then map U 0672 and U 0673 . . .

(I note that you have 0675 - 0678 mapped to another character plus 0674 so the above should perhaps be mapped . . . for consistency . . .)

Date/Time: Tue Feb 1 14:27:07 PST 2011
Contact: uts46@plan9.de
Name: Marc Lehmann
Report Type: Other Question, Problem, or Feedback
Opt Subject: uts#46 test data IdnaTest.txt

Hi!

I am trying to apply the IdnaTest.txt to my implementation, but there are some issues I found that I either don't understand (I am not an expert) or are problems with that file, specifically I used the 6.0.0 version.

227 TOA ERROR B (a.b.c。d。 => got a.b.c.d)

This refers to this line in IdnaTest.txt, which in my implementation yields "a.b.c.d" in toascii:

B; a.b.c。d。; a.b.c.d.;

This says, according to the comments in that file, that the result of ToASCII should be "a.b.c。d。", because column 4 is empty, which means either the result should be empty, or the result must match "a.b.c。d。". I don't see how it can be empty, so it must match "a.b.c。d。", which to me looks like a bug, as that shouldn't be the result of ToASCII. Maybe for this (and following) line column 3 and 4 have been swapped?

T; \u200D; [C2];

Here my implementation gives the empty string - I don't know if this is correct, but I feel the file format should differentiate between empty results and same-as-source results - it's too easy to confuse the "column is empty when result == source" (but might be empty for other reasons) statement with "column is empty means result == source".
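
The ambiguity described above can be made concrete with a small parser. This is a sketch only: the function name `parse_idna_test_line` and the returned field names are hypothetical, and the rule "a blank column means the same as the source" is the commenter's reading of the file header, not a confirmed statement of the file format. The extra `to_ascii_was_blank` flag shows one way a consumer could keep "blank" and "empty result" apart.

```python
import re


def _unescape(field):
    """Turn \\uXXXX escapes, as used in the test file, into characters."""
    return re.sub(r"\\u([0-9A-Fa-f]{4})",
                  lambda m: chr(int(m.group(1), 16)), field)


def parse_idna_test_line(line):
    """Split one data line into type, source, toUnicode and toASCII.

    Assumption: a blank column 3 or 4 stands for "same as source".
    """
    data = line.split("#", 1)[0].strip()   # drop trailing comments
    if not data:
        return None                        # comment-only or empty line
    fields = [_unescape(f.strip()) for f in data.split(";")]
    fields += [""] * (4 - len(fields))     # pad missing columns
    test_type, source, to_unicode, to_ascii = fields[:4]
    return {
        "type": test_type,
        "source": source,
        "to_unicode": to_unicode or source,
        "to_ascii": to_ascii or source,
        "to_ascii_was_blank": to_ascii == "",
    }


print(parse_idna_test_line("B; a.b.c。d。; a.b.c.d.;"))
```

Under this reading the first quoted line expects ToASCII to return the unmapped source, which is the result the commenter considers suspicious.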

I have 17 more test failures which are probably failures in my code (and that I have to further check), but the above two issues I couldn't resolve from reading the TR or the file itself.

Sorry for the noise if it's an error on my part.

Encoding Feedback

Date/Time: Sat Oct 30 16:15:34 CDT 2010
Contact: cowan@ccil.org
Name: John Cowan
Report Type: Feedback on an Encoding Proposal
Opt Subject: Pau Cin Hau Alphabet (L2/10-080)

I propose that the names be changed to alphabetic-style names. This would mean changing the initial letters from PA, KA, LA, etc. to P, K, L, etc. It also means changing the final letters to FINAL P, FINAL K, FINAL T, FINAL M, FINAL N, FINAL L, FINAL W, FINAL NG, FINAL Y. The proposal itself says this is what the letters mean, so they should be named in that way. The tone letters are numbered, which is very arbitrary: PAU CIN HAU LETTER SHORT VOWEL SENTENCE FINAL SANDHI TONE and the like would be very verbose, but at least clear.

Date/Time: Thu Feb 3 11:01:10 PST 2011
Contact: cowan@ccil.org
Name: John Cowan
Report Type: Feedback on an Encoding Proposal
Opt Subject: L2/11-033 Arabic Short Vowel Mark

My take on this is that it would be better to encode two letters ARABIC LETTER PALULA-SHINA SHORT E (treating the resemblance to waw as accidental) and ARABIC LETTER PALULA-SHINA SHORT O (treating the resemblance to Farsi-yeh and yeh-baree as accidental). This better fits the Arabic encoding convention that new letters get new characters rather than being represented with base+diacritic combinations.

Closed Issues

Date/Time: Wed Nov 3 02:36:45 CST 2010
Contact: diethelm_kanjahn@sil.org
Name: Didi Kanjahn
Report Type: Public Review Issue
Opt Subject: #176 Properties of Two Khmer Characters

To the best of my knowledge these code points were introduced with the idea that they might possibly be useful at some point in the future, in particular for academics. At this point of time, I am not aware of anybody or any application actually making use of these codepoints.

I hope that this still helps since the issue seems to be still open.

Other Reports: Indic

Date/Time: Mon Nov 1 22:23:20 CST 2010
Contact: barun_sahu@yahoo.com
Name: Barun Kumar Sahu
Report Type: Other Question, Problem, or Feedback
Opt Subject: Bengali character Om (Aum)

Bengali Om (or Aum) is distinct from the character Om in Devanagari, Gujarati, Tamil, Tibetan etc. At present, Bengali Om is written as two characters: U+0993 U+0981 (letter O, chandrabindu). Will it not be better to encode Bengali letter Om as an independent character, inasmuch as it is to be pronounced differently than U+0993 U+0981.

Date/Time: Mon Nov 29 04:34:37 CST 2010
Contact: jamadagni@gmail.com
Name: Shriramana Sharma
Report Type: Submission (FAQ, Tech Note, Case Study)
Opt Subject: 1CD3 VEDIC SIGN NIHSHVASA

I have previously requested that the character 1CD3 VEDIC SIGN NIHSHVASA be annotated as "separates sections between which a pause is disallowed" and accordingly the annotation was added to be present in the Unicode 6.0 code chart.

I hereby additionally request that another annotation be added to this character to recognize a glyphic variant:

"a glyph variant is upright and not oblique"

As substantiation for this request, please refer to the sample given near the top of p 16 of my Grantha proposal L2/09-372. Towards the end of the third line in the Grantha text (before the Tamil-Grantha number 5 on top of HAA) one sees that this character has been presented upright. In Sama Vedic books printed in the Grantha script, this character is always upright and not oblique.

I have also observed this character in upright form in Devanagari texts of the Sama Veda published in Tamil Nadu. Thus it is possible that it is a regional variation (i.e. Tamil Nadu vs North India).

Also observe the following passage from sec 4.5.2 of my Grantha proposal (same page as image mentioned above):

<quote> This superscript double danda is represented by 1CD3 VEDIC SIGN NIHSHVASA, though the representative glyph in the Vedic Extensions block code chart is somewhat oblique rather than upright as in Grantha texts. Whether upright or oblique, the purpose of the character is clear and is one and the same. In Devanagari texts we come across both the upright and oblique forms. To cater to the expectation of Sama Vedic scholars using Grantha, a Grantha font can show an upright glyph. Therefore these are mere stylistic variations and hence not worthy of disunification. </quote>

Given that we have decided not to disunify the character for the glyph change, and given that 097D DEVANAGARI LETTER GLOTTAL STOP has an annotation regarding a major glyphic variant, it is justified to add a similar annotation for a major glyphic variant with the wording given above.

The wording may be improved, if felt necessary.

Thanks.

Date/Time: Sat Jan 29 06:50:06 PST 2011
Contact: jamadagni@gmail.com
Name: Shriramana Sharma
Report Type: Other Question, Problem, or Feedback
Opt Subject: Inappropriate annotation for Malayalam TTTA and NNNA

The characters Malayalam TTTA and NNNA proposed by N3494 L2/08-325 and subsequently encoded have received the annotation "historic use only". In my document L2/09-341 I had requested these be labeled "rare use" characters.

On reflection, neither my previous suggestion nor the current annotation is appropriate. My previous suggestion does not sufficiently outline the actual use case of these characters and in fact even suggests that these are in current use albeit rare. The current annotation directly implies that these were historically used.

In fact apart from one Malayalam grammar text by one Rajaraja Verma which proposes the potential use of these characters and two works by non-native users (Gruenendahl and Pederson) which obviously draw from Verma's work, there is absolutely no attestation for actual usage of these characters. Verma only suggested these but they never caught on in real life. (If they were actually used, the proposers of L2/08-325 would/should have provided such attestation.)

Therefore the only proper characterization of these characters is as "potential pedantic use" which represents the original intent of Verma in devising these and any other annotation would be either too vague or directly or indirectly misleading. I hence request that the annotation of these characters be changed to either "potential pedantic use" or at least "pedantic use".

Thanks.

Other Reports: Han

Date/Time: Mon Nov 8 20:07:33 CST 2010
Contact: koxinga@wanadoo.fr
Name:
Report Type: Error Report
Opt Subject: Unihan file (Unihan_Variants.txt): errors in simplified/traditional relationships

Hello,

Some of the relationships traditional-simplified in the Unihan database (Unihan_Variants.txt) look wrong to me. For example, on lines 34-35 we can see

U+346F kSimplifiedVariant U+3454
U+346F kTraditionalVariant U+3454

which should be (if I didn't mix them up ...)

U+346F kSimplifiedVariant U+3454
U+3454 kTraditionalVariant U+346F

My quickly done parsing program counted 1154 such pairs. It seems to be always in the order "kTraditionalVariant" then "kSimplifiedVariant", so can maybe be automatically corrected and then proofread. This correction should be pretty easy and I can help with that if you are interested, giving a diff file or a complete file with a list of changes.
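
A check of the kind described above could look like the following sketch. It is not the commenter's program: the function name `find_suspect_pairs` is hypothetical, and it assumes each data line has the form "U+XXXX, property name, space-separated values" separated by whitespace, as in the two lines quoted. It flags code points that list the same other character as both their simplified and their traditional variant.

```python
from collections import defaultdict

PROPS = ("kSimplifiedVariant", "kTraditionalVariant")


def find_suspect_pairs(lines):
    """Return (code point, variant) pairs listed under both properties."""
    variants = defaultdict(lambda: defaultdict(set))
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue                      # skip blanks and comments
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue                      # not a data line
        cp, prop, values = parts
        if prop in PROPS:
            variants[cp][prop].update(values.split())

    suspects = []
    for cp, props in variants.items():
        shared = props[PROPS[0]] & props[PROPS[1]]
        shared.discard(cp)                # a self-mapping is a separate case
        suspects.extend((cp, other) for other in sorted(shared))
    return sorted(suspects)


sample = [
    "U+346F\tkSimplifiedVariant\tU+3454",
    "U+346F\tkTraditionalVariant\tU+3454",
]
print(find_suspect_pairs(sample))
```

Each flagged pair would still need proofreading by hand, as the report suggests, since the script cannot tell which direction is the correct one.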

Regards,

Date/Time: Sat Jan 8 11:56:31 CST 2011
Contact: brettz9@yahoo.com
Name: Brett Zamir
Report Type: Error Report
Opt Subject: UAX 38

For UAX 38, there are the following errata:

1) "This has change with" to "This has changed with"

2) "Altered the synax" to "Altered the syntax"

Also, stylistically, I'm not sure it is really all that necessary to say, "naïvely" in the sentence: "Users should not naïvely assume that learning to pronounce an East Asian language..."

And the sentence "The residual stroke count taken is from the first value in the character’s kRSUnicode field." when compared to the sentence "The radical number used is that of the first value in the character’s kRSUnicode field." I think would better clarify that they are using different parts of the first value.

Date/Time: Thu Jan 13 17:03:32 CST 2011
Contact: Robert.Siemer-unicode@backsla.sh
Name: Robert Siemer 司马洛
Report Type: Error Report
Opt Subject: TR38 traditional/simplified Chinese property description errors

Dear team,

Unihan has multiple issues regarding simplified/traditional Chinese relations. I elaborate one here, which hopefully makes it easier to talk about the others in the future:

TR38, seen here: http://www.unicode.org/reports/tr38/ Version “Unicode 6.0.0”

explains the properties kSimplifiedVariant and kTraditionalVariant. Both properties are very similar and use the same description template, but the one of kSimplifiedVariant seems outdated. E.g. it talks about a “value”, not “value(s)”, suggesting that kSimplifiedVariant does never have more than one value, which is wrong. (See 瀋 U+700B, which has two simplified variants.)

I suggest updating kSimplifiedVariant and put a reference in kTraditionalVariant instead of a full copy to 1) avoid this mistake happening again, 2) clear up that these are related fields and 3) that they work the very same way!

Further, the description itself is so flawed that the explanation of the mistakes needs more space than the proposal of a new one. This is why I do the latter first and explain the problems further below.

--- Proposal for the description of property kSimplifiedVariant:

The Unicode value(s) for the simplified Chinese variant(s) for this character.

A modern character which has neither kSimplifiedVariant nor kTraditionalVariant is used unchanged in simplified and traditional Chinese.

If a character has the kSimplifiedVariant and not the kTraditionalVariant property, it is a traditional character only and the property value lists the corresponding simplified character(s). In the opposite case the character is simplified only and the value lists the corresponding traditional character(s).

If both properties exist, the character exists in both simplified and traditional Chinese, but not just representing itself, in which case none of these two properties would be given.

In case both properties are given, both include the character itself if it may map to itself as well, which is the case for some 1-to-n mappings (e.g. 台 U+53F0). Otherwise both properties do not include the character itself, which means the character in simplified context has nothing to do with the same character in traditional context (rare, e.g. 苧 U+82E7).

Much of the data on simplified and traditional variants was graciously supplied by Wenlin Institute, Inc. <http://www.wenlin.com>.

--- end of proposal
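
The cases in the proposed description can be read as a small decision procedure. The sketch below is one interpretation of that proposal, not part of TR38: the function name `classify` and the returned labels are hypothetical, and the "inconsistent" branch covers a combination the proposal does not describe. The inputs are the value sets of kSimplifiedVariant and kTraditionalVariant for one character.

```python
def classify(char, simplified_variants, traditional_variants):
    """Classify a character from its two variant value sets.

    Follows the case analysis in the proposal above; labels are
    illustrative only.
    """
    has_simp = bool(simplified_variants)
    has_trad = bool(traditional_variants)

    if not has_simp and not has_trad:
        return "unchanged in simplified and traditional"
    if has_simp and not has_trad:
        return "traditional only"
    if has_trad and not has_simp:
        return "simplified only"

    # Both properties are present from here on.
    in_simp = char in simplified_variants
    in_trad = char in traditional_variants
    if in_simp and in_trad:
        return "both, may map to itself"
    if not in_simp and not in_trad:
        return "both, unrelated in the two contexts"
    return "inconsistent: self listed in only one property"


# Placeholder values, not real Unihan data.
print(classify("A", {"A", "B"}, {"A", "C"}))
```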

The original description is misleading and erroneous, which I will explain here. The current description contains:

“Note that a character can be both a traditional Chinese character in its own right and the simplified variant for other characters (e.g., 台 U+53F0).”

Bad example. The example is not wrong, but 台 U+53F0 is not JUST the simplified variant for OTHER characters, but also for itself! 苧 U+82E7 is a better example of a character being traditional and a simplified variant of another character.

The description continues:

“In such case, the character is listed as its own simplified variant and one of its own traditional variants.”

Not true. Such a case (that is, a character is traditional and simplified at the same time (but not just for itself)) is indicated by merely having both properties kSimplifiedVariant and kTraditionalVariant at the same time. There is no requirement to list itself in one or both properties. Again, 苧 U+82E7 is such an example.

It further reads:

“This distinguishes this from the case where the character is not the simplified form for any character (e.g., 井 U+4E95).”

This is wrong as well. First, the expression “not the simplified form for any character” is very misleading, because it sounds like being a traditional character without being a simplified one. Characters like this exist, but 井 U+4E95 is not one of these! -- In reality, the character 井 U+4E95 could have itself in kSimplifiedVariant and kTraditionalVariant and everything would be as clear as before, so nothing needs distinction here at all.

The real case when we need distinction, and the lack of marking of self-mappings is not expressed in the original description, but is so in my proposal.

Once the description reflects the meaning of these fields I will send in bugs in the Unihan DB itself.

Regards,
Robert

Date/Time: Thu Jan 20 03:13:44 CST 2011
Contact: yoshiyuki.oguma@ubin.jp
Name: OGUMA, Yoshiyuki
Report Type: Error Report
Opt Subject: Incorrect glyphs in codebook

ED NOTE: The affected characters are covered in the report L2/11-036.

In the Unicode 6.0.0 codebook (http://www.unicode.org/Public/6.0.0/charts/CodeCharts.pdf) page 1289-1290, the glyphs of U+20534-U+20539 are incorrect.

Other Reports: Misc

Date/Time: Mon Nov 1 12:37:37 CST 2010
Contact: antonis.tsolomitis@gmail.com
Name: Antonis Tsolomitis
Report Type: Error Report
Opt Subject: U00B7 and U0387

Dear Sir, in the current standard characters U00B7 (periodcentered) and U0387 (anoteleia) are considered equivalent, and the standard actually says that we should prefer U00B7 instead of U0387. However, these are different characters.

periodcentered U00B7 is designed at half the x-height and DOES NOT EXIST in the Greek Grammar.

anoteleia (U0387) is designed at the x-height and it is a functional punctuation mark in the Greek language. I can definitely provide authoritative evidence for this from Greek grammar standard books.

So how come there has been this confusion? The suggestion of the consortium creates problems with font designers as well as people that maintain keyboard layouts for Greek. Is there any way to resolve this?
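
The equivalence the report refers to can be observed directly: U+0387 has a singleton canonical decomposition to U+00B7, so normalization replaces it. The following is a minimal sketch with Python's standard `unicodedata` module; the constant and function names are hypothetical. It illustrates the behavior of the standard and does not address the typographic question raised above.

```python
import unicodedata

ANO_TELEIA = "\u0387"   # GREEK ANO TELEIA
MIDDLE_DOT = "\u00B7"   # MIDDLE DOT


def canonical_form(text):
    """Return the NFC form of the given text."""
    return unicodedata.normalize("NFC", text)


print(unicodedata.name(ANO_TELEIA), "->",
      unicodedata.name(canonical_form(ANO_TELEIA)))
print("decomposition:", unicodedata.decomposition(ANO_TELEIA))
```

Because the mapping is canonical, any text that passes through normalization loses the distinction between the two code points, whichever one the keyboard layout produced.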

thank you,

Antonis Tsolomitis
Assistant Professor
Department of Mathematics
University of the Aegean

Date/Time: Sun Nov 28 11:56:34 CST 2010
Contact: paracelsus@gmail.com
Name: Dag Agren
Report Type: Error Report
Opt Subject: Incorrect description of U+1F4A0

Unicode 6.0 lists U+1F4A0 as "DIAMOND SHAPE WITH A DOT INSIDE", with a comment of "kawaii, cute". This seems to be because of a failure to understand the original symbol it is mapped from, F99D in DoCoMo's extended Shift-JIS.

The reason is probably that the only image of this icon is very small and hard to see. It can be seen on http://potora.dip.jp/emojimente/index.php?mode=list_docomo&page=3, for instance. However, what it actually is is a picture of a flower, to signify cuteness. It is indeed labelled "kawaii" which means "cute".

"DIAMOND SHAPE WITH A DOT INSIDE" is just a description of the somewhat crude shape of the small icon for it, and does not at all seem to be a useful description of the code point. It seems it should be renamed "flower" or "cute flower" or something along these lines.

Date/Time: Thu Dec 30 19:23:14 CST 2010
Contact: liancu@microsoft.com
Name: Laurentiu Iancu
Report Type: Error Report
Opt Subject: Glue characters not listed in UAX #14

Please refer to http://www.unicode.org/reports/tr14/#GL.

UAX #14 was edited for Unicode 5.2 to clarify that the characters listed as examples of a given line-breaking class are not exhaustive. However, the section on glue (lb=GL) characters was not edited this way. Its wording implies that the list of glue characters in UAX #14 is still exhaustive.

In Unicode 6.0, two Tibetan annotation marks were encoded with lb=GL:

- U+0FD9 TIBETAN MARK LEADING MCHAN RTAGS
- U+0FDA TIBETAN MARK TRAILING MCHAN RTAGS

Assuming their line-breaking class is correct, either these characters should be added to the list of glue characters in UAX #14 or that section should mention that the list is not exhaustive.

Happy New Year!

Date/Time: Wed Jan 12 02:10:42 CST 2011
Contact: matial@il.ibm.com
Name: Matitiahu Allouche and Mohamed Mohie
Report Type: Submission (FAQ, Tech Note, Case Study)
Opt Subject: proposal for a new character: Arabic Letter Mark (ALM)

ED NOTE: A separate document L2/11-005 has been submitted on this, and it is included here only for completeness.

Unicode includes the LRM (U+200E) and RLM (U+200F) characters. They are invisible characters which creators of bidirectional text can use to solve display issues that the UBA (Unicode Bidirectional Algorithm) does not address adequately. The use of these characters is mentioned and even recommended in tutorials like:

- H34: Using a Unicode right-to-left mark (RLM) or left-to-right mark (LRM) to mix text direction inline ( http://www.w3.org/TR/WCAG20-TECHS/H34.html )

- Internationalization Best Practices: Handling Right-to-left Scripts in XHTML and HTML Content ( http://www.w3.org/International/geo/html-tech/tech-bidi.html#ri20030726.140315918 )

However, RLM may not be appropriate in an Arabic context, because while it is effective from the ordering point of view, it neutralizes the effect of preceding Arabic letters on the following Arabic-European digits. The UBA specifies that Arabic letters form an Arabic context wherein following Arabic-European digits must be handled as Arabic-Indic digits, but the presence of an RLM, which may be needed for ordering reasons, destroys this context. In addition, there is a need to transform Arabic-European digits into Arabic-Indic digits when these digits are positioned at the start of the text like in formulas and numbered lists.

What is needed is a new character equivalent to RLM, but with the same bidi character type as Arabic letters. Such a character could be named ALM (Arabic-Letter Mark). It will be a normally invisible character (like RLM) with a bidi character type AL (unlike the R bidi character type of RLM), and this character should be a non-joiner as related to Arabic text shaping.
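
The effect described above, where an intervening RLM breaks the Arabic context for following digits, can be illustrated with a heavily simplified form of UBA rule W2 (a European number becomes an Arabic number when the nearest preceding strong type is AL). This is a sketch under stated assumptions: it treats the text as a single flat run, ignores embeddings and every other UBA rule, and the function name `resolve_european_digits` is hypothetical.

```python
import unicodedata

STRONG_TYPES = {"L", "R", "AL"}


def resolve_european_digits(text):
    """Return bidi classes with EN changed to AN after an AL context."""
    resolved = []
    last_strong = None                    # stands in for start of run
    for ch in text:
        cls = unicodedata.bidirectional(ch)
        if cls in STRONG_TYPES:
            last_strong = cls
        if cls == "EN" and last_strong == "AL":
            resolved.append("AN")         # digit inherits Arabic context
        else:
            resolved.append(cls)
    return resolved


ALEF, RLM = "\u0627", "\u200F"
print(resolve_european_digits(ALEF + " 1"))    # digit follows an Arabic letter
print(resolve_european_digits(ALEF + RLM + "1"))  # RLM sits in between
```

In the second call the RLM, whose bidi class is R, becomes the nearest strong character, so the digit keeps class EN. A mark with bidi class AL, as requested in this submission, would not have that side effect.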

Such a character must be located in the BMP, preferably in block 20xx like LRM and RLM.

Please let me know how to proceed with this submission.

Date/Time: Mon Jan 31 03:37:44 PST 2011
Contact: geobulga@yahoo.it
Name: Giorgio Bulgarelli
Report Type: Other Question, Problem, or Feedback
Opt Subject: Unicode characters for the Romanised transliteration of Amharic

ED NOTE: This person received one reply from the Unicode office with a pointer to more information.

Dear Sirs,

Please, accept my sincerest apologies for any inconvenience this e-mail may cause you.

I have to point out that I am neither a transliteration nor an encoding expert, and that I don't have a perfect knowledge of English.

My problem is that of correctly transliterating Amharic names of monuments and geographical names into Latin characters instead of using one of those various existing anglicised transcriptions.

As you probably know, there is no internationally agreed ISO transliteration of Ge'ez characters, as there is, for example, for Arabic, Cyrillic and Greek scripts. There are several transcriptions used locally and in tourist guides that often lead to different and even matchless results. For example, the name of mount ራስ ደጀን is written with the five Ethiopic Unicode characters 122B-1230 12F0-1300-1295 (=RaaSa DaJaNe) but you can find it transcribed in four different ways —Ras Dajen, Ras Dejen, Ras Dashen and Ras Deshen— whilst it should be —in my opinion more properly— written only Ras Dajan. Sometimes, the transliterated word sounds so strange that locals are even unable to realise what it is!

Now, the widely used Unicode character set UTF-8 allows to type all and every character of any language that uses Latin scripts, as well as any transliterations into romanised characters of languages with non-Latin scripts. I found two semi-official scholarly used transliterations developed in Germany, the country where the first and most important studies on Ethiopia have been carried on. They are the transliteration developed by Ernst Hammerschmidt, professor of African and Ethiopian Languages and Cultures at the University of Hamburg, in the '70s and, more recently, the EAE transliteration, developed by Encyclopaedia Aethiopica (http://www1.uni-hamburg.de/EAE/transf.html). You can find on this site the transliteration table that is based on the Transcription/transliteration system of the EAE. The related document "EAE-Phonetiktabelle.doc" can be downloaded from there.

Unfortunately, I was unable to find —and type and/or print— any Unicode character to transliterate the following Ethiopic ones:

ቐ 1250 (QHa) transliterated with LATIN LETTER Q WITH LINE BELOW
ጨ 1328 (CHa) transliterated with LATIN LETTER C WITH CARON AND DOT BELOW
ጰ 1330 (PHa) transliterated with LATIN LETTER P WITH DOT BELOW
ፀ 1340 (TZa) transliterated with LATIN LETTER S WITH ACUTE AND DOT BELOW
ቈ 1248 (QWa) transliterated with LATIN LETTER Q WITH APEXED LETTER CAPITAL W
ኈ 128A (XWa) transliterated with LATIN LETTER H WITH BREVE BELOW AND APEXED LETTER CAPITAL W
ኰ 12B0 (KWa) transliterated with LATIN LETTER K WITH APEXED LETTER CAPITAL W
ጐ 1310 (GWa) transliterated with LATIN LETTER G WITH APEXED LETTER CAPITAL W

I wonder why, with so many Unicode Latin characters, at Encyclopaedia Aethiopica they had to invent them, and even use some digraphs, whilst even more than an existing Unicode Latin character would have been able to represent them.

In any case, Encyclopaedia Aethiopica has made available the font set EAE Garamond.ttf that can be downloaded from their site.

Unfortunately, since its characters do not belong to the Unicode set, it is not possible to display and print them unless the related font is installed.

When do you expect to add these characters to the Unicode Latin set?

Do you think that the ISO/IEC 9995-3:2010 keyboard layout (downloadable from the site http://www.iso.org/iso/catalogue_detail.htm?csnumber=52869) should allow to type them?

Looking forward to hearing from you at your earliest convenience, I remain,

Yours sincerely,

Giorgio Bulgarelli

Date/Time: Tue Feb 1 08:33:59 PST 2011
Contact: tim@pederick.id.au
Name: Tim Pederick
Report Type: Error Report
Opt Subject: UAX#42 Copy/paste error

In UAX #42, subsection 4.4.15 Indic Properties, the formal definitions of the two properties bear the property names of the two from the previous subsection, each in one place.

Current lines:

[hst property, 37] =
...
[jamo property, 38] =

Probably should be:

[InSC property, 37] =
...
[InMC property, 38] =

Date/Time: Wed Feb 2 05:46:59 PST 2011
Contact: arietencate@zonnet.nl
Name: Arie ten Cate
Report Type: Error Report
Opt Subject: characters are in mirror image

The characters of the Phaistos Disc are shown in mirror image: left and right exchanged. This is easily seen by comparing the signs with a photograph of the disc (any of the two sides), such as one of the excellent photographs at http://en.wikipedia.org/wiki/Phaistos_Disc