Comments on Public Review Issues (Nov 13, 2004 - Feb 3, 2005)

The sections below contain comments received on the open Public Review Issues as of February 3, 2005, since the previous cumulative document was issued prior to UTC #101 (November 2004). Two closed issues received feedback during the period.

At the end, Unicode 4.1.0 beta feedback comments that were submitted via the reporting form are appended.

46 Closed Issue: Proposal for Encoded Representations of Meteg

Date/Time: Mon Nov 29 09:53:20 CST 2004
Contact: Peter R. Mueller-Roemer

The need for three relative graphical positions of meteg is well documented. The suggested proposal of using control characters, though, is not optimal. The insertion of hidden characters (without graphical or at least recognizable effect) should be the exception.

To make Unicode more useful to the large group of people who have only occasional use of special complex/combined characters, a general default rule for the graphical representation would be preferable. See my conference proposal "A single keyboard layout ..." ('single' is NOT the essential point). In Hebrew, Greek and Latin-based alphabets, combining diacritical sequences should follow general default graphical composition rules, which could in rare cases be overridden by control characters, rather than leaving it to the implementers to decide what default they want to establish.

There are already good solutions for individual complex characters (graphical representations of base+combining sequences or of single code points that could be decomposed), but I see an almost universal de facto default rule of representing combining sequences graphically by the outdated overstrike composition.

A positive example (it worked in Outlook Express, Word, OpenOffice 1.1.3, and Netscape 7.1 with Arial Unicode MS): u + umlaut/trema/diaeresis + macron above, and in the exchanged order u + macron above + diaeresis, yield TWO distinct, well readable complex characters, ǖ ṻ, which are needed and can be understood in different contexts, e.g.:

eg1: guide to pronunciation: German u-umlaut as a long vowel versus long u spoken as a separate vowel. eg2: technical writing: the average of the second derivative of u versus the second derivative of the averaged u (the former second derivative does not exist in some cases where the latter does).
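For illustration, the behaviour the commenter praises can be confirmed with Python's standard unicodedata module: both combining marks have combining class 230, so canonical ordering does not reorder them and the two sequences stay distinct under normalization.

```python
import unicodedata

seq1 = "u\u0308\u0304"  # u + COMBINING DIAERESIS + COMBINING MACRON
seq2 = "u\u0304\u0308"  # u + COMBINING MACRON + COMBINING DIAERESIS

# Both marks have ccc 230, so neither sequence is reordered, and NFC
# composes them into two different precomposed characters.
assert unicodedata.normalize("NFC", seq1) == "\u01D6"  # ǖ
assert unicodedata.normalize("NFC", seq2) == "\u1E7B"  # ṻ
```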

Enthusiastic response of the occasional user of such complex symbols: finally a simple way to type them with just three keystrokes (or combinations with, e.g., Shift) rather than the lengthy process of finding and inserting them from a symbol table.

The eg1 user will of course expect the same good behaviour for the base characters o, e, i, and will expect an acute accent (for main stress) to combine above these complex characters rather than merging in an overstrike fashion. Big disappointment in the implementations I tested.

The eg2 user will even expect such combinations to yield TWO results with ANY other base character (symbols used for meta-variables, Greek capital delta or nabla, and even a couple of Hebrew consonants are used in math/logic). Some combining diacritical marks are combined in the expected way even for tall or taller letters (but the base T absorbs diaeresis and macron in an overstrike manner). Thus there is no difficulty implementing the desired behaviour, and a default rule (for any base character and combining sequence) would actually ease the implementation of fonts etc. portable across computer and software systems, national and cultural boundaries.

The Unicode Technical Committee seems to me to have the policy that for any combination of base+combining sequence a separate character proposal has to be submitted. I hope I am wrong. I am pleading for accepting proposals on behalf of the wider future Unicode user base covering large classes of characters and their composition.

Greek and Hebrew need easily typable decompositions of complex characters. Narrow diacriticals above (breathing mark and tonos) should center above a Greek vowel and (at least) Greek rho when alone, but two of them together should appear clearly separated side by side above any base and be centered together above narrow and wide bases (iota, omega). Present implementations do not even allow combining in the desired way in all the necessary places.

Hebrew accents are well implemented for isolated appearance below a base, but they also need to combine with vowel points side by side, and presently they are not so implemented. One can't currently copy even the first word of the Bible, e.g. from the BHS, without running into problems.

For the occasional quoting of a short biblical phrase within an English text it would be much simpler to type base + sheva + vowel point instead of having three separate places to look for hatef vowels. The problem with meteg would be taken care of by the suggested standard for graphically composing complex characters.

I will be able to suggest a recursive rule that would be good for most Latin-based alphabets and their complex symbols, as well as for Greek and Hebrew. I hope to find a positive echo and help from the UTC and the Unicode-interested world. Though not representing any institution or company, I am speaking for many multilingual and technical writers who do not have the time to make individual proposals.

Yours truly

Peter R. Mueller-Roemer

47 Closed Issue: Changes to default collation of Latin in UCA

Date/Time: Wed Nov 24 06:01:06 CST 2004
Contact: Mattias Ellert

1. Letter Æ

That the current UCA doesn't treat Æ as AE is in my mind its biggest flaw. The UCA is supposed to be a "sensible default". At the moment every single localization must customize its treatment of this letter, so it is as far from a sensible default as you can get. Making this change is necessary for the UCA to be what it was designed to be, i.e. a sensible default. Not changing this risks making the UCA a non-standard, since everyone has to ignore it on this point.

2. Diacritics a secondary difference

This is also a good suggestion, IF you make a clear distinction between base letters modified by adding a diacritic and letters created by modifying the shape of the base letter.

At the moment the UCA is almost consistent here, giving secondary difference to characters changed by adding a diacritic, but giving primary difference to characters created by modifying the shape of the base letter. If the few deviations from this rule are changed to make it consistent the UCA would be better.

Of the characters listed in the background document, Ø (O with stroke), U+0110 (D with stroke), Ł (L with stroke), U+013F (L with middle dot), Ħ (H with stroke), U+0166 (T with stroke), U+01E4 (G with stroke), U+0197 (I with stroke) and U+01B5 (Z with stroke) fall into the first category and should be changed to have only a secondary difference from the base character.

However, the characters Ð (eth) and U+0189 (African D) do not fall into this category, since their lowercase equivalents do not look like a small d with stroke.

It is not a good idea to change the letters where the shape of the base letter has changed to have only a secondary difference, since it would be very difficult to do this classification. Is, e.g., ezh sufficiently different from z to be sorted as a separate character or not? Should all Gs be sorted as Cs, since G is a modified version of C? Every single person will answer these questions differently, and even the same person might give different answers in different circumstances. The only reasonable default is to treat each modified letter as a base letter in its own right, with a primary difference in the UCA.
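The distinction drawn here is visible in the character properties themselves: a letter formed by adding a diacritic carries a canonical decomposition to the base letter plus a combining mark, while a letter formed by modifying the base shape does not. A quick check with Python's unicodedata module illustrates this:

```python
import unicodedata

# Ö is O plus a combining diacritic: it has a canonical decomposition,
# which is what lets a collation tie it to O with a secondary difference.
assert unicodedata.decomposition("\u00D6") == "004F 0308"  # Ö -> O + diaeresis

# Ø is a modified letter shape: it has no canonical decomposition, so by
# default the UCA has nothing tying it to O at the secondary level.
assert unicodedata.decomposition("\u00D8") == ""           # Ø
```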

Mattias Ellert

48 Closed Issue: Definition of "Directional Run"

Date/Time: Wed Nov 17 18:15:04 CST 2004
Contact: Tex Texin

This definition: "BD3b Directional Run: A sequence of characters that starts and ends at directional boundaries, and otherwise contains no directional boundaries." should refer to the "longest sequence".

Since directional boundaries are defined to be the ends of a sequence of characters, a string of n characters would have a run between any sequence of characters that does not itself span a boundary, i.e. the sequences {1}, {1,2}, {1,2,3}, {2,3}, etc., up to and including {1,n}... {n-1,n}, would all be runs.

Although it might be useful to agree that these are all in fact runs, I think the definition really intends to define the longest sequence, so that there is a single answer with respect to the bidi algorithm.
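A toy sketch (hypothetical code, using a deliberately coarse direction classification rather than the full set of bidi categories) of why taking the longest stretches yields a single decomposition:

```python
import unicodedata
from itertools import groupby

def direction(ch):
    # Coarse classification for illustration: R and AL are right-to-left,
    # L is left-to-right, everything else is lumped together.
    bc = unicodedata.bidirectional(ch)
    return "R" if bc in ("R", "AL") else ("L" if bc == "L" else "other")

def maximal_runs(text):
    """Split text into the LONGEST stretches of one direction class.

    Taking maximal stretches yields exactly one decomposition of the
    string; allowing any boundary-free subsequence would yield many.
    """
    return ["".join(group) for _, group in groupby(text, key=direction)]

# 'abc' + two Hebrew letters + 'def' -> exactly three maximal runs
assert maximal_runs("abc\u05D0\u05D1def") == ["abc", "\u05D0\u05D1", "def"]
```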


51 Proposed Update UAX #29 Text Boundaries

Date/Time: Fri Jan 28 13:13:45 CST 2005
Contact: John Cowan

The readability of Table 3 would be improved if, instead of saying things like "and not GRAPHEME EXTEND = true" it said "and GRAPHEME EXTEND = false".

53 Proposed Draft UTR #33 Unicode Conformance Model

No feedback received this period.

54 Proposed Update UTS #22 Character Mapping Markup Language

No feedback received this period.

56 Proposed Update UAX #14 Line Breaking Properties

Date/Time: Thu Dec 9 03:14:23 CST 2004
Contact: Mattias Ellert

U+16EB, U+16EC and U+16ED belong to the list of characters in the Word Separator list following the paragraph

"Historic text, especially ancient ones, often do not use spaces, even for scripts where modern use of spaces is standard. Special punctuation was used to mark word boundaries in such texts. For modern text processing use these should be treated as linebreak opportunities by default. WJ can be used to override this default, where necessary."

Date/Time: Fri Dec 17 20:43:30 CST 2004
Contact: Asmus Freytag

Based on discussion on the list, the following probably are best re-classified as EX:

Kamal Mansour wrote:

>>     060C;AL # ARABIC COMMA
> These are only used as sentence ending punctuation and are not used
> as part of numbers. No spaces occur between any words and a following
> punctuation. Finally, Arabic is not commonly used in conjunction with
> ideographic text which simplifies things.

In traditional Arabic typography, one often sees spaces surrounding a punctuation mark such as comma or any of the others above. Over the past decade, DTP has somewhat reduced the frequency of this practice, but for the purpose of an algorithm, one couldn't count on the lack of white space between a word and an adjoining punctuation mark. The situation for Arabic would not be so different from French practice with regard to spacing around punctuation.

-- end --

The effect of this change (AL->EX) is to be more restrictive when a space separates a word from a punctuation character. Class EX does not interact with numeric punctuation the way CL or IS do.

EX would not be a good choice if any of these characters could legitimately ever occur as the first character after a computed line break (other than after ZWSP which is the generic override).

I suspect very few instances of line breaks in existing data would be affected, and the majority probably positively.


57 Changes to Bidi categories of some characters used with Mathematics

No feedback received this period.

58 Characters with cedilla and comma below in Romanian language data

Date/Time: Mon Jan 31 02:06:45 CST 2005
Contact: Laurentiu Iancu

Dear Unicode Representative:

With respect to the public review issue #58 (cedilla vs. comma below in Romanian data), this report summarizes my findings as a native speaker of the language and the feedback that I collected from friends working in the Romanian IT industry. There is no need to acknowledge this report.

The primary sources of Romanian-language data are:

- Databases inherited or migrated from old DOS systems;
- Automatic import from other online sources (barcode readers, OCR);
- Manual input from the keyboard. I tend to believe that this case constitutes the main source of data.

Romanian-language data can be found in an encoding that falls in one of the following four categories:

  1. No diacritics at all -- data encoded in plain ASCII, CP 850, CP 852, ISO 8859-1, or Windows-1252;
  2. Some diacritics above -- data using only a-circumflex and i-circumflex and encoded in the same code pages as above;
  3. Cedillas -- data encoded in ISO 8859-2 or Windows-1250, or, to a much lesser extent, in Unicode;
  4. Commas below -- data encoded in Unicode or ISO 8859-16.

Categories 1 and 3 are by far the most widely used. The bulk of the e-mail uses either ASCII or ISO 8859-2/Windows-1250 (Windows CE settings). Newsgroups and Web sites that I visited also use one of these encodings. Some Web sites that attempt to improve the legibility of their content or are more concerned with the language (such as online dictionaries, journals, or public institutions) use cedilla even with pages encoded in UTF-8. Consequently, characters with cedilla are more widespread.

I believe that the prevalence of cedilla is due to the default settings of keyboard drivers and fonts used. For instance, the Romanian regional settings in Windows XP generate the characters with cedilla. Fonts such as Times New Roman and Verdana, which are widely used in Web pages, allow the display of the same characters as well, and do not contain glyphs at the comma-below code positions.

When applied to manually entered text as the main source of data, these default settings propagate as the de facto standard. However, there are users who are aware of the issue and admit their frustration at not being able to enter or visualize the correct comma-below forms, or at least not easily.

In my opinion, the chances that the comma-below characters will be adopted in Romanian data can be increased by making them part of the default settings in systems and repositories such as the CLDR. Otherwise, the erroneous de facto situation can only be further entrenched. This concern applies especially to the old collections of data with no diacritics, which sooner or later will be migrated to new encodings. In such applications, the CLDR may play a decisive role.
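A minimal sketch of such a migration (hypothetical helper, not from the report; the mapping itself is the standard cedilla-to-comma-below correspondence):

```python
# Map the cedilla forms (U+015E/U+015F, U+0162/U+0163) to the
# comma-below forms (U+0218..U+021B) preferred for Romanian.
CEDILLA_TO_COMMA = str.maketrans({
    "\u015E": "\u0218",  # S with cedilla -> S with comma below
    "\u015F": "\u0219",
    "\u0162": "\u021A",  # T with cedilla -> T with comma below
    "\u0163": "\u021B",
})

def to_comma_below(text):
    """Rewrite legacy cedilla characters as their comma-below forms."""
    return text.translate(CEDILLA_TO_COMMA)

print(to_comma_below("a\u015feza"))  # aşeza -> așeza
```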

I therefore urge the CLDR Technical Committee to use the comma-below variants as default settings for the Romanian locale, part of CLDR.

Best regards,

Laurentiu Iancu

59 Disunification of Dandas

Date/Time: Thu Dec 23 22:22:19 CST 2004
Contact: Omi Azad

I vote for disunification. All Indic languages should have their own Danda and Double Danda under their local names.


Date/Time: Thu Dec 23 22:23:02 CST 2004
Contact: Hasin Haider

All Indic languages should have their own Danda and Double Danda by their local names.

Hasin Haider

Date/Time: Tue Jan 25 19:09:43 CST 2005
Contact: Michael Everson

I know it is not a vote, but I do not wish to rehearse all of the details of my position here. So... see them, as arguments for the encoding of script-specific dandas.

60 Proposed Update UAX #9 Bidirectional Algorithm

No feedback received this period.

61 Proposed Update UAX #15 Unicode Normalization Forms

Date/Time: Mon Jan 17 19:57:17 CST 2005
Contact: Philippe Verdy

When looking at the various proposed changes in UAX#15 (Normalization), I noted the following sentence in Annex 6 (Legacy Encodings), rule D5, paragraph 2:

(...) for example, Shift-JIS may have two different mappings used in different circumstances: one to preserve the '/' semantics of 2F(16), and one to preserve the '¥' semantics.

I wonder if this should not be ended instead by:

(...) one to preserve the '\' semantics of 5C(16), and one to preserve the '¥' semantics.

because Shift-JIS normally remaps the ASCII backslash, i.e. U+005C (not the forward slash, i.e. U+002F), in the low single-byte-encoded 7-bit subset, at position 5C(16), to give space to the yen symbol.
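For reference, Python's shift_jis codec happens to follow the backslash-preserving convention, which makes the byte in question easy to inspect:

```python
# Byte 5C(16) is the contested byte: a codec preserving '\' semantics
# decodes it to U+005C; one preserving '¥' semantics would give U+00A5.
assert b"\x5c".decode("shift_jis") == "\u005C"  # backslash convention

# Byte 2F(16), the forward slash, is not remapped under either convention.
assert b"\x2f".decode("shift_jis") == "/"
```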

Date/Time: Tue Jan 25 07:27:57 CST 2005
Contact: Simon Josefsson

Hello. Regarding the PR29 modification part of #61:

This change appears to break backwards compatibility and normalization stability. The PR29 text suggests that the problematic sequences do not occur naturally. My question then is: why break normalization stability over something that doesn't appear to be a practical problem?

Translating my question into a proposal:

Keep the normative part of TR15 as-is, but fix the examples and introduction to match the normative text. Add a note on the NFC/NFKC idempotency, to say that idempotency is the goal, but that for a select few strings it does not hold and that normalization stability was considered more important than theoretical normalization idempotency.

I am not convinced this proposal would be better than what you propose in the long run. However, I am concerned that normalization stability is given so little weight that it is violated even for situations that don't appear to have practical consequences.

Thanks, Simon

Date/Time: Tue Jan 25 11:33:57 CST 2005
Contact: Markus Scherer

Just editorial comments: (quotes from draft, each followed by proposed change; additional words inserted into quote with brackets)

If the visual distinction is stylistic, [then] markup or styling could [be] used to represent the formatting information. - missing "be", suggest to also add "then" to make it clearer

Section references are sometimes italicized, sometimes not.

A character may have a canonical decomposition to more than two characters, but it [is] expressed as... - missing "is"

This analysis can also be used [to] produce more compact code than what is given below. - missing "to"


Date/Time: Sun Jan 30 00:40:12 CST 2005
Contact: SADAHIRO Tomoyuki

I have a comment on the public review #61, "Proposed Update UAX #15 Unicode Normalization Forms"

The discussion in the public review #29 (Normalization Issue) seems to presuppose that any character sequence in question is always in canonical order. It then concludes that the case of i > k [i.e. where both ccc(B) and ccc(C) are non-zero and ccc(B) is greater than ccc(C)] is irrelevant.

But I have the impression that a character sequence is not always in canonical order under the statements of definition D2. In particular, the third paragraph of D2 begins with "if" ("If a combining character sequence is in canonical order, ...").

If the revised definition can be applied to a character sequence that is not in canonical order, the result is that B blocks C when i > k, but this result is wrong. Supposing that ccc(B) > ccc(C) > 0 = ccc(S) and S-C is a precomposed form of S and C, the sequence <S, B, C> is canonically equivalent to both <S-C, B> and <S, C, B>.

Cf. the section 4.2 in UTS #10 ("UCA") notes as follows.

Note: A combining mark in a string is called blocked if there is another combining mark of the same canonical combining class or zero between it and the last character of canonical combining class 0.

According to this note, B does not block C when i > k. I think this note is preferable, except that combining marks differ from non-starters in definition (the ccc of a combining mark may be zero, while that of a non-starter is not).

My suggestions:

(1) Please declare that "any character sequence" subject to D2 must be in canonical order. In other words, D2 must not be applied to any character sequence that is not in canonical order.

(2) If D2 does not presuppose canonical order, please ensure that B must not block C when i > k, i.e., the results are as follows:

ccc(S)=0,ccc(B)=0,ccc(C)=0:  B blocks C
ccc(S)=0,ccc(B)=0,ccc(C)=k:  B blocks C
ccc(S)=0,ccc(B)=i,ccc(C)=0:  B blocks C
ccc(S)=0,ccc(B)=i,ccc(C)=k=i:  B blocks C
ccc(S)=0,ccc(B)=i,ccc(C)=k>i:  B doesn't block C
ccc(S)=0,ccc(B)=i,ccc(C)=k<i:  B doesn't block C
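The i > k case above can be exercised concretely with Python's unicodedata: take S = GREEK SMALL LETTER ALPHA, B = COMBINING GREEK YPOGEGRAMMENI (ccc 240) and C = COMBINING ACUTE ACCENT (ccc 230), so that ccc(B) > ccc(C) > 0 and <S, B, C> is not in canonical order:

```python
import unicodedata

s, b, c = "\u03B1", "\u0345", "\u0301"  # alpha, ypogegrammeni, acute
assert unicodedata.combining(b) == 240
assert unicodedata.combining(c) == 230

# Canonical reordering in NFD moves C before B ...
assert unicodedata.normalize("NFD", s + b + c) == s + c + b
# ... so B must not block C from S: NFC composes S with C, and then
# composes the result with B, giving precomposed U+1FB4.
assert unicodedata.normalize("NFC", s + b + c) == "\u1FB4"
```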

Thank you.

Date/Time: Thu Feb 3 17:20:58 CST 2005
Contact: Philippe Verdy

If I understand UAX #15 correctly, the fix now simply states that a starter character will be blocked from composition with a previous starter character as soon as there is a non-starter combining character between them.

If CGJ is a starter character, then it will effectively be blocked from composition with the first starter character of a default grapheme cluster. So any non-starter combining character appearing after this CGJ will not apply/combine to the start of the combining sequence, even though it will still be part of the default grapheme cluster and the combining sequence.

As a consequence, CGJ blocks reordering of non-starter combining characters even if they don't interact (when they have distinct combining classes). This is effectively used for handling the case of historic Hebrew combining sequences, where CGJ prohibits this reordering, but it still causes a problem for layout renderers, which now must consider the two substrings around CGJ in isolation.

For these cases, blocking composition with CGJ during normalization should not block composition for rendering (notably, the result of the combination may be allowed to differ from rendering a distinct glyph for the previous substring followed by the glyphs representing the non-starter combining characters encoded after CGJ).

The change does not seem to impact the way Hebrew (with or without CGJ) will be processed, normalized or rendered. However, the formulation of the change seems strange because it does not make explicit the only case in which this situation arises ("or higher than" is a bit strange).

You propose to change:

D2. In any character sequence beginning with a starter S, a character C is blocked from S if and only if there is some character B between S and C, and either B is a starter or it has the same combining class as C.


D2'.In any character sequence beginning with a starter S, a character C is blocked from S if and only if there is some character B between S and C, and either B is a starter or it has the same or higher combining class as C.

I would have prefered the following more explicit sentence:

D2'.In any character sequence beginning with a starter S, a character C (after S) is blocked from S (by B) if and only if there is some character B between S and C, so that either: B is a starter; or B is a non-starter and C is a starter; or C is a non-starter that has the same combining class as B.

The difference is in the second alternative added in the "either" list, and the third alternative (which is the case that was already not ambiguous in the existing definition).
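As a sketch (hypothetical code), the proposed D2' can be written directly as a predicate, with unicodedata.combining supplying the combining classes:

```python
import unicodedata

def blocked(seq, s, c):
    """Is seq[c] blocked from the starter seq[s] under the proposed D2'?

    C is blocked from S iff some character B strictly between them is a
    starter (ccc 0) or has a combining class >= ccc(C).
    """
    ccc = unicodedata.combining
    assert ccc(seq[s]) == 0, "S must be a starter"
    return any(ccc(seq[i]) == 0 or ccc(seq[i]) >= ccc(seq[c])
               for i in range(s + 1, c))

# alpha, acute (ccc 230), ypogegrammeni (ccc 240): in canonical order,
# the acute does not block the ypogegrammeni from the alpha.
assert not blocked("\u03B1\u0301\u0345", 0, 2)

# alef, CGJ (ccc 0), dagesh (ccc 21): CGJ, being a starter,
# blocks the dagesh from the alef.
assert blocked("\u05D0\u034F\u05BC", 0, 2)
```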

I also wonder if the list of characters that may cause problems is only those listed in the document. In my view it should list all characters that have (in the main UCD data file, or by the Hangul <L, V> or <LV, T> composition rules) a canonical decomposition mapping to a pair of starter characters not excluded from recomposition (in the CompositionExclusions list of the UCD). Is this list really exhaustive? Shouldn't this precision be made more explicit?

Now comes the tricky case. Does composition blocking still apply between a base starter character and the first non-starter character after it, if there is only a CGJ or some other ignorable character between them, within the same combining sequence?

Now what about Hebrew sequences: <base consonant, meteg, CGJ, dagesh>
- meteg is not blocked from the base consonant but does not compose
- CGJ is blocked from the base consonant by meteg
- dagesh is blocked from the base consonant and from meteg by CGJ
But under the existing ambiguous definition, CGJ could have been unblocked from the consonant (although CGJ never composes with any character).
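The Hebrew situation described here can likewise be checked with Python's unicodedata: meteg (ccc 22) and dagesh (ccc 21) are reordered by normalization unless a CGJ (a starter, ccc 0) stands between them.

```python
import unicodedata

alef, meteg, cgj, dagesh = "\u05D0", "\u05BD", "\u034F", "\u05BC"
assert unicodedata.combining(cgj) == 0  # CGJ is a starter

# Without CGJ, canonical reordering swaps meteg (ccc 22) and dagesh (ccc 21):
assert unicodedata.normalize("NFD", alef + meteg + dagesh) == alef + dagesh + meteg

# With CGJ between them, the starter blocks the reordering:
seq = alef + meteg + cgj + dagesh
assert unicodedata.normalize("NFD", seq) == seq
```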

62 Proposed Update UTS #10 Unicode Collation Algorithm

Date/Time: Thu Jan 20 00:41:00 CST 2005
Contact: Ake Persson

1) The following sentence is not true:

"In Slovak, the digraph ch sorts as if it were a separate letter after c." Should be: "after h".

2) A search for "rearrangement" reveals the need for some additional editing.

Date/Time: Tue Jan 25 18:00:11 CST 2005
Contact: Markus Scherer

In Modifications

This is to provide a better much better default...

- remove first "better"

In 1 Introduction:

In matching, the same can occur, which can have cause significant problems for software customers; ...

- remove "have"

3.1.3 Rearrangement

- this section needs to be removed

In C3:

A conformant implementation that supports backward levels, variable weighting, semi-stability or rearrangement shall do so in accordance with this specification.

- remove rearrangement from the list of features

In 3.2.1 File Format:

Each of the files consists of a version line followed by an optional variable-weight line, optional rearrangement lines, optional backwards lines, and a series of entries, all separated by newlines.

- remove "optional rearrangement lines" if it refers to logical-order rearrangement (I am not sure) - note that the modifications for version 9 say "removed rearrange from the file syntax in 3.2.1 File Format" apparently this should have been removed from the text, not just the syntax, already in version 9?

In Hangul Trailing Weights, Interleaving Method:

If not, then find the least syllable that it is greater than; call that the base syllable.

- Is this not the _greatest_ syllable that it is greater than, otherwise the least syllable is almost always U+AC00?

In 8 Searching and Matching (Informative), DS4b (medial match):

- In the example table, I propose to put the medial example between the minimal and maximal ones. At first, I assumed that the examples would be in "ascending" order, and got confused.

- Similarly, it is probably better to swap DS4a and DS4b.


Date/Time: Sun Jan 30 11:51:06 CST 2005
Contact: Kent Karlsson

The following two changes are needless:

After U+0E24 ฤ THAI CHARACTER RU Insertion of the sequence: U+0E24 ฤ THAI CHARACTER RU + U+0E45 ๅ THAI CHARACTER LAKKHANGYAO

After U+0E26 ฦ THAI CHARACTER LU Insertion of the sequence: U+0E26 ฦ THAI CHARACTER LU + U+0E45 ๅ THAI CHARACTER LAKKHANGYAO

since the following change is also made:


Actually, the latter change should be modified to collate U+0E45 ๅ THAI CHARACTER LAKKHANGYAO just after U+0E32 า THAI CHARACTER SARA AA (since LAKKHANGYAO is just a tallish variant of SARA AA).

Date: Thu, 27 Jan 2005 15:01:01 +0100
Contact: Bernard Desgraupes

Forgive me if this has already been reported or if I'm just misunderstanding, but I think there is a mistake in the description of the algorithm to compute a default collation element for characters with compatibility decompositions.

This is in paragraph 7.3 (Compatibility Decompositions). It says the following (under point 3.) :

3. Set the first two L3 values to be lookup (L3), where the lookup uses the table in §7.3.1 Tertiary Weight Table. Set the remaining L3 values to MAX (which in the default table is 001F):

0028 [*023D.0020.0004] % LEFT PARENTHESIS
0032 [.06C8.0020.001F] % DIGIT TWO
0029 [*023E.0020.001F] % RIGHT PARENTHESIS

In that case, the level 3 weight for character 0032 should be 0004 instead of 001F. So we should have:

0032 [.06C8.0020.0004] % DIGIT TWO

This is corroborated by the already computed value found in allkeys.txt :

2475 ; [*0288.0020.0004.2475][.0E2B.0020.0004.2475][*0289.0020.001F.2475] #

Since I did not see any correction in the beta release of 4.1.0, I thought I'd mention it (I know UTR 10 is just a TR, not an annex, but anyway).
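The allkeys.txt entry quoted above is consistent with the compatibility decomposition of U+2475, which can be confirmed with Python's unicodedata; the middle element is the digit, whose tertiary weight the report argues should come from the lookup (0004) rather than MAX (001F):

```python
import unicodedata

# U+2475 PARENTHESIZED DIGIT TWO decomposes to ( 2 )
assert unicodedata.decomposition("\u2475") == "<compat> 0028 0032 0029"
assert unicodedata.normalize("NFKC", "\u2475") == "(2)"
```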



63 POSIX Data for CLDR

No feedback received this period.

Other Issue: Unicode 4.1.0 Beta Comments

Date/Time: Thu Jan 6 18:42:13 CST 2005
Contact: Markus Scherer

I am working my way through the Unicode 4.1 beta data files and have started an issues list at http://www.mindspring.com/~markus.scherer/unicode/ucd41b-review.html

I will keep this page updated but also keep the issue numbers stable. I will file brief reports once in a while when I add issues.


Date/Time: Fri Jan 14 13:28:31 CST 2005
Contact: Andy Jewell

The file http://www.unicode.org/Public/4.1.0/ucd/Blocks-4.1.0d2.txt reports the new Coptic block as

2C80..2C8F; Coptic

where it should be

2C80..2CFF; Coptic

-- Andy Jewell

Date/Time: Sat Jan 15 09:40:51 CST 2005
Contact: Elmar Kniprath

I should like to make some comments regarding the upcoming new version of the Unicode standard. Here are my remarks and wishes regarding Indic scripts:

I have compared the position of letters in the respective Unicode ranges. Except for Sinhalese, the correspondence between the tables is almost perfect. What I'm missing is a retroflex aspirated RHA in Gurmukhi, abbreviation signs in Gujarati and Gurmukhi (the latter looking like ":") and nukta consonants in Gujarati. I'm aware of the fact that the Gurmukhi RHA is a ligature and can be entered as two characters, but it may be considered as monophonemic and thus should be treated the same way as the Nagari RHA. The Gurmukhi abbreviation sign ought to be placed in U+0A70, a position currently being occupied by TIPPI. By the way: An abbreviation sign for Gujarati would have spared the RUPEE sign U+0AF1. Nukta consonants can generally be generated in OT fonts by typing "consonant + NUKTA", but why are nukta consonants provided for Devanagari and Gurmukhi but not for Gujarati?

These are minor wishes. Of much more importance for me, and I think also for other people, is the lack of a Unicode position for the "inherent vowel" in all scripts derived from Brahmi (except Khmer). I am sure that this has been extensively discussed by the Unicode consortium, and I could imagine that the reason for its rejection might have been that it is not needed for writing these scripts, because it is a zero graph. Certainly the vast majority of users would never use it, but I am convinced that there is a small, but not unimportant, minority of linguistic or indological "egg-heads" who would like to have the possibility of a 1:1 transliteration of Indic texts. This is of use e.g. in compiling a dictionary or in teaching. So I have made a Latin transliteration font which can be used for all Indic scripts. It contains the corresponding Latin glyphs in all Indic Unicode ranges and thus allows easy transliteration just by changing the font. It works for all Indic Unicode-based fonts, including OpenType fonts. As the transliteration must show the inherent vowel, I have inserted a non-spacing invisible mark into each of my Indic fonts immediately after the letter HA (e.g. in Devanagari U+093A). The corresponding fields of my Latin font contain "a". A better position for this "matra a" would of course be immediately before the "matra aa", but this place is already occupied by AVAGRAHA. Of course my Latin font transliterates the consonants KA, KHA, etc. as "k", "kh", etc. This is needed for the transliteration of modern Indo-Aryan languages (except Oriya and Sinhalese), because they often use the full consonants without halant where there is no inherent "a". Example: Hindi समझना (samajhnaa) vs. समझाना (samjhaanaa). The extreme case is Panjabi, which never uses halant and has almost no ligatures.

I would be very grateful if you could reconsider this issue. An invisible "matra a" would not be an inconvenience for any user of Unicode-based fonts, and nobody would be urged to use this sign, just as nobody is urged to use any of the Tamil symbols U+0BF3 to U+0BF8. But there might be fewer users of the latter than of a "matra a".

Best regards
Elmar Kniprath

Date/Time: Wed Jan 19 19:52:41 CST 2005
Contact: Markus Scherer

Pattern_Syntax contains some characters that either have identifier-like properties or numeric values or are compatibility variants of such characters.

I propose to remove the following characters from Pattern_Syntax:

1. The following 4 characters are also in ID_Continue (they are Pc=Connector_Punctuation)

_	U+005F LOW LINE	Connector_Punctuation	Basic_Latin	Zyyy	-	ON	
‿	U+203F UNDERTIE	Connector_Punctuation	General_Punctuation	Zyyy	-	ON	
⁀	U+2040 CHARACTER TIE	Connector_Punctuation	General_Punctuation	Zyyy	-	ON	
⁔	U+2054 INVERTED UNDERTIE	Connector_Punctuation	General_Punctuation	Zyyy	-	ON

2. The following 52 characters are also in Alphabetic:

Circled letters A-Z and a-z

3. Other compatibility variants (circled, parenthesized, etc.) of letters and digits; the digit variants have numeric values. (190 characters)



Date/Time: Wed Jan 19 22:33:53 CST 2005
Contact: Jony Rosenne

I suggest that Unicode adopts the SII proposal concerning Hebrew changes, as per L2/04-429, and delete all Hebrew changes and additions.


Date/Time: Thu Jan 20 04:35:57 CST 2005
Contact: Jonathan Kew

The combining classes assigned to new Arabic vowel marks in the Unicode 4.1 beta are inconsistent. It is suggested that they be revised (in either one of two possible ways, see below) to improve consistency with the existing characters.

Of the six new Arabic vowel mark characters at 0659..065E, four have general "above" or "below" CC values, while two are assigned to the same "fixed position" classes as the most closely-related existing vowel marks. There is no logical reason for this difference; I believe it has arisen accidentally as characters have been proposed at different times.

The two characters 065D REVERSED DAMMA (CC=31) and 065E FATHA WITH TWO DOTS (CC=30) have been assigned the same combining classes as their "standard" counterparts DAMMA and FATHA respectively. I suggest that the same practice should be followed for the other marks, making the following changes to combining class values:

    0659    230 -> 30
    065A    230 -> 30
    065B    230 -> 30
    065C    220 -> 32

This considers the three vowel signs above to be most similar to FATHA, and the one below to be similar to KASRA.

While I believe this would be the most logical change, I accept that the result is still not fully satisfactory. Ideally, ALL Arabic vowel signs above would share the same CC value, as all could potentially interact; similarly all those below. But this is impossible because of the already-assigned classes.
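The practical consequence of these class assignments can be illustrated with canonical ordering (shown here with Latin combining marks, purely as a sketch of the mechanism): normalization reorders adjacent marks with different combining classes, but preserves the typed order of marks sharing a class. So whether a new Arabic vowel sign gets the generic class 230 or a fixed-position class determines whether it can reorder relative to existing marks.

```python
import unicodedata

# Marks with DIFFERENT combining classes reorder under normalization:
# dot below (ccc=220) sorts before dot above (ccc=230) in NFD.
s = "q\u0307\u0323"  # q + COMBINING DOT ABOVE + COMBINING DOT BELOW
assert unicodedata.normalize("NFD", s) == "q\u0323\u0307"

# Marks with the SAME combining class keep their typed order:
# diaeresis and macron are both ccc=230, so no reordering occurs.
t = "u\u0308\u0304"  # u + COMBINING DIAERESIS + COMBINING MACRON
assert unicodedata.normalize("NFD", t) == t
```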

If this change is NOT adopted, then the alternative "regularization" of the new characters should be considered; namely:

    065D    31 -> 230
    065E    30 -> 230

This suggestion is based on the view that it was a mistake to assign fixed-position classes to Arabic vowel marks at all (they should have had the generic "above" or "below" classes), and so all new vowels will be given those classes rather than added to the existing fixed-position ones.

No change that is possible at this point can offer a completely satisfactory outcome, but I believe either of the above suggestions would be an improvement, in terms of overall consistency for implementers, over assigning some new vowels to the generic classes and others to the fixed-position classes.

Date/Time: Tue Feb 1 07:25:54 CST 2005
Contact: Kent Karlsson


The comment on transparent (T) has not been updated to reflect that characters of general category Me are now of joining type T.


Some characters are listed as "ambiguous" although they are of general category Lu or Ll; these should instead be listed as "AL". The same goes for the Roman numeral characters.

U+0085 NEXT LINE should be listed as a mandatory line break character (BK); it does not need a separate line break property (NL).

The following characters should also be listed as mandatory line break (BK) characters (the last three because at least the bidi algorithm considers them paragraph boundaries; similar reasoning applies to VT and FF more generally).

000B;<control>;Cc;0;S;;;;;N;LINE TABULATION;;;;
000C;<control>;Cc;0;WS;;;;;N;FORM FEED (FF);;;;
001C;<control>;Cc;0;B;;;;;N;INFORMATION SEPARATOR FOUR;;;;
001D;<control>;Cc;0;B;;;;;N;INFORMATION SEPARATOR THREE;;;;
001E;<control>;Cc;0;B;;;;;N;INFORMATION SEPARATOR TWO;;;;
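The bidi categories cited above are visible through Python's unicodedata module (reflecting the UCD version bundled with the Python build). A small check, matching the quoted UnicodeData lines: the three information separators carry bidi category B (paragraph separator), while VT and FF do not.

```python
import unicodedata

# INFORMATION SEPARATOR FOUR/THREE/TWO: bidi category B,
# i.e. the bidi algorithm treats them as paragraph boundaries.
for ch in "\u001C\u001D\u001E":
    assert unicodedata.bidirectional(ch) == "B"

# LINE TABULATION (VT) and FORM FEED (FF) are S and WS respectively,
# as in the UnicodeData lines quoted above.
assert unicodedata.bidirectional("\u000B") == "S"
assert unicodedata.bidirectional("\u000C") == "WS"
```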

The following should have line break property BA (compare other dandas):



MICRO SIGN should be listed as Greek. Similarly, the following characters should not be "Common", but Arabic, Hebrew, Latin, or Greek, as appropriate.

2102          ; Common # L&       DOUBLE-STRUCK CAPITAL C
2103          ; Common # So       DEGREE CELSIUS
2107          ; Common # L&       EULER CONSTANT
2109          ; Common # So       DEGREE FAHRENHEIT
210A..2113    ; Common # L&  [10] SCRIPT SMALL G..SCRIPT SMALL L
2114          ; Common # So       L B BAR SYMBOL
2115          ; Common # L&       DOUBLE-STRUCK CAPITAL N
2116..2118    ; Common # So   [3] NUMERO SIGN..SCRIPT CAPITAL P
2124          ; Common # L&       DOUBLE-STRUCK CAPITAL Z
2128          ; Common # L&       BLACK-LETTER CAPITAL Z
212C..212D    ; Common # L&   [2] SCRIPT CAPITAL B..BLACK-LETTER CAPITAL C
212F..2131    ; Common # L&   [3] SCRIPT SMALL E..SCRIPT CAPITAL F
2133..2134    ; Common # L&   [2] SCRIPT CAPITAL M..SCRIPT SMALL O
2135..2138    ; Common # Lo   [4] ALEF SYMBOL..DALET SYMBOL
2139          ; Common # L&       INFORMATION SOURCE


000A;<control>;Cc;0;B;;;;;N;LINE FEED (LF);;;;
000C;<control>;Cc;0;WS;;;;;N;FORM FEED (FF);;;;
000D;<control>;Cc;0;B;;;;;N;CARRIAGE RETURN (CR);;;;
0085;<control>;Cc;0;B;;;;;N;NEXT LINE (NEL);;;;

The abbreviations in parentheses are not part of the control codes' names; they are just abbreviations, and the " (XX)" parts should be deleted.

The abbreviations (for these and other control characters) should be listed in NamesList, not in UnicodeData.

10A40;KHAROSHTHI DIGIT ONE;Nd;0;R;;1;1;1;N;;;;;
10A41;KHAROSHTHI DIGIT TWO;Nd;0;R;;2;2;2;N;;;;;
10A42;KHAROSHTHI DIGIT THREE;Nd;0;R;;3;3;3;N;;;;;
10A43;KHAROSHTHI DIGIT FOUR;Nd;0;R;;4;4;4;N;;;;;

These should be No, not Nd, since they cannot form radix-ten (decimal) digit strings. (This influences other UCD files too.) [Digits 1-4, but no 5-9? Strange! And it does not appear to be base five (non-positional) either. Hm.]

Actually, the SUPERSCRIPT/SUBSCRIPT DIGITs ZERO to NINE should be Nd, since they can form superscript or subscript radix-ten digit strings. (Decimal digit strings consist of digits from the same script; **for the purpose** of interpreting digit strings as decimal numerals, "SUPERSCRIPT" and "SUBSCRIPT" should be considered "scripts", even though that is not the case for other purposes.)
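The Nd/No distinction the comment turns on can be probed with Python's unicodedata module (the values shown reflect the UCD bundled with the Python build, which postdates the 4.1 beta under discussion): SUPERSCRIPT TWO carries a digit value but no decimal digit value, so it is excluded from decimal digit string interpretation today.

```python
import unicodedata

# SUPERSCRIPT TWO (U+00B2): general category No, not Nd.
assert unicodedata.category("\u00B2") == "No"

# It has a digit value (field 7 of UnicodeData)...
assert unicodedata.digit("\u00B2") == 2

# ...but no decimal value (field 6), so decimal() raises ValueError,
# which is what keeps it out of decimal digit strings.
try:
    unicodedata.decimal("\u00B2")
    raise AssertionError("expected ValueError")
except ValueError:
    pass
```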