L2/05-111

Comments on Public Review Issues (Feb 4, 2005 - May 4, 2005)

The sections below contain comments received on the open Public Review Issues as of May 3, 2005, since the previous cumulative document was issued prior to UTC #102 (February 2005). Feedback was received on N closed issues. Also included at the end is other feedback that did not fit into the open issues.

58 Closed Issue: Characters with cedilla and comma below in Romanian language data

Date/Time: Thu Feb 17 11:12:33 CST 2005
Contact: Mirko Janc
Subject: Romanian characters

I am not Romanian, but I have extensive experience with Romanian technical books of all kinds. I used to own (and read) more than 300 Romanian math and related books, all printed in the pre-computer-typesetting era. Officially, the "accent" under "t" and "s" is a comma, not a cedilla. I also used to freelance-proofread, for a prestigious art history journal, a number of Romanian excerpts dating from the period when Cyrillic was used (before the 1850s) up to the present.

Cedilla was a poor man's solution to have something under the letter. Both letters are widely used and occur in many words.

Mirko Janc, PhD
(formerly professor of mathematics)
Chief Publishing Technologist (and TeX expert)
INFORMS
7240 Parkway Drive, Suite 310
Hanover, MD 21076
(443) 757-3577

53 Proposed Draft UTR #33 Unicode Conformance Model

No feedback received this period.

59 Disunification of Dandas

No feedback received this period.

61 Closed Issue: Proposed Update UAX #15 Unicode Normalization Forms

Date/Time: Thu Apr 14 15:16:53 CST 2005
Contact: Erik van der Poel
Subject: tr15-25.html

Annex 6 Legacy Encodings, D5: 0x5C is BACKslash (\), not forward slash (/).

the '/' semantics of 5C → the '\' semantics of 5C

63 POSIX Data for CLDR

(Feedback goes to the CLDR-TC.)

64 Draft UTR #36: Security Considerations for the Implementation of Unicode and Related Technology

Date/Time: Wed Feb 23 15:11:30 CST 2005
Contact: Erik van der Poel
Subject: confusables

Hi Mark, Great work! I thought it would be easier for implementors if the confusables table had Unicode code points in it (e.g. U+1234). This is for: http://www.unicode.org/reports/tr36/confusables.txt

I don't know whether the other table should contain U+1234's: http://www.unicode.org/reports/tr36/idn-chars.html

I just noticed that the main doc says that you can hover over the chars, but the actual doc itself should also point this out (or just include the code points).

You mention addressing the problem at the registry level and a 2nd line of defense in the apps. I don't know whether you've seen my upstream and downstream email on the IDN list, but I suggest that these issues can and should also be considered upstream from the registries, namely at the nameprep level. The mailing list is currently discussing whether to ban, map or warn about homographs, so the Unicode Tech Report might want to just mention that the next version of the nameprep spec might want to address these issues. Of course, you already state that the report is also aimed at standards developers, but an explicit mention here (near the "additional line of defense"?) might be good.

Re: ICANN guidelines, as you know, I think they're way out of whack. Instead of talking about languages, they should at least talk about scripts, as you say, but I think it could be taken a step further, to "writing systems", or indeed, whatever rules a TLD may wish to use e.g. XML-cccccccc.ru where the cccccccc is "documents" in Russian/Cyrillic.

Re: 5 levels, interesting idea! You may not be able to avoid syntax characters even at level 5, since nameprep already allows some of those, and the IDN list is currently discussing whether it will ever be possible to change nameprep in such a way that a new ACE prefix would be required. Banning syntax chars would require a new ACE prefix. The registries have to follow the nameprep RFC in order to be compatible with the apps, so the registries cannot go to level 5 if it prohibits syntax chars (unless we change nameprep). The user agent is another matter; it can display such things specially.

Maybe the 5 levels should be outside the User Agent section, since the Registries section refers to it.

I don't know if you want to add a reference to nameprep: http://ietf.org/rfc/rfc3491.txt

IDNA is 3490 and Punycode is 3492.

May want to include a reference to this: http://secunia.com/multiple_browsers_idn_spoofing_test/

I like the original title Unicode Security Considerations.

Re: syntax characters, the IDN mailing list has been discussing these recently, and I started a table at nameprep.org, calling them "delimiters" but maybe "syntax" is a better word?

the worlds characters → the world's characters
a background info → background info
and as was → as was
breech → breach
san-serif → sans-serif
serifed → seriffed?
non-ASCII character → non-ASCII characters
allow allowing → allow
or may contain → may contain
fonts increasing → fonts increasingly
different in shape that → different in shape from

'The term "Registry" is to be interpreted broadly.' kinda out of the blue here, how about prefixing that with "In the following, "

StringPrep RFC XXX → 3454
stringprepped form → nameprepped form
IDN chars, after stringprep is performed → after nameprep is performed?
Eric van der Poel → Erik van der Poel (thanks)

Date/Time: Sun May 1 01:28:47 CST 2005
Contact: Mati Allouche
Subject: TR36 corrigendum

1) In section 2.2, the table titled "Cross-script Spoofing" has irrelevant comments (remainders of copying the table in section 2.1).

2) In section 2.3, the table titled "Spoofed Domain Names" has incorrect comments for case 5a (should refer to z, not to o).

3) In 2.10 "Recommendations", "because is" → "because it is"

4) In section 2.10.3, item B.3, it is not clear to which condition the "otherwise" clause refers. If it is to "If the confusable data is available", then this clause should be at the same level (C).

5) In Appendix B, "Constutuency" → "Constituency"

Date/Time: Sun May 1 01:43:28 CST 2005
Contact: Mati Allouche
Subject: TR36 comment

In section 2.10.3, item 2 of the To Do list: the process is not necessarily valid. For example, assume the logical Bidi string "CENTRAL.123", which displays as "123.LARTNEC". Applying reverse Bidi to that will give (assuming LTR paragraph level as prescribed for URLs) "123.CENTRAL", which does not match the original. However, the original string is perfectly valid (and is inspired from a real-life case). I suggest that the recommendation should be that the registrar should check if there exist registered names with the same display as the new string, and reject the new string if yes, and/or recommend that a registrant always register all logical strings which have the same display.

65 Encoding of Devanagari Eyelash Ra

No feedback received this period.

66 Encoding of Chillu Forms in Malayalam

Date/Time: Thu Mar 17 03:59:17 CST 2005
Contact: vinod @ arya.ncst.ernet.in
Subject: Encoding Chillu in Malayalam-PR #66

> Public Review Issue #66
> Encoding of Chillu Forms in Malayalam
>
> In The Unicode Standard, Version 4.0, chillu forms and explicit virama forms
> are distinguished in text by the use of joiners. For example:
>
> <NA, VIRAMA, ZWNJ> to represent a NA with a visible virama
>
> <NA, VIRAMA, ZWJ> to represent the chillu N
>
> However, the standard is not clear about what <NA, VIRAMA>(without any joiner)
> represents.

The ZWNJ and ZWJ in the examples shown here are for entering the three-level precedence hierarchy of conjoining consonants. It is not for claiming that <NA, VIRAMA, ZWJ > is the only way to generate a Chillu-N. The Unicode standard (4.0 Chap 9.1 page 222 and 223) clearly expresses the precedence order for conjoining consonants and how the user is given the control to choose a representation lower in the order. Most recently the PR 37 on Clarification and Consolidation of the Function of ZWJ in Indic Scripts (Peter Constable 2004-06-30 - Page 14) has made this crystal clear.

The Unicode standard is quite clear that <Na, Virama> without ZWJ can be shown as ChilluN if the form is available. Section 9.9 Malayalam (Unicode 4.0 Page 249) explicitly states "Five sonorant consonants merge with the virama when they appear in syllable-final position with no inherent vowel". The example below the statement of the three forms of <Na Virama Ma> are for showing how the ZWJ and ZWNJ can be employed for selecting a form lower in the order. People who jump to the conclusion that the Chillu-N can only be formed from <Na Virama ZWJ> should also see Figure 9-5 Half-Consonants (Unicode 4.0 Page 223) and insist that the half-form of Ka can only be formed by <Ka Virama ZWJ>. In Devanagari, the half-form cannot be formed when the <Ka Virama> is at the syllable-final position. Whereas, the chillu form in Malayalam can be formed even at the syllable-final position, not to speak about other positions. The Page 249 statement along with other statements implies this.

> Some readers have interpreted the standard such that this sequence
> represents NA with a visible virama, others have interpreted it to be
> the chillu N, and yet another possibility is that the sequence does not
> specify one or the other form (i.e. the rendering system can choose).

The Unicode 4.0 Standard and the accepted PR 37 imply that the rendering system has to show <Na Virama> (without ZWJ) as Chillu-N first and if it is not available then with explicit Virama. The rendering system should not show the sequence as illegal or with explicit virama if the chillu-form is available.

> The following 5 chillu forms in modern use have been proposed:
>
> MALAYALAM LETTER NN
> MALAYALAM LETTER N
> MALAYALAM LETTER RR
> MALAYALAM LETTER L
> MALAYALAM LETTER LL
>
> There is at least one chillu form no longer in modern use, for KA.
>
> If chillu forms are not explicitly encoded, the standard at least needs to be
> clarified such that specifically represents one or the other form.

From: Cibu
Date: 2005-03-22 18:30:55 -0800
Subject: on Public Review Issue #66: Encoding of Chillu Forms in Malayalam

Hi,

Since Chillu-NA and NA + visible VIRAMA can give different meaning to a word, we cannot let the rendering system choose. Therefore, here are my preferences in the decreasing order:

1) Explicitly encode Chillu characters. Various issues are discussed in detail below.

2) <NA, VIRAMA> (without any joiner) should be mapped to NA with a visible Virama, because this will enforce uniformity. That is, Consonant + VIRAMA will form a visible Virama symbol, irrespective of whether the consonant is capable of forming a Chillu or not. For example, SA + VIRAMA and NA + VIRAMA will both have a visible Virama symbol.

Issues in current representation of a Chillu letter as Consonant + Virama + ZWJ

1) ZWJ and ZWNJ are supposed to be font directives, directing a font to select from two or more semantically same renderings. In case of Malayalam, this is no longer true. ZWJ becomes an alien language construct introduced to Malayalam by Unicode to produce Chillu letters. Thus, it is possible to produce two semantically different words, which differ only by ZWJ in their Unicode representation. Example: അവന്‍ (avan – meaning 'he') & അവന്‌ (avan~ - meaning 'for him')

2) When a word is searched in Unicode text, the search algorithm should ignore ZWJ & ZWNJ because it should not care about the rendering of the word. From the first reasoning, this does not hold good for Malayalam. However, if search algorithm does not ignore ZWJ & ZWNJ, then it surely is going to miss some words, which are semantically same but rendered differently by using/omitting ZWJ/ZWNJ.
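The search problem described in points 1) and 2) can be illustrated with a minimal Python sketch. The two strings are built from the sequences discussed in this issue (<NA, VIRAMA, ZWJ> for the chillu and <NA, VIRAMA, ZWNJ> for the visible virama); the helper names `strip_joiners` and `code_points` are hypothetical and not part of any standard API.

```python
ZWJ = "\u200D"   # ZERO WIDTH JOINER
ZWNJ = "\u200C"  # ZERO WIDTH NON-JOINER

# MALAYALAM LETTER A, VA, NA, SIGN VIRAMA, then a joiner.
STEM = "\u0D05\u0D35\u0D28\u0D4D"
avan_chillu = STEM + ZWJ    # "avan" ('he'): chillu N requested with ZWJ
avan_virama = STEM + ZWNJ   # "avan~" ('for him'): visible virama requested with ZWNJ


def code_points(text):
    """Return the code points of text in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]


def strip_joiners(text):
    """Remove ZWJ and ZWNJ, as a joiner-insensitive search would."""
    return text.replace(ZWJ, "").replace(ZWNJ, "")


if __name__ == "__main__":
    print(code_points(avan_chillu))
    print(code_points(avan_virama))
    # The two words differ only in the final joiner ...
    print(avan_chillu == avan_virama)
    # ... so a search that ignores joiners can no longer tell them apart.
    print(strip_joiners(avan_chillu) == strip_joiners(avan_virama))
```

Under this encoding, a joiner-insensitive comparison conflates the two words, while a joiner-sensitive one misses spellings that differ only in rendering hints.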

3) Chillu of a consonant is different from its C1-conjoining form without inherent അ (A).

3.1) Phonetic differences

Consider the combination Vow + CC + Con, where:
Vow - a vowel
CC - a consonant capable of forming Chillu
Con - a consonant

When CC takes its Chillu form, it joins more closely with Vow. This produces a noticeable small stop between CC and Con.

When CC takes its C2/C1-conjoining form without inherent അ (A), it is pronounced closer to Con.

Examples: ഉണര്‍വ്‌ ഉണര്വ്‌ (unlike its pair, not a meaningful word) കല്‍വിളക്ക്‌ വില്വാദ്രി കണ്‍വട്ടം കണ്വന്‍

4) Chillu of a consonant can be treated as Anusvara

A. R. Raja Raja Varma states in his Keralapanineeyam (which is the foremost grammar book of Malayalam) "Anusvara is the Chillu of MA". Thus, we can say that Malayalam has more than one Anusvara. There is an Anusvara for MA; there is an Anusvara for NA, NNA, LA etc. This is essentially the same as saying that Malayalam has a number of Chillus, which include MA, NA, LA etc.

If we look closely, the phonetic rules are also the same for Anusvara and the other Chillus, most importantly the half-stop property (please see Appendix A) when it occurs in the middle of a word. Examples:

സംയുക്തം സാമ്യം കല്‍വിളക്ക്‌ വില്വാദ്രി കണ്‍വട്ടം കണ്വന്‍

Essentially this means Unicode should do either of: 1. Include separate character locations for Chillu characters - solves the confusion of ല്‍ (Chillu of LA/TA) (see below) - Addresses above mentioned Chillu representation issues 2. Allow Anusvara to be encoded as MA + Virama + ZWJ - does not change existing encoding for Chillu - does not address previously explained Chillu representation issues

Background
----------

A) Overloading of visible Virama in Malayalam

Following are its functions: A.1) at end of a word, it acts as quarter vowel ഉ (U). Example: അവന്‌ (avan~) A.2) In the middle of a word, it means the consonant before is forming a conjunct with consonant after. Example: ശബ്‌ദം (Sabdam) In this context, it does not produce any sound what so ever. Functionality-(A.2) has been overloaded with this grapheme when typesetting friendly new orthography has been introduced. Unicode recognizes functionality-(A.2) alone with visible Virama of Malayalam. This contributes to the problem that Unicode representation of അവന്‍ (avan) & അവന്‌ (avan~) being different only by ZWJ/ZWNJ.

B) Evolution & Confusion of ല്‍ (Chillu LA/TA) For Sanskrit words used Malayalam, ത (TA) is pronounced as it is, only when a vowel or semi-vowel comes after it. For all other occasions, it is pronounced as ല (LA).

An example would be ഉത്സവം (ulsavam). Even though, it's Sanskrit originated form is ഉത്‌സവം (uthsavam), it is pronounced in Malayalam as ഉല്‌സവം (ulsavam).

This means, Chillu form of ത (TA) should be pronounced as if it is Chillu form of ല (LA). Thus, ല്‍ (chillu LA/TA) is in a very curious situation:

B.1) Grapheme level: Graphically it is Chillu of ത (TA).
B.2) Character level: It can represent either of the characters ത (TA) or ല (LA).
B.3) Phoneme level: Its pronunciation is the Chillu of ല (LA).

Reference: കേരളപാണിനീയം (kEraLapaaNineeyam), പീഠിക (peeThika) - A. R. Raja Raja Varma

thanks,
Cibu

-- More about me: http://www.blogger.com/profile/1246232

Date/Time: Tue May 3 18:31:33 CDT 2005
Contact: Antoine Leca

We need to step back further than what is presented in the issue in order to understand it a bit better. I apologize for the lengthy explanation, and for the lack of definitive, clear-cut answers.

The model for the use of joiners in Indic conjuncts in the framework of The Unicode Standard has been designed and refined over the years. (This discussion is drawn from the very good exposition by Peter Constable, whom I want to thank here, in a paper made available as Public Review 37, Spring of 2004; this paper discusses the “other” scripts, where it is more often C2 which changes form; yet the discussion is worth considering here, because the founding Devanagari as well as the Malayalam script under study behave the same in this respect.) It considers three variations in the way a conjunct can be rendered:

a. a specific glyph is used for the conjunct;
b. a generic form (traditionally called half-consonant in Devanagari and by extension in the other scripts) is used for the dead C1 consonant, and the normal form for C2 is used;
c. the dead C1 is shown with a visible mark (called halant हलन्त in Hindi, candrakkala ചന്ദ്രക്കല in Malayalam), and the normal form for C2 is also used.

It also distinguishes three sequences for a conjunct, in order:
    1. <C1, VIRAMA, C2>
    2. <C1, VIRAMA, ZWJ, C2>
    3. <C1, VIRAMA, ZWNJ, C2>

Each of these sequences expresses a restriction over the preceding one. That is, when sequence 1 is used (by the writer), the rendering engine should use the first available of the three ways: that means no restriction, and the most appropriate form is to be used. When the third kind of sequence (using ZWNJ) is used, only the c form is acceptable. Up to this point, this is the basic model (as described in The Unicode Standard, version 1.0, volume 1, 1991), and it is in accordance with just about any other use of the joiners. Applied to Malayalam, this means that if a conjunct exists (as in the prototypical N.MA ന്മ example), the sequence <U+0D28, U+0D4D, U+0D2E> should show it, while the sequence <U+0D28, U+0D4D, U+200C, U+0D2E> should render ന്മ instead, using the default way to render a conjunct. On the other hand, when there is no specific glyph for the conjunct, as for example in the case of L.THA ല്ഥ (a contrived example of a meaningless conjunct that is not expected to appear), the rendering will always be the same, using candrakkala ചന്ദ്രക്കല.

Thereafter, the various evolutions of The Unicode Standard introduced the intermediary step, the sequence of kind 2, with ZWJ. And the assigned meaning was to restrict to only the two latter renderings, or in other words to disallow the use of a specific glyph to render the conjunct (at least, as it appears printed.)
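The fallback model described so far (three rendering forms a, b, c and three sequence kinds, each more restrictive than the previous one) can be sketched as follows. This is only an illustration of the rule as stated in this comment, not of any actual rendering engine; the function name `choose_form` and the representation of a font as a set of available forms are assumptions made for the example.

```python
# Rendering forms, in order of preference:
#   "a" = specific conjunct glyph
#   "b" = generic form of the dead C1 (half-consonant; cillu in this model)
#   "c" = C1 with a visible virama (assumed to be always available)
ALLOWED_FORMS = {
    None:   ("a", "b", "c"),  # <C1, VIRAMA, C2>: no restriction
    "ZWJ":  ("b", "c"),       # <C1, VIRAMA, ZWJ, C2>: no specific glyph
    "ZWNJ": ("c",),           # <C1, VIRAMA, ZWNJ, C2>: visible virama only
}


def choose_form(joiner, available):
    """Pick the first permitted form that the (hypothetical) font provides.

    joiner    -- None, "ZWJ" or "ZWNJ", the character following VIRAMA
    available -- set of forms ("a" and/or "b") the font has for this pair
    """
    for form in ALLOWED_FORMS[joiner]:
        if form == "c" or form in available:
            return form
    return "c"  # not reached: every sequence kind permits form c


if __name__ == "__main__":
    font = {"a", "b"}  # a font with both a conjunct glyph and a generic form
    for joiner in (None, "ZWJ", "ZWNJ"):
        print(joiner, "->", choose_form(joiner, font))
```

Note that in this model a joiner can only move the choice down the a, b, c list, never up it.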

Let us pause briefly at this scheme. One striking point here is that it goes against the intended use of ZWJ: ZWJ would logically be used to request a closer conjunct, that is, a representation that occurs earlier in the a b c list. Such a case occurs in modern Devanagari, with conjuncts like ट्ट TT.TTA or ङ्ख NG.KHA, which are nowadays often shown with visible halant, while the traditional Sanskrit form is to use a stacked conjunct. As a result, the simple sequence (kind 1) <U+091F, U+094D, U+091F> is usually rendered by glyphs as in scheme c, and neither ZWNJ nor ZWJ (under the current rules) would have any effect, since c is already the most restricted option. And there is no obvious way to request a stacked conjunct, that is, a representation according to scheme a, with a dedicated glyph.

Beyond this particular case, this scheme appears to work for Devanagari (at least when one does not try to re-use the same character for another meaning, as happened with the so-called eyelash-ra).

Here we need a second digression. In schemes b and c, the C2 consonant is always unmodified with respect to its standalone form, that is, the form it would have if it stood outside any conjunct. As a result, the same process can be applied when there is no C2 consonant, particularly at the end of a word. However, there is an important difference here: the a scheme (the specific glyph) in such a case is… the c scheme, to use the halant हलन्त! So in such a case, the sequence <C1, VIRAMA, ZWJ> should be understood as a request to use scheme b (if available). [In fact, this is historically how ZWJ was introduced into this game: it was equated with ISCII-88 character DB, INV, the invisible consonant (D9 in ISCII-91), whose effect as the C2 consonant in a conjunct is to force the use of scheme b, lacking any “specific glyph” proper to a scheme a. Yet ISCII INV has many other possible uses, many of which are not achievable using ZWJ in today’s Unicode.]

The whole scheme was also thought to work for Malayalam, considering the cillakṣaram ചില്ലക്ഷരം in a similar way as Devanagari half-consonants, in such a way that they are not used if a specific glyph exists, but they are preferred over the use of the visible candrakkala ചന്ദ്രക്കല. That is, use of cillakṣaram ചില്ലക്ഷരം for the C1 consonant is preferred over the use of candrakkala ചന്ദ്രക്കല, but a specific conjunct is definitively better. This axiom was established as the unwritten rule to render Malayalam.

When it comes to the final position in a word (the principal use of the cillakṣaram ചില്ലക്ഷരം), the same general rule could be used: sequence 2 and 3 restrict to less usual sequences.

As a result, we have the following rules for rendering:

To answer the specific question of the issue, under this scheme, the sequence <U+0D28, U+0D4D> is either a part of a conjunct (like in N.MA ന്മ), or it is the cillu n ന്.

While this explanation might appear clear and logical exposed this way, the mere fact there is this issue shows it did not succeed; I believe there are three reasons for this lack of success.

First, while the schemes for rendering Devanagari or Tamil were explained in detail as early as 1992, it was not before 2001 that The Unicode Standard, then at version 4.0, cared to explain how Malayalam “worked”; furthermore, at the same time an agency of the Kerala government published another, quite distinct, standard about the use and rendering of Malayalam using the Unicode encoding as a basis; and to add more confusion to the case, a leading operating system publisher studied in about the same time frame its own “solution” to deal with the Malayalam script, a solution which, when it was published in 2004, appeared to be neither the scheme explained above nor the scheme proposed in Kerala. Given this confusion, it is understandable that other people who are trying to implement Malayalam rendering are either asking for support, proposing inadequate solutions, or simply postponing the development.

A second important point is that the scheme above was conceived as a derivation of the Devanagari case; of course, this was to be expected, since ISCII is also heavily based on the rules for Devanagari, and the scholarly material available to the “Western” experts is often biased toward Devanagari too. This would not be too much of a problem if it did not have two important consequences:

- First, cillakṣaram ചില്ലക്ഷരം are not shown in the exposition the way speakers of Malayalam see them. The above exposition might lead a casual reader to think that I consider cillakṣaram ചില്ലക്ഷരം to be an equivalent of Devanagari half-consonants (those without the right leg), but they would be wrong: I just observe that cillakṣaram ചില്ലക്ഷരം behave the same inasmuch as they are preferred to the halant form, but a specific conjunct would be preferred; and as such a similar mechanism can be conceived. I do not see more similarities; for example, I know that the glyphs for the cillus are expected to be seen on a keyboard layout, but the writer would not expect such a key to be the equivalent of <C, U+0D4D>. I am really speaking about a deficit of explanations and more generally of attention toward the native audience.

- Then, the plain form of cillakṣaram ചില്ലക്ഷരം is a 3-codepoint sequence; since they are often used in Malayalam, this creates a clear overhead that goes against the very nature of the virama model (which is based on the observations that the a vowel is by far the most frequent, and that conjuncts are rarer than simple consonants: this makes ISCII and hence Unicode quite economical encodings for the Indian languages).

The result is that the process is difficult to understand unless one is familiarized with Hindi rendering (!), and it ignores basic facts such as the existence of keys that create directly the 3-codepoint sequences (to be sure to represent the cillakṣaram ചില്ലക്ഷരം, even if it should not be the sequence to use, as we will see shortly).

The third reason is another process that is occurring meanwhile. For about 40 years, the government of Kerala has been trying to promote a reform of the script. The basis for this reform was to reduce the number of ligatures; the impact is particularly important in two areas: education and the printing industry (it was said that 900 glyphs were needed to print Malayalam; of course this was more of a concern with lead typography than it is now; yet it is still easier to create a 150-glyph electronic font than one with 900!). I cannot form a definitive judgment about whether or not the reform will finally succeed, some 40 years from now. But for the present generation as a whole, which learned the traditional style at school and is using Unicode right now, there is a clear need to be able to deal with the two forms. Of course one can encounter zealots in both camps; but I feel Unicode should ignore them and provide a solution that “works everywhere”.

And in such a context we again encounter the problem I described above, about the loss of the original function of the ZWJ, to ask for a more compact yet irregular rendering. Consider Y.K.KA യ്ക്ക and L.K.KA ല്ക്ക, two conjuncts that are neither common nor rare; under the traditional style, they were represented with a stacked conjunct, the C1 consonant in nominal form and below it, a subjoined form of the K.KA ക്ക conjunct, usually lacking the top part.

With the reform, both conjuncts are declared obsolete. For the former, this means that <U+0D2F, U+0D4D> is considered apart and rendered with candrakkala ചന്ദ്രക്കല, as യ്, followed by the K.KA conjunct ക്ക; on any Malayalam keyboard layout, the typing order will stay the same, using five keys; in any case, the resulting encoding will be compatible with the traditional rendering: it only needs to be rendered with a font that has the old conjunct. But for the latter, <U+0D32, U+0D4D> will be rendered as ല്, cillu l; and this glyph will be independently present on keyboards, so a writer can legitimately enter it using the key (rather than the two-key sequence LA ല then candrakkala ചന്ദ്രക്കല); and the software solution will then insert a ZWJ inside the conjunct (in order to keep the inferred intention of the author to use a cillakṣaram ചില്ലക്ഷരം), when there is no such intention; worse, lacking the view of the result while using traditional fonts, the author will not notice anything wrong: after all, there is no difference in the reformed style between <0D32, 0D4D, 0D15, 0D4D, 0D15> and <0D32, 0D4D, 200D, 0D15, 0D4D, 0D15>. [I do not know if this is a typo, but I have seen such a distinction in a small glossary, written in traditional style: അയല്ക്കാരന് “neighbour” uses a cillakṣaram ചില്ലക്ഷരം, similar to the base word അയല്, while പാല്ക്കാരന് “milkman” does not and uses a complex conjunct.] This problem occurs because there is no way to indicate whether a given cillu could be “swallowed”, or should not be, into a larger conjunct, if presented in a different context.

It is important to note here that such a concern has NOT been studied by the Kerala IT Mission, since they restricted their field to the reformed style.

If the present model did not succeed, what could happen with the “solution” to encode five more codepoints?

First, it should be clear that these new codepoints introduce a whole new set of possible combinations, so it might be expected that all the cases described here could be covered. Yet the resulting complexity will not allow easy implementations, at least for a complete solution that covers both traditional and reformed styles. Of particular interest here is the fact that the new codepoints are a new kind of animal: they are not consonants, but they are not completely dead consonants either (since there exists at least one stacked conjunct, <U+0D28, U+0D4D, U+0D31>, pronounced /nṯ/, shown as cillu n on top of ṟa, which should probably be encoded using the new codepoint).

Then, since the new codepoints will somewhat replace the present <C, VIRAMA, ZWJ> sequence, one should define rules to recode existing text; the same rules could also be implemented in the rendering engines, in order to deal with the deprecated sequences; as it is implied by the text of the issue, the difficulty here would be to identify common rules among the current implementations and the uses that could have been done of them.

Another difficulty is that to date, the only usable proposition for the rendering rules, is the one which was issued by the Kerala IT mission. But as I said above, it fails to address the issue of the traditional script, which is a substantial problem of its own.

Introducing new codepoints will clearly help with the first problem, as it could be seen as an acceptance of the 2001 proposal from Kerala (or an improvement over it). Since it would make Malayalam substantially different from Devanagari, this could also be seen positively. Regarding the second problem, right now nothing can be said: it depends entirely on the rules that are to be fixed regarding the handling of the unobvious sequences, or equivalently the ways to encode the special cases. The third problem is entirely open at this point: I believe these new codepoints could be very well adapted to reformed Malayalam; but I am not so sure they will allow every other text in traditional style to be encoded; and I doubt they will allow a text composed by a writer who only knew the reformed script to be legibly displayed in traditional style.

67 CLDR Version 1.3 Beta

(Feedback goes to CLDR-TC.)

68 Proposed Update UTS #10 Unicode Collation Algorithm

Date/Time: Sat Apr 9 02:59:28 CST 2005
Contact: Ake Persson
Subject: allkeys-4.1.0

Current order:
0CB9 ; [.19F4.0020.0002.0CB9] # KANNADA LETTER HA
0CBD ; [.19F5.0020.0002.0CBD] # KANNADA SIGN AVAGRAHA
0CB3 ; [.19F6.0020.0002.0CB3] # KANNADA LETTER LLA
0CDE ; [.19F7.0020.0002.0CDE] # KANNADA LETTER FA
0CBE ; [.19F8.0020.0002.0CBE] # KANNADA VOWEL SIGN AA

Expected order:
0CB9 ; [.19F4.0020.0002.0CB9] # KANNADA LETTER HA
0CB3 ; [.19F5.0020.0002.0CB3] # KANNADA LETTER LLA
0CDE ; [.19F6.0020.0002.0CDE] # KANNADA LETTER FA
0CBD ; [.19F7.0020.0002.0CBD] # KANNADA SIGN AVAGRAHA
0CBE ; [.19F8.0020.0002.0CBE] # KANNADA VOWEL SIGN AA
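The difference between the two listings can be checked mechanically. The following is a minimal Python sketch that parses only the simple single-element entries quoted above (it is not a general allkeys.txt parser); the names `ENTRY`, `primary_weights`, `order` and `changed` are hypothetical helpers introduced for this illustration.

```python
import re

# Matches lines of the form quoted above:
#   0CB9 ; [.19F4.0020.0002.0CB9] # KANNADA LETTER HA
ENTRY = re.compile(
    r"^(?P<cp>[0-9A-F]{4,6})\s*;\s*"
    r"\[\.(?P<l1>[0-9A-F]{4})\.[0-9A-F]{4}\.[0-9A-F]{4}\.[0-9A-F]{4,6}\]"
    r"\s*#\s*(?P<name>.+)$"
)

CURRENT = """
0CB9 ; [.19F4.0020.0002.0CB9] # KANNADA LETTER HA
0CBD ; [.19F5.0020.0002.0CBD] # KANNADA SIGN AVAGRAHA
0CB3 ; [.19F6.0020.0002.0CB3] # KANNADA LETTER LLA
0CDE ; [.19F7.0020.0002.0CDE] # KANNADA LETTER FA
0CBE ; [.19F8.0020.0002.0CBE] # KANNADA VOWEL SIGN AA
"""

EXPECTED = """
0CB9 ; [.19F4.0020.0002.0CB9] # KANNADA LETTER HA
0CB3 ; [.19F5.0020.0002.0CB3] # KANNADA LETTER LLA
0CDE ; [.19F6.0020.0002.0CDE] # KANNADA LETTER FA
0CBD ; [.19F7.0020.0002.0CBD] # KANNADA SIGN AVAGRAHA
0CBE ; [.19F8.0020.0002.0CBE] # KANNADA VOWEL SIGN AA
"""


def primary_weights(table):
    """Map each character name to its primary weight."""
    weights = {}
    for line in table.strip().splitlines():
        match = ENTRY.match(line.strip())
        if match:
            weights[match.group("name")] = int(match.group("l1"), 16)
    return weights


def order(table):
    """Character names sorted by primary weight."""
    weights = primary_weights(table)
    return sorted(weights, key=weights.get)


def changed(current, expected):
    """Names whose primary weight differs between the two tables."""
    cur, exp = primary_weights(current), primary_weights(expected)
    return sorted(name for name in cur if cur[name] != exp.get(name))


if __name__ == "__main__":
    print(order(CURRENT))
    print(order(EXPECTED))
    print(changed(CURRENT, EXPECTED))
```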

Date/Time: Tue Apr 12 12:48:51 CST 2005
Contact: Tex Texin
Report Type: Error Report
Subject: TR10

Hi,

From the online version of tr10:

1) "3.1.1.3 Other Multiple Mappings Certain characters may both expand and contract: see Section 5.17 Sorting and Searching."

There is no 5.17. I think this should perhaps be changed to sec. 8 on searching and matching.

2) in Sec. 3.1.3 Rearrangement there is a reference to: the Logical_Order_Exception property which links to http://www.unicode.org/Public/UNIDATA/PropList.html

This links to a referral page and should be changed to reference UCD.html perhaps- http://www.unicode.org/Public/UNIDATA/UCD.html#Logical_Order_Exception

The referral page http://www.unicode.org/Public/UNIDATA/PropList.html  would be friendlier if the ucd.html reference was a link to the ucd.html page. (As well as having the logo, header, and some other links to find your way around the site...)

hth tex

Date/Time: Thu Apr 14 08:11:28 CST 2005
Contact: SADAHIRO Tomoyuki
Subject: completely ignorable versus level 3 ignorable

My feedback concerns PRI#68 (Proposed Update UTS #10).

My question is about the difference between [.0000.0000.0000.0000] and [.0000.0000.0000.nonzero].

Removal of S1.3 should suppose that CGJ (whose collation element is [.0.0.0.nonzero]) is completely ignorable.

If so, change #3 for Data tables version 4.1.0 about U+0600 ARABIC NUMBER SIGN and U+2062 INVISIBLE TIMES and like characters makes no sense, does it? Because these characters must still be completely ignorable after the change from [.0.0.0.0] to [.0.0.0.nonzero].

If CGJ is level 3 ignorable rather than completely ignorable, I think S1.3 may be able to be retained; in such a condition, control-A and like characters whose collation element is [.0.0.0.0] will be removed at S1.3 (before S2.1), while CGJ and like characters whose collation element is [.0.0.0.nonzero] will be removed at S3.5 or S3.7 (after S2.1).
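The distinction in question can be made concrete with a small Python sketch. It uses the terminology of this comment ("completely ignorable" for [.0.0.0.0], "level 3 ignorable" for [.0.0.0.nonzero]); the function names are hypothetical, and the fourth-level value 034F in the example is only an illustrative nonzero weight.

```python
def parse_element(text):
    """Parse a collation element such as '[.0000.0000.0000.034F]' into four ints."""
    parts = [p for p in text.strip().strip("[]").split(".") if p]
    if len(parts) != 4:
        raise ValueError(f"expected four weights, got {text!r}")
    return tuple(int(p, 16) for p in parts)


def classify(element):
    """Classify a collation element using the terms of this comment."""
    l1, l2, l3, l4 = element
    if (l1, l2, l3, l4) == (0, 0, 0, 0):
        return "completely ignorable"
    if (l1, l2, l3) == (0, 0, 0):
        return "level 3 ignorable"  # zero at levels 1-3, nonzero at level 4
    return "not ignorable at levels 1-3"


if __name__ == "__main__":
    for text in ("[.0000.0000.0000.0000]",
                 "[.0000.0000.0000.034F]",
                 "[.19F4.0020.0002.0CB9]"):
        print(text, "->", classify(parse_element(text)))
```

Under this reading, a step that removes only completely ignorable elements would drop the first example and keep the second.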

Thank you. SADAHIRO Tomoyuki

Date/Time: Wed Apr 20 09:56:47 CST 2005
Contact: Pierre Clavel
Subject: Collation algorithm (tr10-13)

Very interesting report. I have two comments:

1) As you say in 1.8.7, differences in collation are not written in stone. Often imposed by early systems, simplifications should be welcome. Local rules are not always applied. For instance, my copy (of 1993) of the (probably) most widely used French dictionary, 'Le petit Robert', does *not* use the French accent ordering. More surprising, my copy (of 1997) of a German-French dictionary (Pons / Weis & Mattutat, made by Germans primarily for a German audience) uses the 'telephone' ordering. The fact that German has two sets of sorting rules might actually be temporary. There is a debate among German libraries to change the ordering of catalogues from 'dictionary' to 'telephone'.

2) Another cause for customization you could mention is context or kind of data. In example 1.6, 'di Silva' > 'diSilva' because it is e.g. an author index. In a title index however, one would rather expect 'di Silva' < 'diSilva' (since the system cannot know whether 'di' is a particle or a preposition and few would bother encoding a zero-width space here). In this respect, I don't really understand why you defaulted the variable weighting to 'shifted' rather than 'non-ignorable'.

Regards Pierre Clavel
Swiss National Library

Date/Time: Thu Apr 21 11:27:19 CST 2005
Contact: Ake Persson
Subject: allkeys-4.1.0d9.txt

U+0E2F, U+0E46 should be sorted before U+0E4F (UTS#10 4.f.).

Similar for U+0EAF, U+0EC6.

What about U+0EDC, U+0EDD in combination with U+0EC0..U+0EC4? Reference: L2/03-185R.


Other Feedback

Date/Time: Thu Apr 7 05:11:54 CST 2005
Contact: Andrew West
Subject: Old Persian Word Divider Line Break Class

I was playing around with James Kass's Old Persian test page at http://home.att.net/~jameskass/opctest.htm  (try viewing using latest version of Code2001), and I noticed that my line break and word select algorithms for Old Persian were not working as U+103D0 [OLD PERSIAN WORD DIVIDER] is being treated as a non-breaking character, following the line break class of AL assigned to it in LineBreak.txt. I would have thought that it would have a line break class of BA (cf. UGARITIC WORD DIVIDER which has a line break class of BA). Is this an oversight, or is there a rationale for assigning U+103D0 a line break class of AL ?
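
Editorial sketch (not part of the submitted feedback): the effect described above can be illustrated with a toy model in Python. The two class assignments are the ones reported in the message. Treating every other code point as AL, the single pair rule used, and the names `line_break_class` and `break_opportunities` are assumptions made for this sketch. It is not an implementation of UAX #14.

```python
from typing import Dict, List, Optional

# Classes for the two word dividers named in the message. Every other code
# point defaults to AL here, which is an assumption made for this sketch.
REPORTED_CLASSES: Dict[int, str] = {
    0x103D0: "AL",  # OLD PERSIAN WORD DIVIDER, as reported from LineBreak.txt
    0x1039F: "BA",  # UGARITIC WORD DIVIDER
}


def line_break_class(cp: int, overrides: Optional[Dict[int, str]] = None) -> str:
    """Look up a class, letting the caller try out a different assignment."""
    if overrides and cp in overrides:
        return overrides[cp]
    return REPORTED_CLASSES.get(cp, "AL")


def break_opportunities(text: str, overrides: Optional[Dict[int, str]] = None) -> List[int]:
    """Return each index i where a break is allowed between text[i-1] and text[i]."""
    result = []
    for i in range(1, len(text)):
        before = line_break_class(ord(text[i - 1]), overrides)
        after = line_break_class(ord(text[i]), overrides)
        # Simplified pair rule: a break is allowed only after a BA character
        # that is not itself followed by BA. AL next to AL never breaks.
        if before == "BA" and after != "BA":
            result.append(i)
    return result


# Two Old Persian letters with the word divider between them.
sample = "\U000103A0\U000103D0\U000103A1"

print(break_opportunities(sample))                   # divider treated as AL
print(break_opportunities(sample, {0x103D0: "BA"}))  # divider treated as BA
```

With the reported class AL the sample has no break opportunity. With BA there is one directly after the word divider, which is the behavior the message expected.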

Date/Time: Thu Apr 7 09:10:26 CST 2005
Contact: Kent Karlsson
Subject: Linebreak+Thai/Lao and Linebreak+Bidi

Linebreak properties for two Thai/Lao dandas:
=============================================

0E2F;SA # THAI CHARACTER PAIYANNOI
0EAF;SA # LAO ELLIPSIS

PAIYANNOI is a danda (full stop-like), and even used in abbreviations in the same way as full stop is (I'm not sure if other dandas are used for abbreviations, but the Khmer danda (KHAN) apparently is). Both PAIYANNOI and LAO ELLIPSIS are apparently used as ellipses.

All other dandas have the BA linebreak property. So should PAIYANNOI and LAO ELLIPSIS.

Linebreak and bidi props:
=========================

Linebreak treats (by default) almost all C0 control characters as if they were "combining marks". For some there are exceptions. However, some do not get excepted from that, even though that would be expected:

000B;CM # <control>
001C;CM # <control>
001D;CM # <control>
001E;CM # <control>
001F;CM # <control>

U+000B, line tabulation (VT), should have the BK line break property, just as Form feed.

U+001C, U+001D, and U+001E should (by default) have the BK property, since bidi (by default) considers these to be paragraph boundary (B) characters.

U+001F should (by default) have the BA property, since bidi (by default) considers this to be a segment boundary character (like HT/tab).

Alternatively, the (default) bidi property for the four latter (the ISn) should be BN, keeping their CM linebreak property. Line tabulation should still get the linebreak property BK.

(Side remark: I did encounter a range of printers (IBM SureMark) that uses GS (IS3, U+001D) as an additional esc-sequence introducer (they also use ESC for other printer commands). So I would not mind if U+001C-U+001F all got default bidi property BN instead of B or S.)

One would also expect
0089;<control>;Cc;0;BN;;;;;N;CHARACTER TABULATION WITH JUSTIFICATION;;;;
to have the (default) bidi property S, and the line break property BA, but at least there is no apparent conflict (and maybe nobody cares about U+0089 anymore).

One would also expect
0082;<control>;Cc;0;BN;;;;;N;BREAK PERMITTED HERE;;;;
to have the line break property ZW (instead of CM), just as ZWSP, since U+0082 can be used for ZWSP in some legacy character encodings.

Likewise, these two
001A;<control>;Cc;0;BN;;;;;N;SUBSTITUTE;;;;
FFFD;REPLACEMENT CHARACTER;So;0;ON;;;;;N;;;;;
should have more similar properties, both for bidi (ON) and for linebreak (AI), since U+001A can be used as replacement character for many legacy character encodings.
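
Editorial sketch (not part of the submitted feedback): the bidi classes discussed above can be inspected with Python's standard `unicodedata` module. Note the assumption: the module reports the Unicode Character Database version bundled with the interpreter, which is later than the 4.1.0 data under review, so it only shows what the classes are in that later version. The standard library does not expose the line break property, so only the bidi side is shown. The helper name `bidi_classes` is hypothetical.

```python
import unicodedata
from typing import Dict, Iterable

# Code points mentioned in the message, plus HT (U+0009) for comparison.
CODE_POINTS = [
    0x0009, 0x000B, 0x001A, 0x001C, 0x001D, 0x001E, 0x001F,
    0x0082, 0x0089, 0xFFFD,
]


def bidi_classes(code_points: Iterable[int]) -> Dict[int, str]:
    """Map each code point to its Bidi_Class short name (e.g. 'B', 'S', 'BN')."""
    return {cp: unicodedata.bidirectional(chr(cp)) for cp in code_points}


print("UCD version in this interpreter:", unicodedata.unidata_version)
for cp, cls in bidi_classes(CODE_POINTS).items():
    print(f"U+{cp:04X}  {cls}")
```

The listing makes the mismatch in the message easy to see: U+001C through U+001E report the paragraph separator class B and U+001F reports the segment separator class S, while U+001A reports BN and U+FFFD reports ON.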

Date/Time: Thu Apr 7 09:46:41 CST 2005
Contact: Kent Karlsson

Bidi and paragraph leading spaces
=================================

Section separators (anywhere on the line) and whitespace at the end of a line are reset to the paragraph embedding level. However, leading whitespace characters still may end up at a "visually non-leading" position. For example, (with the same convention as used in UAX 9) for right to left paragraphs, "...CAR" comes out as "RAC..." (where . is space), while "...car" comes out as "...car". If the spaces are used as (admittedly simplistic, but used in plain text) paragraph start indentation, the latter is then wrong (for right to left paragraphs), and should be "car...".

Suggested new (or replacement) text is marked with >> <<:

-------------suggested new text in L1--------------
3. any sequence of whitespace characters preceding >>or succeeding<< a segment separator or paragraph separator, and

4. any sequence of >>whitespace<< characters >>at the beginning of the line and << at the end of the line.

[...] this means that >>leading white space will appear at the visual beginning of the line or segment and<< trailing white space will appear at the visual end of the line >>or segment<< (in the paragraph direction). Tabulation will always have a consistent direction within a paragraph.
---------------------------------------------------

Bidi and SHY shaping
====================

Shaping is said to logically occur after all the steps of the bidi algorithm. While that has problems in general, in particular in relation to line breaking, I would just like to point out that SHY should be "shaped" after line breaking but before rearrangement (L2) (so that the determination of whether it is at a line end is not messed up). Thus the input "CAR RUN bil<shy>buren TURN" may be displayed

    bil- NUR RAC
      NRUT buren

Note also that the SHY must NOT be removed at step X9, even though it is (currently) BN, since it may actually be visible ("shaped" to one or other actual hyphen later on).
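
Editorial sketch (not part of the submitted feedback): the ordering of steps proposed above (break the line at SHY, shape SHY to a visible hyphen, then rearrange) can be illustrated with a toy model in Python. It follows the UAX 9 example convention that uppercase stands for right-to-left text and lowercase for left-to-right text, and it assumes a right-to-left paragraph. Everything else is an assumption made for this sketch: words are split on single spaces, the shaped hyphen is treated as part of the preceding word, and the names `break_and_shape` and `display_rtl_line` are hypothetical. It is not an implementation of the bidi algorithm.

```python
from typing import List, Tuple

SHY = "\u00AD"  # SOFT HYPHEN


def break_and_shape(logical: str) -> List[str]:
    """Break once at the first SHY and shape it to a visible hyphen."""
    if SHY not in logical:
        return [logical]
    head, tail = logical.split(SHY, 1)
    return [head + "-", tail]


def display_rtl_line(logical_line: str) -> str:
    """Reorder one line for display in a right-to-left paragraph (toy model)."""
    # Group consecutive words that share a direction into runs.
    runs: List[Tuple[bool, List[str]]] = []
    for word in logical_line.split(" "):
        is_rtl = word.isupper()
        if runs and runs[-1][0] == is_rtl:
            runs[-1][1].append(word)
        else:
            runs.append((is_rtl, [word]))
    # Runs are laid out right to left, so reading left to right we see the
    # last run first. RTL runs are also reversed internally; LTR runs are not.
    shown: List[str] = []
    for is_rtl, words in reversed(runs):
        if is_rtl:
            shown.extend(w[::-1] for w in reversed(words))
        else:
            shown.extend(words)
    return " ".join(shown)


text = "CAR RUN bil" + SHY + "buren TURN"
for line in break_and_shape(text):
    print(display_rtl_line(line))
```

Because the hyphen is attached before rearrangement, it stays next to "bil" on the first line, matching the display given in the message.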

Date/Time: Wed Apr 20 09:49:20 CST 2005
Contact: Kavi Arasan
Report Type: Error Report
Subject: Tamil Range (0B80-0BFF)

As per the document, the Tamil character names used in Unicode were based on ISCII 1988.

Some of the basic names grossly misrepresent the actual names as per the established Tamil grammar.

You may be aware that the Tamil language has a history of more than 2000 years and that our ancient grammar dates back to B.C.

Also, Tamil has been declared a classical language by the Government of India.

Having described the history, I request that you kindly inform us of the ways to correct this misrepresentation.

The following are the misrepresented character names:

0B9C - Tamil Letter JA, it is Tamil Grantham Letter JA
0BB6 - Tamil Letter SHA, it is Tamil Grantham Letter SHA
0BB7 - Tamil Letter SSA, it is Tamil Grantham Letter SSA
0BB8 - Tamil Letter SA, it is Tamil Grantham Letter SA
0BB9 - Tamil Letter HA, it is Tamil Grantham Letter HA

With Unicode becoming a more and more widely accepted standard, other organisations are now claiming to change the nature of the language based on Unicode.

I request that you help us fix this error by suggesting ways to get it corrected.

Thanks and Regards

Kavi arasan