From rm@iist.unu.edu Mon May 4 01:01:31 1998 Date: Mon, 4 May 1998 15:45:02 HKT To: mike_ksar@hp.com, ny@csms.edu.mn, corff@zedat.fu-berlin.de, qj@nmg2.imu.edu.cn, everson@indigo.ie, jenkins@apple.com, kenw, becker.osbu_north@xerox.com, me@iist.unu.edu Subject: feedback on Ken Whistler's comments on Mongolian Mime-Version: 1.0 Dear All, Below are comments, agreed between Mongolia and UNU/IIST, on the questions Ken Whistler raises in his document (WG2 N 1734) regarding the Mongolian encoding proposal N 1711. We apologise that they were not ready in time to provide feedback into the Chinese response which Mike distributed last week. Please note that these comments relate only to document N 1734. We do not specifically comment on N 1711 itself because we have not finished studying it completely yet. We will send our comments on that at a later date. With best regards, richard ***************************************************************** Feedback on Ken Whistler's Comments on Mongolian Encoding: N 1734 Mongolia + UNU/IIST ***************************************************************** 1. Mongolian Space ------------------- In Mongolian script, words are often written with case endings separated from the main stem of the word. Further, one stem may have several case endings following it, in which case each separate case ending is written separated from the others. Thus, the form of a single "word" can be: stem caseA caseB caseC where the spaces actually appear as white spaces when the word is printed or displayed on a computer screen. Traditionally, this space separating case endings is called the Mongolian space, and it differs from the normal space mainly in that the letters immediately preceding the Mongolian space are final form variants whereas the letters immediately following it are middle form variants. In addition, the Mongolian space is generally smaller than the normal space (typically one third of the size) and a line of text should not be broken at a Mongolian space. Many arguments have been put forward relating to the necessity or otherwise of including Mongolian space as a separate character, on the one hand claiming that it is fundamentally different from the existing character NBS (no-break-space) and on the other hand claiming exactly the opposite. However, we do not feel that any of these arguments is particularly convincing one way or the other. We do tend to agree that much of the functionality of the Mongolian space is either already present in NBS, or could be specifically incorporated into a "Mongolian interpretation of NBS" as Ken Whistler seems to suggest. However, we can envisage two scenarios in which the NBS might be used in Mongolian which would distinguish it from the Mongolian space. First, the Mongolian language contains a very large number of "composite words", where a series of words taken together represents a single concept, and the NBS could be used to logically "join" these composite words into a single unit, for example for electronic analysis or searching of documents. In such a use, the space between the elements of a composite word would not only be a normal sized space but it would also have a semantically different meaning from the space linking case endings. Second, the NBS could be used, e.g. in educational texts, as a separator to show how a word is constructed from syllables or to show how a derivative word is constructed from its components. Admittedly this could also be done using the format control characters and the variant selectors, but these would be much less efficient in this case. In view of these scenarios, which would be impossible if the Mongolian space were unified with the NBS, we recommend the retention of the Mongolian space as a separate entity from the NBS. 2. Mongolian Combination Symbol ------------------------------- We agree that this character should be retained. We do not care what it is called! We are happy for it to be included in the General Punctuation block instead of in the Mongolian section. 3. Mongolian Positional Format Control Characters ------------------------------------------------- We accept that the different positional variant forms could be indicated using the existing zero-width joiner and non-joiner characters instead of using specific positional form selectors as proposed in N1711 (and previous proposals). However, the system based on the joiner and non-joiner requires not only more complicated input and output algorithms than that using the positional form selectors, but also on average signicantly longer code strings to generate the equivalent sequence of actual characters. A comparison between the two coding schemes, based on Ken Whistler's table, is given in the following table supplied by Mongolia: //********************************************* DISPLAY STORE store (according N1711) _O_ _B_ _B ISF_ _I_ _B J_ _B INF_ _F_ _J B_ _B FIF_ _M_ _J B J_ _B MEF_ The same number of codes is used in two columns. _iO_ _b J NJ B_ _b B ISF_ _iI_ _b J NJ B J_ _b B INF_ _iF_ _b B_ _b B _ _iM_ _b B J_ _b B MEF_ In Mongolian Script, there is no difference for 'i' between 'iO' and 'iF', but they have to insert J in the iO string to distingush it from oO. Therefore the difference of numbers of codes, in the two proposals, is -3. _oO_ _b NJ B_ _b ISF B ISF_ _oI_ _b NJ B J_ _b ISF B INF_ _oF_ _b NJ J B_ _b ISF B _ _oM_ _b NJ J B J_ _b ISF B MEF_ The difference of numbers of codes is -1. _Of_ _B NJ J b_ _B ISF b_ _If_ _B b_ _B b_ _Ff_ _J B NJ J b_ _B FIF b_ _Mf_ _J B b_ _B MEF b_ There is also no difference for 'f' between 'Of' and 'If', so the difference of numbers of codes is -3. _Oo_ _B NJ b_ _B ISF b ISF_ _Io_ _B J NJ b_ _B b ISF_ _Fo_ _J B NJ b_ _B FIF b ISF_ _Mo_ _J B J NJ b_ _B MEF b ISF_ The difference of numbers is -1. _iOf_ _b J NJ B NJ J b_ _b B ISF b_ _iIf_ _b J NJ B b_ _b B INF b_ _iFf_ _b B NJ J b_ _b B FIF b_ _iMf_ _b B b_ _b B b_ The difference is -5. _oOf_ _b NJ B NJ J b_ _b ISF B ISF b_ _oIf_ _b NJ J B b_ _b ISF B INF b_ _oFf_ _b NJ J B NJ J b_ _b ISF B FIF b_ _oMf_ _b NJ J B b_ _b ISF B b_ The difference is -4. _iOo_ _b J NJ B NJ b_ _b B ISF b ISF_ _iIo_ _b J NJ B J NJ b_ _b B INF b ISF_ _iFo_ _b B NJ b_ _b B FIF b ISF_ _iMo_ _b B J NJ b_ _b B b ISF_ The difference is -3. _oOo_ _b NJ B NJ b_ _b ISF B ISF b ISF_ _oIo_ _b NJ B J NJ b_ _b ISF B INF b ISF_ _oFo_ _b NJ J B NJ b_ _b ISF B FIF b ISF_ _oMo_ _b NJ J B J NJ b_ _b ISF B b ISF_ The difference is -1. The total difference is - 17 codes in this part, for example. //********************************************* This latter point implies that documents would require significantly greater storage space and would take significantly longer to transmit electronically. This is of particular concern to Mongolia because the level of computing and communications technology available to normal users is relatively low. In view of this, we would prefer to retain the positional format control characters despite the fact that they provide functionality which can be mimicked by the joiner and non-joiner because we feel that they provide this functionality in a much more efficient and logical way. We would further suggest that, since it is likely that a number of Arabic speaking countries suffer the same lack of state-of-the-art technology as Mongolia, these positional format control characters would additionally offer a more efficient and logical alternative for coding variant forms in Arabic which could similarly benefit these countries. With regards to the Positional Indicator Character (xx1C in document N1691): In document N1691 (and various predecessors) this character, or ones like it, were included in the proposals as a suggested means of generating positional forms (isolated, initial, medial, final) of characters. But as we have pointed out a number of times, beginning with document N1497 which we submitted to and which was discussed at the Sigapore WG2 meeting in January 1997, the use of this (and similar) character(s) in these proposals is logically flawed because strings containing it are ambiguous. More specifically, in N1691 it is stated that: (PIC)X means X is final form X(PIC) means X is initial form (PIC)X(PIC) means X is middle form With this scheme, the string AB(PIC)C(PIC) has two possible interpretations: 1) B and C are both initial forms 2) C is middle form and there is no way of distinguishing these alternatives. This character thus appears to serve no useful purpose (its intended functionality now being provided correctly by the positional format control characters and/or by the joiner/non-joiner) and is logically unsound. We therefore repeat our recommendation that it should be removed. 4. Mongolian Free Variant Selector Characters ---------------------------------------------- Since the maximum number of possible variants of any single positional form appears to be four, three free variant selectors are both necessary and sufficient. We have no preference regarding whether they are considered as Mongolian "characters" or as something more general. 5. Mongolian Vowel Separator ---------------------------- The proposal to use the non-joiner in place of the proposed Mongolian vowel separator, as in the example some letters + ML.NA + NJ + ML.A + FVS2 does not work if the non-joiner is also used to distinguish positional form: the above string would give the final form of ML.NA but the second variant **isolated** form of ML.A. (No, there isn't one! We assume that in this case you'd just get the default variant of the isolated form.) However, the Mongolian Vowel Separator is in any case entirely redundant -- the separated final forms of the ML.A and ML.E characters are available in the character set as variants, so the required string can be generated using only the positional format characters and the variant selectors (We guess this is what Ken meant, but he just got the details slightly mixed). Actually, one could perhaps go further. The letter preceding the separated vowel form is always final form or middle form, and this form is determined by the actual letter (i.e. it is not a matter of choice). So this could perhaps be incorporated into the rules for calculating the default form of a character: e.g. a letter defaults to final form if 1) it is followed by a separator or 2) it is followed by a separated vowel and is one of some particular set of letters (i.e. the ones which are final form not middle form before a separated final vowel) or .... Further, we believe that the separated form is actually the most commonly used final form variant, in which case this should perhaps be the default final form, thereby removing the necessity to use the variant selector FVS2 to obtain the separated form. 6. Mongolian Todo Soft Hyphen ------------------------------ We are not sufficiently familiar with the Todo script to offer any comments on this issue.