L2/09-008
From: Bidyut Baran Chaudhuri
Date: 01/07/2009
To: Lisa Moore
Cc: Dwijesh Dutta Majumder, DR.PABITRA SARKAR, Anupam Basu, TAMAL SEN
Subject: Bengali

Dear Lisa,

I am interested in the Unicode encoding of Bengali (an Indo-Bangladeshi script), which occupies 0980-09FF of plane zero. In our view, there are several lacunae in the current coding scheme. The following is a brief statement, which I request you to put on the UTC agenda.

(A) The most important drawback of the current Bangla Unicode is the scheme for representing the 'compound characters', which phonetically represent combinations of two, three or four consonants. The present scheme renders them through the 'hasant' (09CD) code point. Thus 'hasant' serves two purposes: one, to appear as 'hasant' at the end of a word or after a consonant; two, to act as a joiner that forms 'compound characters'. However, to render 'hasant' after a consonant, a 'Zero Width Non Joiner' [ZWNJ] code must accompany the 'hasant' code. Even then, there is a problem in generating the two types of compounding of R (09B0) and Ya (09AF). To distinguish the two, another code named 'Zero Width Joiner' [ZWJ] is employed. The whole process is cumbersome and unnatural, and it takes more bits to represent Bangla text. Also, people working on NLP and OCR of Bangla will find it inconvenient to write algorithms with such a scheme.

Instead we propose the following modifications:

1. The code for 'hasant' (09CD) should be used only to render hasant after a consonant.
2. The 'Zero Width Joiner' [ZWJ] code should be used only to construct the 'compound characters' (except 'Ya phala', which is to be rendered as follows).
3. Introduce a code for 'Ya phala' which, when used after the code for a basic character, will render 'Ya phala' after that character.

This scheme will take care of all 'compound character' formation in a more straightforward and compact way. If you are interested, I can give a more elaborate explanation.
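For concreteness, the current scheme described in (A) can be illustrated with a short Python sketch. It only lists code point sequences and their sizes; it does not render anything. The choice of Ka (0995) and Ta (09A4) as sample consonants, the helper name `describe`, and the dictionary labels are assumptions made for this illustration. The placement of ZWJ before the hasant in the R plus 'Ya phala' case follows the Unicode Standard's Bengali guidance as we understand it.

```python
# Code points named in the letter, plus two sample consonants (assumed for illustration).
HASANT = "\u09CD"  # BENGALI SIGN VIRAMA ('hasant')
ZWNJ = "\u200C"    # ZERO WIDTH NON-JOINER
ZWJ = "\u200D"     # ZERO WIDTH JOINER
KA = "\u0995"      # sample consonant (assumption)
TA = "\u09A4"      # sample consonant (assumption)
RA = "\u09B0"      # R, as cited in the letter
YA = "\u09AF"      # Ya, as cited in the letter

# Sequences under the current scheme.
sequences = {
    "compound character Ka+Ta": KA + HASANT + TA,
    "visible hasant on Ka, then Ta": KA + HASANT + ZWNJ + TA,
    "R + Ya, first compounding (reph)": RA + HASANT + YA,
    "R + Ya, second compounding (Ya phala)": RA + ZWJ + HASANT + YA,
}


def describe(text):
    """Return the code points of text in U+XXXX notation."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


for label, seq in sequences.items():
    size = len(seq.encode("utf-8"))
    print(f"{label}: {describe(seq)} ({len(seq)} code points, {size} UTF-8 bytes)")
```

Each code point involved takes three bytes in UTF-8, so every ZWNJ or ZWJ adds three bytes to the sequence it disambiguates, which is the overhead referred to above.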

(B) We do not understand why the code point 09D7 is assigned to a shape that is neither a character nor a modifier form of one (e.g. a vowel sign). The shape shown in your table is the right part of the vowel sign 'Ou kar' (09CC). But this right part alone has no utility (not even historically). I assume it has been introduced under the influence of Devanagari 'O-kar' (094B), which is visually identical.

(C) Many code points are unnecessarily filled by old (historical) characters or modifier signs. In future, can we not place the code points for such shapes in the private use plane of Unicode, which is so empty?

(D) There is no explicit 'full stop' code point for Bangla. The points reserved for the Devanagari 'single' and 'double' full stop (Danda) are to be used for Bangla also. Why have separate codes for the single and double full stop? There is no harm if the single danda code is used twice to depict the double stop (danda), and we can save one code point.

Such issues may be discussed and resolved in your February meeting. I shall be glad to hear from your experts also.

Regards,
Bidyut Baran Chaudhuri
Vice President, Society for Natural Language Technology Research, Kolkata